Overview and Importance of Data Quality
for Machine Learning Tasks
KDD Tutorial / © 2020 IBM Corporation
Nitin Gupta, Shashank Mujumdar, Shazia Afzal, Hima Patel
IBM Research India
Acknowledgements
Abhinav Jain, Ruhi Sharma, Sameep Mehta, Lokesh N, Shanmukh Guttula, Vitobha Munigala
Agenda
Why is data quality measurement important for Machine Learning Applications?
Part I : Discussion on quality metrics for structured/tabular data
Part II : Discussion on quality metrics for unstructured text data
Part III : Discussion on Human-in-the-Loop for Data Quality Measurements
Conclusion
Need for Data Quality for Machine
Learning
Data Preparation in Machine Learning
“Data collection and preparation are typically
the most time-consuming activities in developing
an AI-based application, much more so than
selecting and tuning a model.” – MIT Sloan Survey
https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-
intelligence/
“Data preparation accounts for about 80% of the work of data
scientists” – Forbes
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-
most-time-consuming-least-enjoyable-data-science-task-survey-
says/#70d9599b6f63
Challenges with Data Preparation
Iterative debugging, which is cumbersome and time-consuming
Data Quality Analysis can help..
▪ Know the issues in the data beforehand, e.g., noise in labels, overlap between classes
▪ Make informed choices for data preprocessing and model selection
▪ Reduce turnaround time for data science projects
Different personas in an enterprise setting
To put it all together
Data Assessment and Readiness Module
▪ Need for algorithms that can assess training datasets
▪ Across modalities: structured, unstructured, time series, etc.
▪ Allow for complex interactions between the different personas and human-in-the-loop techniques
▪ Need for automation
To summarize:
“Garbage in, garbage out” – George Fuechsel, IBM 305 RAMAC technician
There has been a lot of progress in the last several years on improving ML
algorithms, including building automated machine learning (AutoML) toolkits.
However, the quality of an ML model is directly proportional to the quality
of its data.
Hence, there is a need for a systematic study of measuring the quality of
data with respect to machine learning tasks.
Data Quality Metrics for
Structured Data
Data Quality Metrics
▪ We will cover the following topics:
▪ Data Cleaning
▪ Class Imbalance
▪ Label Noise
▪ Data Valuation
▪ Data Homogeneity
▪ Data Transformations
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
Data Cleaning
▪ What are some common data cleaning techniques used for machine learning?
▪ Do data cleaning techniques always help in building a machine learning pipeline?
▪ Joint cleaning and model building techniques
Common Data Cleaning Techniques
Data Cleaning Issues
Quantitative:
- Missing Values
- Outliers
- Noise in the data
- Noise in labels
- Data Linters
- …
Qualitative: Integrity Constraints based errors (not covered)
- Functional Dependency Constraints
- Denial Constraints
- Others…
Is data cleaning always helpful for ML pipeline?
▪ Question: what is the impact of different cleaning methods on
ML models?
▪ Experiment setup:
▪ Datasets: 13 real world datasets
▪ Five error types: outliers, duplicates, inconsistencies,
mislabels and missing values
▪ Seven classification algorithms: Logistic Regression,
Decision Tree, Random Forest, Adaboost, XGBoost, k-
Nearest Neighbors, and Naïve Bayes
[LXJ19]
Cleaning methods used for error types
Image Source: Li et al, 2019. CleanML: A Benchmark for Joint Data
Cleaning and Machine Learning [Experiments and Analysis]
Insights: Impact of different cleaning techniques
[LXJ19]
Error type | Impact, if cleaned* | Negative impacts*
Inconsistencies | Insignificant | No
Outliers | Insignificant / Positive | Possible
Duplicates | Insignificant | Yes
Missing Values | Positive | If imputation is far from ground truth
Mislabels | Positive | If dataset accuracy < 50%
*based on general patterns across the datasets and different train/test settings
In conclusion
▪ Data cleaning does not necessarily improve the quality of downstream ML models
▪ Impact depends on different factors:
▪ Cleaning algorithm and its set of parameters
▪ ML model dependent
▪ Order of cleaning operators
▪ Combining model selection with the cleaning algorithm can increase robustness of the impact – there is no one-size-fits-all solution!
[LXJ19]
[LAU19]
Class Imbalance
Unequal distribution of classes within a dataset
Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
Why does it happen?
Expected in domains where data with one pattern is more common than others.
Source: Krawczyk et al, 2016. Learning from imbalanced data: open challenges and future direction
[Kra16]
Why is Imbalanced Classification Hard?
▪ Classifiers assume the data to be balanced
▪ Unequal cost of misclassification errors: false negatives are more important than false positives
▪ The minority class is more important for data mining but is given less priority by the learning algorithm
▪ Minority class instances can be detected as noise
Evaluation Metrics for Imbalanced Datasets
Accuracy Paradox
With a 90% majority class and a 10% minority class, a learning algorithm that always predicts the majority class achieves 90% accuracy. Reasonable?
F1 Score, Recall, AUC PR, AUC ROC, G-Mean….
Source: https://en.wikipedia.org/wiki/Accuracy_paradox
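The accuracy paradox is easy to reproduce. A minimal plain-Python sketch (helper names are my own) comparing accuracy with minority-class F1 for an "always predict the majority" classifier:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_minority(y_true, y_pred, minority=1):
    """F1 score of the minority class (harmonic mean of precision and recall)."""
    tp = sum(t == minority and p == minority for t, p in zip(y_true, y_pred))
    fp = sum(t != minority and p == minority for t, p in zip(y_true, y_pred))
    fn = sum(t == minority and p != minority for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 90% majority (class 0), 10% minority (class 1)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100            # classifier that always predicts the majority

acc = accuracy(y_true, y_pred)    # 0.9: looks reasonable
f1 = f1_minority(y_true, y_pred)  # 0.0: exposes the paradox
```

The same contrast is why F1, recall, AUC-PR, AUC-ROC, or G-mean are preferred over plain accuracy on imbalanced data.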
Factors affecting class imbalance
▪ Imbalance Ratio
▪ Overlap between classes
▪ Smaller sub-concepts
▪ Dataset Size
▪ Label Noise
▪ Combination
▪ …
Affecting Factor: Imbalance Ratio
As the imbalance ratio increases:
▪ The learning algorithm becomes biased towards the majority class
▪ Minority samples are more likely to be considered noise
▪ In the confusion region, priority is given to the majority class
Is the imbalance ratio the only cause of performance degradation when learning from imbalanced data? NO
Source: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8
[WP03]
[GBSW04]
Affecting Factor: Overlap
Priority is given to the majority class in the overlapping region.
Minority data points are considered noise due to the smaller number of
minority-class points in the overlapping region.
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Smaller sub-concepts
It is difficult to discern separate groupings
with so few examples in minority class
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Dataset Size
As the dataset size decreases:
▪ Evaluation measures deteriorate
▪ The underlying structure of the minority class distribution becomes difficult to learn
Source: Ali et al, 2015. Classification with class imbalance problem: A Review
[ALI15]
Affecting Factor: Combined Effect
Combined effect of varying dataset
size, overlap, and imbalance ratio
Source: Denil et al, 2010. Overlap versus Imbalance
[DT10]
Modelling Strategies: Types
Our Focus
Source: Branco et al, 2015. A Survey of Predictive Modelling under Imbalanced Distributions
[BTR15]
Resampling Techniques
• Oversampling
• Undersampling
• Ensemble Based Techniques
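As a concrete illustration of the simplest of these strategies, here is a minimal random-oversampling sketch in plain Python (the function name is my own, not from any library):

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate random minority samples until every class matches the majority size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(v) for v in by_class.values())
    Xr, yr = [], []
    for label, samples in by_class.items():
        # keep the originals, then pad with random duplicates
        Xr.extend(samples)
        yr.extend([label] * len(samples))
        extra = target - len(samples)
        Xr.extend(rng.choice(samples) for _ in range(extra))
        yr.extend([label] * extra)
    return Xr, yr

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2          # 8:2 imbalance
Xr, yr = random_oversample(X, y)
```

Undersampling is the mirror image (drop majority samples instead); production code would typically use a library such as imbalanced-learn rather than this sketch.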
Do imbalance recovery methods always help?
OR
Is the impact of an imbalance recovery method the same on all datasets?
Bayes Impact index
▪ Separable: No impact
▪ Totally overlapped: No impact
▪ Partially overlapped: High impact
Source: Lu et al, 2019. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem
[LCT19]
Label Noise
Given label – Iris-setosa
Correct label – Iris-virginica (based on attribute analysis)
▪ Most large generated or annotated datasets have some noisy labels.
▪ In this metric we discuss how one can identify these label errors and
correct them to model the data better.
There are at least 100,000 label issues in ImageNet!
Source: https://l7.curtisnorthcutt.com/confident-learning
Effects of Label Noise
Possible Sources of Label Noise:
▪ Insufficient information provided to the labeler
▪ Errors in the labelling itself
▪ Subjectivity of the labelling task
▪ Communication/encoding problems
Label noise can have several effects:
▪ Decrease in classification performance
▪ Pose a threat to tasks like feature selection
▪ In online settings, new labelled data may
contradict the original labelled data
Label Noise Techniques
Algorithm Level Approaches:
▪ Designing robust algorithms that are insensitive to noise
▪ Not directly extensible to other learning algorithms
▪ Requires changing an existing method, which is neither always possible nor easy to develop
Data Level Approaches:
▪ Filtering out noise before passing data to the underlying ML task
▪ Independent of the classification algorithm
▪ Helps improve classification accuracy and reduce model complexity
Label Noise Techniques
Algorithm Level Approaches:
▪ Learning with Noisy Labels (NIPS 2014)
▪ Robust Loss Functions under Label Noise for Deep Neural Networks (AAAI 2017)
▪ Probabilistic End-To-End Noise Correction for Learning With Noisy Labels (CVPR 2019)
▪ Can Gradient Clipping Mitigate Label Noise? (ICLR 2020)
Data Level Approaches:
▪ Identifying mislabelled training data (Journal of Artificial Intelligence Research, 1999)
▪ On the labeling correctness in computer vision datasets (IAL 2018)
▪ Finding label noise examples in large scale datasets (SMC 2017)
▪ Confident Learning: Estimating Uncertainty in Dataset Labels (arXiv 2019)
Filter Based Approaches
▪ On the labeling correctness in
computer vision datasets
[ARK18]
▪ Published in Interactive
Adaptive Learning 2018
▪ Train CNN ensembles to
detect incorrect labels
▪ Voting schemes used:
▪ Majority Voting
▪ Identifying mislabelled training
data [BF99]
▪ Published in Journal of
artificial intelligence research,
1999
▪ Train filters on parts of the
training data in order to
identify mislabeled examples
in the remaining training data.
▪ Voting schemes used:
▪ Majority Voting
▪ Consensus Voting
▪ Finding label noise examples in large
scale datasets [EGH17]
▪ Published in International
Conference on Systems, Man, and
Cybernetics, 2017
▪ Two-level approach
▪ Level 1 – Train SVM on the
unfiltered data. All support
vectors can be potential
candidates for noisy points.
▪ Level 2 – Train classifier on the
original data without the support
vectors. Samples for which the
original and the predicted label
does not match, are marked as
label noise.
Source: Brodley et al, 1999. Identifying mislabeled training data
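The cross-validated majority-vote filter idea can be sketched in a few lines. This is a toy illustration on 1-D data with hand-rolled weak learners (k-NN and nearest-centroid); all names and data are hypothetical, not from the papers:

```python
def nn_predict(train, x, k):
    """k-NN on 1-D points: majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [lab for _, lab in neighbors]
    return max(set(labels), key=labels.count)

def centroid_predict(train, x):
    """Nearest class-centroid classifier."""
    groups = {}
    for v, lab in train:
        groups.setdefault(lab, []).append(v)
    cents = {lab: sum(vs) / len(vs) for lab, vs in groups.items()}
    return min(cents, key=lambda lab: abs(cents[lab] - x))

def majority_vote_filter(data, n_folds=3):
    """Flag examples whose label disagrees with the majority of filters
    trained on the other folds (Brodley & Friedl style)."""
    flagged = []
    for i, (x, y) in enumerate(data):
        fold = i % n_folds
        train = [d for j, d in enumerate(data) if j % n_folds != fold]
        votes = [nn_predict(train, x, 1), nn_predict(train, x, 3),
                 centroid_predict(train, x)]
        if sum(v != y for v in votes) > len(votes) / 2:
            flagged.append(i)
    return flagged

# class 0 clustered near 0, class 1 near 10; index 3 is mislabeled
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 1),
        (9.8, 1), (9.9, 1), (10.0, 1), (10.1, 1)]
noisy = majority_vote_filter(data)   # → [3]
```

Switching the flagging condition to "all filters disagree" gives the consensus-voting variant, which is more conservative.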
Eliminating Class Noise in
Large Datasets – ICML 2003
Criterion for a rule to be a good rule (GR):
▪ It should have classification precision on its own partition above a threshold
▪ It should have minimum coverage on its own partition
(coverage: the number of examples the rule classifies confidently, whether noise or not)
Source: Zhu et al, 2003. Eliminating Class Noise In Large Datasets
[ZWC03]
Confident Learning: Estimating Uncertainty in Dataset Labels (2019)
The Principles of Confident Learning
▪ Prune to search for label errors
▪ Count to train on clean data
▪ Rank which examples to use during training
In case of a label collision
Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Labels
[NJC19]
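The counting step behind these principles can be sketched directly: per-class thresholds are the mean self-confidence of that class, and each example is counted under its most probable confident class (a label collision is resolved by the highest probability, as in the paper). This is a simplified sketch, not the cleanlab implementation:

```python
def confident_joint(labels, probs):
    """Estimate the confident joint C[given_label][suspected_true_label]
    (Northcutt et al. style), using per-class mean self-confidence thresholds."""
    n_classes = len(probs[0])
    thresholds = []
    for j in range(n_classes):
        ps = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(ps) / len(ps))
    C = [[0] * n_classes for _ in range(n_classes)]
    for p, y in zip(probs, labels):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            # label collision: several classes pass their threshold ->
            # keep the one with the highest predicted probability
            j_star = max(confident, key=lambda j: p[j])
            C[y][j_star] += 1
    return C

labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9],   # third example looks mislabeled
         [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
C = confident_joint(labels, probs)
# off-diagonal entry C[0][1] counts the suspected label error
```

Off-diagonal counts are then pruned (e.g., by lowest self-confidence) to produce the set of likely label errors.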
Data Valuation
The value of a training datum is how much impact it has on the predictor's performance.
A valuation model scores each training datum as high- or low-valued, based on the prediction model's performance on validation data.
Is the impact of all the cat images from the training set the same?
A: Learning Algorithm
V: Validation Set
x: Training Datum
P(x, A, V): Performance of A on V using x
Val(x, A, V): Impact of x on the performance of A on V
If P(x₁, A, V) == P(x₂, A, V) for two training images x₁ and x₂,
is then Val(x₁, A, V) == Val(x₂, A, V)?
Scenarios in which a datum has low value:
▪ Incorrect label
▪ Input is noisy or of low quality
▪ Input is not useful for the target task
Application
▪ Identifying Data Quality
▪ High value data ➔ Significant contribution
▪ Low value data ➔ Noise or outliers or not useful for task
▪ Domain Adaptation
▪ Data Gathering
Leave One Out
A: Learning Algorithm
V: Validation Set
T: Training Set
x: Training Datum
P(x, A, V): Performance of A on V using x
Val(x, A, V): Impact of x on the performance of A on V
Val(x, A, V) = P(T, A, V) − P(T ∖ {x}, A, V)
Value of training datum x = performance of learning algorithm A on validation set V when trained on the full data T including x, minus the performance when trained on T excluding x
[Coo77]
Source: Cook et al, 1977. Detection of influential observation in linear regression
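The leave-one-out definition can be made concrete with a toy "model" that predicts the mean of its training targets (all names and numbers here are hypothetical):

```python
def fit_mean(train):
    """Toy 'model': predict the mean of the training targets."""
    return sum(train) / len(train)

def perf(model, val_target):
    """Performance = negative squared error on one validation target."""
    return -(model - val_target) ** 2

def loo_value(train, i, val_target):
    """Leave-one-out value of point i: P(T, A, V) - P(T without i, A, V)."""
    full = perf(fit_mean(train), val_target)
    without = perf(fit_mean(train[:i] + train[i + 1:]), val_target)
    return full - without

train = [1.0, 1.0, 1.0, 9.0]          # 9.0 is an outlier w.r.t. the target
v_outlier = loo_value(train, 3, 1.0)  # removing it helps: large negative value
v_normal = loo_value(train, 0, 1.0)   # removing a typical point hurts: positive
```

Harmful points get negative leave-one-out values and useful points get positive ones, which is exactly the signal data valuation looks for.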
Example
Val(x, A, V) = P(T, A, V) − P(T ∖ {x}, A, V)
0.032 = 0.80 − 0.768
[Coo77]
Implications
➢ Doesn’t capture complex interactions between subsets of data
When the training set contains several identical copies of a datum, removing any single copy leaves performance unchanged, so leave-one-out assigns every copy a value of zero:
Val(xᵢ, A, V) = 0 for each duplicate xᵢ
Reasonable ?
[Coo77]
Data Shapley
The valuation of a datum should come from the way it coalesces with
other data in the dataset: it is derived from its marginal contribution
to all subsets of the data.
[GZ19]
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
Shapley Values – Formulation
φᵢ = C Σ_{S ⊆ D∖{i}} [v(S ∪ {i}) − v(S)] / (n−1 choose |S|)

▪ φᵢ : Shapley value of datum i
▪ C : constant
▪ v(S ∪ {i}) − v(S) : value of i together with S, minus the value of S alone
▪ The sum iterates over all subsets S of the data that exclude i; the denominator is the binomial coefficient (n−1 choose |S|)
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
Practical Implications of Data Shapley
▪ Computing the Shapley value of a datum requires 2^(|D|−1) summations
▪ In practice, Shapley values are approximated using Monte Carlo simulations
▪ The number of simulations required doesn't scale with the amount of data
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
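The Monte Carlo approximation can be sketched as averaging marginal contributions over random permutations. This is a simplified version of the paper's TMC-Shapley (no truncation), and the toy utility function is my own:

```python
import random

def monte_carlo_shapley(data, val_fn, n_perms=200, seed=0):
    """Average each point's marginal contribution over random permutations."""
    rng = random.Random(seed)
    n = len(data)
    values = [0.0] * n
    for _ in range(n_perms):
        perm = list(range(n))
        rng.shuffle(perm)
        prev = val_fn([])
        subset = []
        for i in perm:
            subset.append(data[i])
            cur = val_fn(subset)
            values[i] += cur - prev   # marginal contribution of point i
            prev = cur
    return [v / n_perms for v in values]

# toy utility: v(S) = 1 if S contains at least one "good" point, else 0
def val_fn(S):
    return 1.0 if any(good for _, good in S) else 0.0

data = [(0, True), (1, True), (2, False)]
phi = monte_carlo_shapley(data, val_fn)
# the two "good" points share the credit; the useless point gets exactly 0
```

By symmetry the two useful points should each approach 0.5, and the efficiency property makes the values sum to v(D) − v(∅) = 1.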
Results using Data Shapley
Data points with higher noise receive lower Shapley values
Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
[GZ19]
DVRL
▪ Data valuation is a function from a datum to its value
▪ Sample data based on the valuation
▪ Build the target model using the selected data
▪ Reward = Performance(selected data) − Performance(complete data)
[YAP19]
Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
DVRL Results
▪ Removing high-valued points affects model performance the most
▪ Removing low-valued points affects model performance the least
Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
[YAP19]
Data Homogeneity
We call data homogeneous if all the entries follow a unique pattern.
Homogeneity:
▪ Syntactic: distance matching, pattern matching
▪ Semantic: embedding matching, context matching
Data Homogeneity
▪ "12th Jan, 2020" everywhere: homogeneous. "12th Jan, 2020" mixed with "12/01/2020": in-homogeneous (two different date formats)
▪ "Account Manager" everywhere: homogeneous. "Account Manager" mixed with "Acc. Manager": in-homogeneous (two different values for the same category)
In-homogeneity affects the ML pipeline
▪ Adult Census dataset downloaded from Kaggle
▪ Task is to predict Income level (>50k/<=50k) given several attributes of a person
Income level (Train): <=50K, <=50K, >50K, <=50K, <=50K, <=50K, >50K, >50K
Income level (Test): <=50K., <=50K., >50K., >50K., <=50K., <=50K., <=50K., >50K.
The test labels carry a trailing period, so a naive encoder sees four distinct classes. 4 output neurons?
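A small normalization pass fixes this particular inhomogeneity. The trailing-period quirk is real in the Adult test split; the helper name below is my own:

```python
def normalize_labels(labels):
    """Canonicalize label strings so train/test variants map to one class."""
    return [lab.strip().rstrip('.').upper() for lab in labels]

train_labels = ['<=50K', '>50K', '<=50K']
test_labels = ['<=50K.', '>50K.', '<=50K.']   # trailing '.' in the test split

classes = sorted(set(normalize_labels(train_labels + test_labels)))
# two classes instead of four
```

The general lesson is that label spaces should be canonicalized before encoding, or train and test will silently disagree.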
Causes of inhomogeneity
▪ When data is gathered by different people
▪ In the absence (or weak presence) of a data collection protocol
▪ When data is merged from different sources
▪ When data is stored in different formats (e.g. .csv, .xlsx), etc.
Homogeneity score
Desiderata for measures that characterize homogeneity:
▪ The fewer the clusters, the better the homogeneity
▪ The more skewed the distribution of clusters, the better the homogeneity
[DKO+07]
Source: Dai et al, 2006. Column heterogeneity as a measure of data quality
Homogeneity score
Entropy
Entropy_P = − Σ_x p(x) ln p(x)
where P is the probability distribution of the entries among clusters
▪ Satisfies both the desiderata
▪ Not bounded (ranges over ℝ₊)
Simpson's biodiversity index
P(C(x_i) ≠ C(x_j))
where x_i and x_j are two random entries and C(·) denotes the cluster index
▪ Satisfies both the desiderata
▪ Bounded in [0, 1]
[sim]
Source : http://www.countrysideinfo.co.uk/simpsons.htm
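Both measures are easy to compute from cluster assignments. A minimal sketch (note the minus sign in the entropy, and the 1 − Σp² form of Simpson's index):

```python
import math
from collections import Counter

def entropy(cluster_ids):
    """Shannon entropy of the cluster distribution (0 = perfectly homogeneous)."""
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in Counter(cluster_ids).values())

def simpson_index(cluster_ids):
    """P(two random entries fall in different clusters), bounded in [0, 1]."""
    n = len(cluster_ids)
    return 1.0 - sum((c / n) ** 2 for c in Counter(cluster_ids).values())

homogeneous = [0, 0, 0, 0, 0, 0]   # every entry in one cluster
mixed = [0, 0, 1, 1, 2, 2]         # entries spread over three clusters
```

Both scores are 0 for the homogeneous column and grow as entries spread over more, and more evenly sized, clusters.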
Syntactic Homogeneity
Distance based similarity
▪ Edit distance: sequence based distance measure
▪ Example: transform x = "dva" into y = "dave" by inserting "a" (after "d") and substituting "a" with "e", so the edit distance (computed with the standard dynamic-programming table) is 2
Jaccard Similarity
▪ Set based distance measure over boundary-padded bigrams:
J(x, y) = |B_x ∩ B_y| / |B_x ∪ B_y|
▪ E.g. x = dave, y = dav
▪ B_x = {#d, da, av, ve, e#}, B_y = {#d, da, av, v#}
▪ J(x, y) = 3/6
Source: Doan et al, 2012. Principles of Data Integration
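Both similarities fit in a few lines of plain Python; the dynamic-programming edit distance and boundary-padded bigram Jaccard reproduce the dave/dav example:

```python
def edit_distance(x, y):
    """Levenshtein distance via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        dp[i][0] = i
    for j in range(len(y) + 1):
        dp[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete
                           dp[i][j - 1] + 1,      # insert
                           dp[i - 1][j - 1] + (x[i - 1] != y[j - 1]))  # substitute
    return dp[-1][-1]

def bigrams(s):
    """Boundary-padded character bigrams: '#d', 'da', ..., 'e#'."""
    s = '#' + s + '#'
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(x, y):
    bx, by = bigrams(x), bigrams(y)
    return len(bx & by) / len(bx | by)

d = edit_distance('dva', 'dave')   # insert 'a', substitute 'a' -> 'e'
j = jaccard('dave', 'dav')         # 3 shared of 6 total bigrams
```

Either measure can feed a clustering step; entries whose pairwise similarity exceeds a threshold land in the same cluster.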
Syntactic Homogeneity
Pattern based similarity (Regex based clustering)
Ram Kumar.
u{1}l{3} s{1} u{1}l{4} .{1}
w s{1} w .{1}
ROOT
Lakshman Kumar.
u{1}l{7} s{1} u{1}l{4} .{1}
w s{1} w .{1}
ROOT
The larger the alphabet set of the regex language, the more clusters are detected
Source: Doan et al, 2012. Principles of Data Integration
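Pattern-based clustering can be sketched by mapping each value to a run-length signature over character classes (one level of the regex hierarchy; the exact run lengths below follow this code's own counting, not the slide's rendering):

```python
import itertools

def signature(s):
    """Map a string to a regex-like pattern: runs of uppercase (u),
    lowercase (l), digits (d), spaces (s), other (.) with run lengths."""
    def cls(ch):
        if ch.isupper(): return 'u'
        if ch.islower(): return 'l'
        if ch.isdigit(): return 'd'
        if ch.isspace(): return 's'
        return '.'
    return ''.join(f'{k}{{{len(list(g))}}}'
                   for k, g in itertools.groupby(s, key=cls))

def pattern_clusters(values):
    """Group values by their pattern signature."""
    clusters = {}
    for v in values:
        clusters.setdefault(signature(v), []).append(v)
    return clusters

sigs = pattern_clusters(['Ram Kumar.', 'Lakshman Kumar.', '12/01/2020'])
```

Dropping the run lengths (keeping only the class sequence, like the "w s{1} w .{1}" level) gives a coarser alphabet that merges the two name patterns into one cluster.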
Semantic Homogeneity
Embeddings based similarity
▪ Applicable when the entries in the data are meaningful English words
▪ Can use off-the-shelf embeddings like word2vec, GloVe, etc.
▪ similarity(x, y) = cosine(φ(x), φ(y)), where φ(·) is the word embedding
▪ Embeddings capture semantic information as well
[MCCD13]
Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
Semantic Homogeneity
Context based similarity
▪ Distributional hypothesis: words that are used and occur in the same context tend to purport similar meanings
▪ In a structured dataset, word ≈ an entry in a row; context ≈ the remaining entries in the row
▪ We may not be able to use off-the-shelf embeddings; we need to train dataset-specific embeddings
▪ Entries that have similar neighborhoods in their rows show higher similarities
▪ Useful in cases where a high sim(employee, manager) is not desirable in a dataset
Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
[MCCD13]
Data Transformation
▪ The goal is to allow the business user to transform provided data (heterogeneous) into user intended
format (homogeneous), by showing a few examples of expected output for the given input samples.
Data Transformation Examples
Data Transformation Using
PbE Systems
Programming-by-Example systems capture user intent using sample input-output example pairs and learn
transformation programs that convert inputs into their corresponding outputs.
User-provided input-output example pairs → Program space expressed using a Domain-specific Language (DSL) → Decode/search for transformation programs → Transformation program
Fig 2. Generic process across all Programming-by-Example frameworks
Domain-specific Language
DSL of a PbE system consists of a finite set of atomic functions/operators that is used to solve domain-specific
tasks and formally represents the transformation program for the user to interpret.
Source: Bunel et al, 2018. Karel DSL to guide an agent inside a grid world
Source: Gulwani et al, 2011. Flash-fill DSL for string to string transformation
What Are The Different Types Of
Programming-by-example Frameworks?
Symbolic
❑ Designed to produce correct programs by construction using logical reasoning and a domain-specific language
❑ Produces the intended program with few input-output examples
❑ Requires significant engineering effort and an underlying program search method
❑ Examples: PROSE
Statistical
❑ Use of neural networks to decode transformations directly from encoded input-output string pairs
❑ The generated program may not satisfy the provided specification
❑ The programs are generated using beam search
❑ Requires a lot of training data, generated randomly, which might result in poor generalizability on real-world tasks
❑ Examples: Robust-Fill, NPBE, Deep-Coder
Hybrid
❑ Uses statistical methods to smartly select sub-programs for further exploration; requires additional effort to incorporate branching decisions
❑ No way to retract a wrong decision made during exploration: we might eliminate the correct program in the process
❑ Examples: NGDS, PL meets ML
FlashMeta – A Framework for Inductive Program Synthesis (2015)
▪ Consider the transformation “(425) 706-7709” → “425”.
▪ The output string “425” can be represented in 4 possible ways as
concatenations of the substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’.
▪ Leaf nodes e1, e2, e3, e12, e23 are VSAs (version-space algebras) for the
output substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’ respectively.
Key aspects for symbolic systems:
▪ Independence of problems:
▪ Number of possible sub-problems is quadratic:
▪ Enumerative Search:
Drawback: Takes a lot of time to look for correct
programs in the transformation space
Source: Polozov et al, 2015. Flash Meta – A Framework for Inductive Program Synthesis
[PG15]
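The core idea of keeping only the programs consistent with every example can be sketched with the simplest possible DSL: programs that extract a fixed substring span. Real FlashMeta/PROSE uses a far richer DSL plus version-space algebras; this toy version only shows the intersection step:

```python
def substring_programs(inp, out):
    """All programs 'take inp[s:e]' (encoded as (s, e) pairs) producing out."""
    return {(s, s + len(out))
            for s in range(len(inp) - len(out) + 1)
            if inp[s:s + len(out)] == out}

def synthesize(examples):
    """Keep only the programs consistent with every input-output pair."""
    per_example = [substring_programs(i, o) for i, o in examples]
    return sorted(set.intersection(*per_example))

progs = synthesize([('(425) 706-7709', '425'),
                    ('(617) 555-0100', '617')])
# the surviving program extracts the area code from both inputs
```

With one example many spurious programs survive; each additional example shrinks the set, which is why PbE systems also need a ranking function to pick among the remaining candidates.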
Hybrid Systems
To overcome the issue of symbolic models, hybrid systems have been proposed which
▪ use statistical models to generate probability distribution over DSL operators at each step of the
partially-constructed AST
▪ use most-likely ones to guide deductive search for generating transformation programs.
Program Search: The system uses deductive search to look for programs that satisfy the input-output transformation for all user-provided examples.
Program Ranking: A ranking function is used to pick the top programs from the set of valid programs, based on properties like generalizability, simplicity, etc.
Caveat: There is no way to go back when wrong decisions are made. We might end up with sub-optimal programs or eliminate valid programs completely.
Statistical System: Robust Fill
▪ A deep learning approach based on sequence-to-sequence encoder-decoder networks which performs as well as conventional non-DL PbE approaches like Flash-Fill, with the added motivation of making the learned model more robust to noise in the input data.
Domain Specific Language
Source: Devlin et al, 2017. Robust-Fill
[DUB+17]
Statistical System: Robust Fill
Key aspects:
▪ Use of neural networks with no symbolic guidance unlike hybrid systems.
▪ Use of Beam-Search to decode/predict top-k most likely transformation programs based on joint
probability of decoded DSL operators
▪ Use of synthetic datasets for model training which might lead to poor generalizability in real-world
tasks.
▪ Caveat: No guarantee whether the synthesized program will compile or satisfy user-provided
specifications
[DUB+17]
Statistical System: Robust Fill
Solution: Execution-guided Synthesis
▪ During Beam-Search, execute the partially constructed transformation program on the go and eliminate
ones which produce syntax errors or empty outputs
▪ Restrict DSL vocabulary at each step to enforce grammar correctness
▪ Incorporate execution guidance during training and testing so that program correctness is guaranteed by
construction.
Source: Wang et al, 2018. Execution-Guided Neural Program Synthesis
[KMP+18]
Data Quality Metrics for
Unstructured Text Data
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
What is Unstructured Text?
HTML/XML documents, tweets, ticket data, emails, product reviews
How to Assess Quality of Text?
Different properties of text can be measured to quantify its quality!
▪ Lexical Properties
- Vocabulary, Misspellings etc.
▪ Syntactic Properties
- Grammatical Relations, Syntax Violations etc.
▪ Semantic Properties
- Text Meaningfulness, Text Complexity etc.
▪ Discourse Properties
- Readability, Text Coherence etc.
▪ Distributional Properties
- Outliers, Topics, Embeddings
How to Assess Quality of Text?
Different Approaches can be further broadly classified as
▪ Generic Text quality
▪ Task specific
Generic Text Quality Approaches:
▪ Text Readability
▪ Text Coherence
▪ Text Formality
▪ Text Ambiguity
▪ Text Outliers
▪ Lexical Diversity
▪ Text Cleanliness
▪ Text Appropriateness
▪ Text Complexity
▪ Text Bias
Task Specific Approaches:
▪ Text Classification
▪ Text Generation
▪ Semantic Similarity
▪ Dialog Systems
▪ Label Noise
Automatic Text Scoring Using Neural Networks
Objective:
Automatically assign a score to the text given a marking scale.
Approach:
1. Train a Model to learn word embeddings that capture both context and word-specific scores
2. Train an LSTM-based network using the score-specific word embeddings to output the text score
Task:
Predict scores for essays written by students ranging from Grade 7 to Grade 10
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Score-Specific Word Embeddings:
▪ Learn both context and word specific scores during training.
▪ Context:
▪ For a sequence of words S = (w1,…, wt,…, wn), construct S´ = (w1,…, wc,…, wn | wc ∈ 𝕍, wt ≠ wc)
▪ The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its
‘noisy’ counterpart S´ through a hinge margin loss.
▪ Word Specific Scores:
▪ Apply linear regression on the sequence embedding to predict the essay score
▪ Minimize the mean squared error between the predicted ŷ and the actual essay score y
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Comparison:
▪ Standard Embeddings only consider context and hence place correct and incorrect spellings close to each other.
▪ Score-Specific Embeddings discriminate between spelling mistakes based on low and high essay scores, thus
separating them in the embedding space while preserving the context
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Explainability:
▪ Which words in the input text correspond to a high score vs a low score?
▪ Perform a single pass of an essay from left to right and let the LSTM make its score prediction
▪ Provide a pseudo-high score and a pseudo-low score as ground truth and measure the gradient
magnitudes corresponding to each word
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Automatic Text Scoring Using Neural Networks
Visualizations:
▪ Positive feedback for correctly placed punctuation or long-distance dependencies (Sentence 6 “are … researching”)
▪ Negative feedback for POS mistakes (Sentence 2, “Being patience”)
▪ Incorrect negative feedback (Sentence 3, “reader satisfied”)
[AYR16]
Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
Transparent text quality assessment with
convolutional neural networks
Objective:
Automatically assign a score to the input Swedish text
Approach:
1. Train a CNN with residual connections on character embeddings
2. Utilize a pseudo-probabilistic framework to incorporate axioms on varying quality text vs high
quality text to train the model
Task:
1. Predict the Essay scores and compare against human graders
2. Qualitatively analyze text to identify low vs high scoring regions
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
[ÖG17]
Transparent text quality assessment with
convolutional neural networks
Training:
▪ For a sequence of characters S = (s1, s2,…, sn), construct the character embedding matrix of size |𝕍| × d
▪ Pass the character embedding matrix through the CNN
▪ Final quality score is the dot product of the output weight vector and the mean value of the outputs at the
final residual layer.
▪ Utilize the pseudo-probabilistic framework P(a > b) = σ(q(a) − q(b)), where σ(x) is the logistic function and q(x)
is the network's score for text x
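The pairwise framework P(a > b) = σ(q(a) − q(b)) can be sketched directly; the function names below are assumptions of this sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_better(q_a, q_b):
    # P(a > b) = sigma(q(a) - q(b)): probability that text a is of higher quality than text b
    return sigmoid(q_a - q_b)

def pairwise_loss(q_a, q_b):
    # Negative log-likelihood of an axiom asserting "a is better than b",
    # e.g. a professional-news snippet (a) vs. a blog snippet (b)
    return -math.log(prob_better(q_a, q_b))
```

Minimizing this loss over pairs sampled according to the axioms pushes the network to assign higher scores q(x) to higher-quality text.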
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Transparent text quality assessment with
convolutional neural networks
Axioms:
1. All authors (professional or not) are consistent
2. Professional authors are better than blog authors
3. All professional authors are equal
4. Blog authors are not equal
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Transparent text quality assessment with
convolutional neural networks
Explainability:
▪ Which characters in the input text correspond to a high score vs a low score?
▪ Red encodes low scores, blue encodes high scores. Faded colors are used for scores near zero.
▪ News examples are scored higher than the blog examples
▪ Informal spellings, use of smileys leads to low scores in the blog text
[ÖG17]
Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
Outline
Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Outlier Detection
An outlier in tabular or time-series data can be interpreted as:
▪ Deviating point
▪ Noise
▪ Rarely occurring point
What can be an outlier in text data?
▪ Topically diverse sample?
▪ Gibberish text?
▪ Meaningless sentences?
▪ Samples from other language?
Image credits: Google Images
Outlier Examples in Text Data
▪ Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad
full stop greatest show ever mad full stop
▪ Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went
hill quickly
▪ Incomplete: Suspenseful subtle much much disturbing
Outlier Detection in Text Data
▪ Classical techniques: matrix factorization techniques
▪ DL techniques: self-attention techniques for text representation
Outlier Detection for Text Data
▪ Feature Generation – Simple Bag of Words
Approach
▪ Apply Matrix Factorization – Decompose the
given term matrix into a low rank matrix L and
an outlier matrix Z
▪ Further, L can be expressed as the product of two low-rank factors, L = WH
▪ The l2 norm of a column zx of Z serves as the outlier score for the corresponding document
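The decomposition can be sketched with a truncated SVD standing in for the low-rank factorization. This is an illustrative sketch, not the paper's exact solver: the residual after the low-rank fit plays the role of the outlier matrix Z.

```python
import numpy as np

def outlier_scores(term_doc, rank=1):
    """Illustrative low-rank + residual outlier scoring (sketch, not the paper's NMF solver).

    Approximate the term-document matrix D with a rank-`rank` matrix L via truncated SVD,
    treat the residual Z = D - L as the outlier matrix, and score each document by the
    l2 norm of its residual column."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    Z = term_doc - L
    return np.linalg.norm(Z, axis=0)
```

A document whose term profile is poorly explained by the dominant low-rank structure of the corpus receives a large residual norm, i.e. a high outlier score.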
Source : Kannan et al, 2017. Outlier detection for text data
[KWAP17]
Outlier Detection for Text Data
▪ For experiments, outliers are sampled from a unique class from standard datasets
▪ Receiver operating characteristic (ROC) curves are studied to assess the performance of the
proposed approach.
▪ The approach is useful for identifying outliers even from regular classes.
▪ Patterns such as unusually short/long documents, unique words, repetitive vocabulary
etc. were observed in detected outliers.
[KWAP17]
Source : Kannan et al, 2017. Outlier detection for text data
Unsupervised Anomaly Detection on Text
Multi-Head Self-Attention:
(The slide shows the attention matrix and resulting sentence embedding as equations.)
▪ Proposes a novel one-class classification method which leverages pretrained word
embeddings to perform anomaly detection on text
▪ Given a word embedding matrix H, multi-head self-attention is used to map sentences of
variable length to a collection of fixed size representations, representing a sentence with
respect to multiple contexts
Source : Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
[RZV+19]
Unsupervised Anomaly Detection on Text
(The slide shows the CVDD objective, the context vectors, and the orthogonality constraint as equations.)
▪ These sentence representations are trained along with a collection of context vectors such that
the context vectors and representations are similar while keeping the context vectors diverse
▪ Outlier scoring: the greater the distance of mk(H) to ck, the more anomalous the sample w.r.t. context k
[RZV+19]
Source : Ruff et al 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
More Generic Text Quality Approaches
Local Text Coherence:
▪ [MS18] Mesgar et al. "A neural local coherence model for text quality assessment." EMNLP, 2018.
Text Formality:
▪ [JMAS19] Jain et al. "Unsupervised controllable text formalization." AAAI, 2019.
Text Bias:
▪ [RDNMJ13] Recasens et al. "Linguistic models for analyzing and detecting biased language." ACL, 2013.
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Identifying Non-Obvious Cases in Semantic
Similarity Datasets
Objective:
▪ Identify obvious and non-obvious text pairs from semantic similarity datasets
Approach:
1. Compare the Jensen-Shannon divergence (JSD) of the text snippets against the median JSD of
the entire distribution
2. Classify the text snippets based on divergence from the median
Task:
▪ Analyze model performance numbers on SemEval Community Question Answering for both
obvious and non-obvious cases
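The divergence between a text pair's word distributions can be computed as below. This is a minimal sketch using whitespace tokenization; the paper's exact preprocessing is not reproduced here.

```python
import math
from collections import Counter

def jsd(text_a, text_b):
    """Jensen-Shannon divergence (base-2 logs) between the word
    distributions of two text snippets; ranges from 0 to 1."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    na, nb = sum(ca.values()), sum(cb.values())
    total = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = (p + q) / 2  # mixture distribution
        if p:
            total += 0.5 * p * math.log2(p / m)
        if q:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Pairs whose JSD sits below the corpus median with a positive label (or above it with a negative label) are the lexically "obvious" cases; the remainder are non-obvious.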
Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
[PLN19]
Identifying Non-Obvious Cases in Semantic
Similarity Datasets
Lexical divergence distribution by labels across datasets:
▪ The datasets differ with respect to the degree of lexical divergence
they contain
▪ Pairs with negative labels tend to have higher JSD scores than pairs
with positive labels
▪ Model performance numbers (TPR/TNR/F1) should be reported separately for obvious and
non-obvious cases, since the non-obvious cases are substantially harder.
Examples of Obvious and Non-
Obvious Semantic Text Pairs:
[PLN19]
Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
Improving Neural Conversational Models
with Entropy-Based Data Filtering
Objective:
▪ Filter generic utterances from dialog datasets to improve model performance
Approach:
▪ Devise a data filtering method that identifies open-ended inputs by calculating the entropy of the
distribution over target utterances
Task:
▪ Measure custom metric scores such as response length, KL divergence, Coherence, Bleu score
etc. for standard dialog datasets based on source/target/both filtering mechanisms
Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
[CPR19]
Improving Neural Conversational Models
with Entropy-Based Data Filtering
Examples:
1. Source = “What did you do today?”
• has high entropy, since it is paired with many different responses in the data
2. Source = “What color is the sky?”
• has low entropy since it’s observed with few responses in the data
3. Target = “I don’t know”
• has high entropy, since it is paired with many different inputs in the data
4. Target = “Try not to take on more than you can handle”
• has low entropy, since it is paired with few inputs in the data
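The entropy computation behind these examples can be sketched as follows. This is an illustrative sketch using exact string matching of source utterances; the paper also clusters semantically similar utterances, which is omitted here.

```python
import math
from collections import Counter, defaultdict

def source_entropy(pairs):
    """Entropy of the response distribution for each source utterance in a
    (source, target) dialog corpus."""
    by_source = defaultdict(list)
    for src, tgt in pairs:
        by_source[src].append(tgt)
    result = {}
    for src, tgts in by_source.items():
        n = len(tgts)
        result[src] = -sum(c / n * math.log2(c / n) for c in Counter(tgts).values())
    return result

def filter_open_ended(pairs, max_entropy=1.0):
    # Drop pairs whose source is open-ended, i.e. paired with too diverse a set of targets
    ent = source_entropy(pairs)
    return [(s, t) for s, t in pairs if ent[s] <= max_entropy]
```

The same computation applied to targets (entropy over the sources each target appears with) gives target-side filtering; applying both gives the "both" strategy.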
[CPR19]
Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Understanding the Difficulty of Text Classification Tasks
Proposes an approach to design data quality metrics that explain the complexity of a given dataset
for the classification task
▪ Considers various data characteristics to generate a 48-dim
feature vector for each dataset.
▪ Data characteristics include
▪ Class Diversity: Count based probability distribution of classes in
the dataset
▪ Class Imbalance: Σ_{c=1..C} | 1/C − n_c / T_data |, where n_c is the number of samples in class c
and T_data the total number of samples
▪ Class Interference: Similarities among samples belonging to
different classes
▪ Data Complexity: Linguistic properties of data samples
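The class imbalance characteristic above can be computed directly from the label column; a minimal sketch:

```python
from collections import Counter

def class_imbalance(labels):
    """Class imbalance measure: sum over classes of |1/C - n_c/N|.

    Equals 0 for a perfectly balanced dataset and grows toward
    2(C-1)/C as a single class dominates."""
    counts = Counter(labels)
    C, N = len(counts), len(labels)
    return sum(abs(1 / C - n / N) for n in counts.values())
```

For example, a two-class dataset split 3:1 scores 0.5, while an even split scores 0.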
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
[CRZ18]
The feature vector for a given dataset covers
quality properties such as
▪ Class Diversity (2-Dim)
▪ Shannon Class Diversity
▪ Shannon Class Equitability
▪ Class Imbalance (1-Dim)
▪ Class Interference (24-Dim)
▪ Hellinger Similarity
▪ Top N-Gram Interference
▪ Mutual Information
▪ Data Complexity (21-Dim)
▪ Distinct n-gram : Total n-gram
▪ Inverse Flesch Reading Ease
▪ N-Gram and Character diversity
Understanding the Difficulty of Text Classification Tasks
▪ The authors propose using genetic algorithms to intelligently explore the 2^48 possible feature combinations
▪ The fitness function for the genetic algorithm was Pearson correlation between difficulty score and
model performance on test set
▪ 89 datasets were considered for evaluation and 12 different types of models on each dataset
▪ The effectiveness of a given combination of metrics is measured using its correlation with the
performance of various models on various datasets
▪ The stronger the negative correlation of a metric with model performance, the better the metric explains data
complexity
[CRZ18]
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
Understanding the Difficulty of Text Classification Tasks
Difficulty Measure D2 =
Distinct Unigrams : Total Unigrams + Class Imbalance +
Class Diversity + Maximum Unigram Hellinger Similarity
+ Unigram Mutual Info.
Correlation = −0.8814
[CRZ18]
Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
Outline
▪ Motivation
▪ What is Unstructured Text?
▪ How to Assess Text Quality?
▪ Prior-Art Approaches
▪ Generic Text Quality Approaches
▪ Text Quality Metrics for Outlier Detection
▪ Text Quality Metrics for Corpus Filtering
▪ Quality Metrics for Text Classification
▪ Future Directions
▪ Next Steps
Next Steps
How to assess overall text quality?
▪ Framework that allows user to assess text quality across various dimensions
▪ Standardized set of quality metrics, each outputting a score that indicates low/high quality
▪ Provide insights into the specific samples in the dataset which contribute to a low/high
score
▪ Specific recommendations for addressing poor text quality and evidence of model
performance improvement
Next Steps
How to assess quality of NLP models independent of task?
▪ Held-out datasets are often not comprehensive
▪ Contain the same biases as the training data
▪ Real-world performance may be overestimated
▪ Difficult to figure out where the model is failing and how to fix it
[RWGS20]
Source : Ribeiro et al, 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Human In Loop Techniques for Data
Quality Assessments
Humans in the Loop
▪ Human intervention in data quality analysis
▪ Validation and feedback
▪ Possible enhancements
▪ Data readiness assessment
▪ Metrics for structured data
▪ Metrics for unstructured data
Human Input in the AI Lifecycle
Supporting Data Operations
Keywords:
▪ Preparation
▪ Integration
▪ Cleaning
▪ Curation
▪ Transformation
▪ Manipulation
Notable tools:
▪ Trifacta ~ Potter’s Wheel
▪ OpenRefine
▪ Wrangler
➢ How to validate that data is clean enough
➢ How to delegate sub-tasks between
respective role players / data workers
➢ How to enable reuse of human input and
codifying tacit knowledge
➢ How to provide cognitive and not just
statistical support
[Ham13]
[KPHH11]
[RH01]
Example: Label Noise Remediation
1. Identify noisy samples
▪ Check target columns / attributes to detect suspect noisy or anomalous labels
▪ E.g. a target column with the following entries:
A | B | Label
Abc | def | 1
Abc | def | 2
Abc | def | 1
Abc | def | 1
2. Visual preview of noisy data
▪ Sampling to show a presentable number of records in batches
▪ Show suspected noisy class labels and predicted labels in the preview:
A | B | Label | Predicted
Abc | def | 1 | 1
Abc | def | 2 | 1
Abc | def | 1 | 1
Abc | def | 1 | 1
3. Remediation input from HIL: review, correct, approve
▪ The HIL reviews whether a row is noisy, verifies whether the predicted label is correct, and
additionally marks the pertinent columns
4. Document and preserve
▪ Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge
▪ In future, better prediction using pertinent columns and user actions
Example: Data Homogeneity and Transformation
1. Identify heterogeneous samples
▪ Check target columns / attributes to detect variation in sample data
▪ E.g. a target column with the entries: <=50K, <=50K., >50K, >50K.
▪ E.g. a gender column with the entries: male, M, female, F
2. Visual preview of problematic data
▪ Sampling to show a presentable number of records; give an explanation; viewing options to reduce human effort
▪ Show unique class labels and ask for correction/verification
▪ Show distinct formats and ask for correction/verification
3. Transformation input from HIL: review, correct, approve
▪ The HIL sees the anomalous '.' appearing at the end of some labels and provides the corresponding transformation
▪ Rectify heterogeneous data formats and specify a rule for the transformation
4. Document and preserve
▪ Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge
▪ In future, suggest remedial actions and possible remediation rules
Next Steps…
Shared collaborative workspace to streamline, regulate and document human input/expertise
Focus on reproducibility, traceability, and verifiability
▪ Easy-to-use specification interface: e.g. graphical, by example, or in natural language instead of complex grammars or regular expressions
▪ Visual preview: appropriate visualisation of relevant data chunks to see the changes
▪ Minimum latency: the effect of a transformation should be quickly visible
▪ Clear lineage: ability to view what changes have been done previously and by whom; documentation of operations and observed effects
▪ Enable assessment of data operations: basic analysis using default ML models to show the impact of changes
▪ Data versioning: to enable experiments for trial and testing, e.g. the ability to undo changes
▪ Streamlined collaboration: task lists based on persona for a shared data cleaning pipeline
[MLW+19]
[KHFW16]
[ROE+19]
Summary
▪ Addressing data quality issues before data enters an ML pipeline allows
taking remedial actions that reduce model-building effort and turnaround
times.
▪ Measuring the data quality in a systematic and objective manner,
through standardised metrics for example, can improve its reliability and
permit informed reuse.
▪ There are distinct metrics to quantify the data issues and anomalies
based on whether the data is structured or unstructured.
▪ Various data workers or personas are involved in the data quality
assessment pipeline depending on the level and type of human
intervention required.
▪ Streamlining the collaboration between standardized metrics and
human-in-the-loop inputs can serve as a powerful framework to
integrate and optimize the data quality assessment process.
We invite you to join us on this agenda towards a data-centric approach to analyzing data quality.
Contact Hima Patel with your ideas and enquiries.
References
[BF99] Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of artificial intelligence research, 11:131–167,
1999.
[ZWC03] Xingquan Zhu, Xindong Wu, and Qijun Chen. Eliminating class noise in large datasets. In Proceedings of the 20th
International Conference on Machine Learning (ICML-03), pages 920–927, 2003.
[DLG17] Junhua Ding, XinChuan Li, and Venkat N Gudivada. Augmentation and evaluation of training data for deep learning. In 2017
IEEE International Conference on Big Data (Big Data), pages 2603–2611. IEEE, 2017.
[ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In
IAL@PKDD/ECML, 2018.
[NJC19] Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint
arXiv:1911.00068, 2019.
[Coo77] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 1977.
[EGH17] Rajmadhan Ekambaram, Dmitry B Goldgof, and Lawrence O Hall. Finding label noise examples in large scale datasets.
In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017.
References
[GZ19] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. arXiv preprint
arXiv:1904.02868, 2019
[YAP19] Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Data valuation using reinforcement learning. arXiv preprint
arXiv:1909.11671, 2019.
[DKO+07] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, and Suresh Venkatasubramanian. Column heterogeneity
as a measure of data quality. 2007
[sim] http://www.countrysideinfo.co.uk/simpsons.htm.
[DHI12] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of data integration. Elsevier, 2012.
[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781, 2013
[ALI15] Aida Ali, Siti Mariyam Hj. Shamsuddin, and Anca L. Ralescu. Classification with class imbalance problem: A review. SOCO,
2015.
References
[PG15] Oleksandr Polozov and Sumit Gulwani. Flashmeta: a framework for inductive program synthesis. In Proceedings of the 2015
ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 107–126,
2015.
[DUB+17] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli.
Robustfill: Neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine Learning-Volume
70, pages 990–998, 2017.
[KMP+18] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided
deductive search for real-time program synthesis from examples. arXiv preprint arXiv:1804.01186, 2018.
[AYR16] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. Automatic text scoring using neural networks. In Proceedings of
the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany,
August 2016. Association for Computational Linguistics.
[CPR19] Richard Csaky, Patrik Purgai, and Gabor Recski. Improving neural conversational models with entropy-based data filtering.
arXiv preprint arXiv:1905.05471, 2019.
References
[CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text
classification tasks. arXiv preprint arXiv:1811.01910, 2018.
[JMAS19] Parag Jain, Abhijit Mishra, Amar Prakash Azad, and Karthik Sankaranarayanan. Unsupervised controllable text
formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6554–6561, 2019.
[KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C Aggarwal, and Haesun Park. Outlier detection for text data. In
Proceedings of the 2017 siam international conference on data mining, pages 489–497. SIAM, 2017.
[MS18] Mohsen Mesgar and Michael Strube. A neural local coherence model for text quality assessment. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 4328–4339, 2018.
[ÖG17] Robert Östling and Gintarė Grigonytė. Transparent text quality assessment with convolutional neural networks. In
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 282–286, 2017.
References
[PLN19] Nicole Peinelt, Maria Liakata, and Dong Nguyen. Aiming beyond the obvious: Identifying non-obvious cases in semantic
similarity datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2792–2798,
2019.
[RDNMJ13] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting
biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1650–1659, 2013.
[RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of nlp
models with checklist. arXiv preprint arXiv:2005.04118, 2020.
[RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context
one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 4061–4071, 2019.
[RH01] Vijayshankar Raman and Joseph M Hellerstein. Potter’s wheel: An interactive data cleaning system. In VLDB, volume 1,
pages 381–390, 2001.
[ROE+19] El Kindi Rezig, Mourad Ouzzani, Ahmed K Elmagarmid, Walid G Aref, and Michael Stonebraker. Towards an end-to-end
human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–7, 2019.
References
[Ham13] Kelli Ham. OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data.
Journal of the Medical Library Association: JMLA, 101(3):233, 2013.
[KHFW16] Sanjay Krishnan, Daniel Haas, Michael J Franklin, and Eugene Wu. Towards reliable interactive data cleaning: A user
survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–5, 2016.
[KPHH11] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data
transformation scripts. In ACM Human Factors in Computing Systems (CHI), 2011.
[MLW+19] Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q Vera Liao, Casey Dugan, and Thomas
Erickson. How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI
Conference on Human Factors in Computing Systems, pages 1–15, 2019.
[BTR15] Paula Branco, Luis Torgo, and Rita Ribeiro. A survey of predictive modelling under imbalanced distributions. arXiv preprint
arXiv:1505.01658, 2015.
References
[DT10] Misha Denil and Thomas Trappenberg. Overlap versus imbalance. In Canadian conference on artificial intelligence, pages
220–231. Springer, 2010.
[GBSW04] Jerzy W Grzymala-Busse, Jerzy Stefanowski, and Szymon Wilk. A comparison of two approaches to data mining from
imbalanced data. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 757–
763. Springer, 2004
[Jap03] Nathalie Japkowicz. Class imbalances: are we focusing on the right issue. 2003.
[JJ04] Taeho Jo and Nathalie Japkowicz. Class imbalances versus small disjuncts. ACM Sigkdd Explorations Newsletter, 6(1):40–49,
2004.
[Kra16] Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. In Progress in Artificial
Intelligence, 2016.
[LCT19] Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. Bayes imbalance impact index: A measure of class imbalanced data set for
classification problem. IEEE Transactions on Neural Networks and Learning Systems, 2019
[WP03] Gary M Weiss and Foster Provost. Learning when training data are costly: The effect of class distribution on tree induction.
Journal of Artificial Intelligence Research, 19:315–354, 2003.
 
Data mining knowing the unknown
Data mining knowing the unknownData mining knowing the unknown
Data mining knowing the unknown
 
CMDB in Cloud Times: Myths, Mistakes, and Mastery
CMDB in Cloud Times: Myths, Mistakes, and Mastery CMDB in Cloud Times: Myths, Mistakes, and Mastery
CMDB in Cloud Times: Myths, Mistakes, and Mastery
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
How Does Data Create Economic Value_ Foundations For Valuation Models.pdf
How Does Data Create Economic Value_ Foundations For Valuation Models.pdfHow Does Data Create Economic Value_ Foundations For Valuation Models.pdf
How Does Data Create Economic Value_ Foundations For Valuation Models.pdf
 
vodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applicationsvodQA Pune (2019) - Testing AI,ML applications
vodQA Pune (2019) - Testing AI,ML applications
 
Robotic Process Automation
Robotic Process AutomationRobotic Process Automation
Robotic Process Automation
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 

Recently uploaded

一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
benishzehra469
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
pyhepag
 

Recently uploaded (20)

Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptxMALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
MALL CUSTOMER SEGMENTATION USING K-MEANS CLUSTERING.pptx
 
Machine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptxMachine Learning For Career Growth..pptx
Machine Learning For Career Growth..pptx
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
Machine Learning for Accident Severity Prediction
Machine Learning for Accident Severity PredictionMachine Learning for Accident Severity Prediction
Machine Learning for Accident Severity Prediction
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 
Slip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp ClaimsSlip-and-fall Injuries: Top Workers' Comp Claims
Slip-and-fall Injuries: Top Workers' Comp Claims
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
2024-05-14 - Tableau User Group - TC24 Hot Topics - Tableau Pulse and Einstei...
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 

Overview and Importance of Data Quality for Machine Learning Tasks

  • 1. Overview and Importance of Data Quality for Machine Learning Tasks KDD Tutorial / © 2020 IBM Corporation Nitin Gupta Shashank Mujumdar Shazia Afzal Hima Patel IBM Research India
  • 2. Acknowledgements KDD Tutorial / © 2020 IBM Corporation Abhinav Jain Ruhi Sharma Sameep Mehta Lokesh N Shanmukh Guttula Vitobha Munigala
  • 3. Agenda Why is data quality measurement important for Machine Learning Applications? Part I : Discussion on quality metrics for structured/tabular data Part II : Discussion on quality metrics for unstructured text data Part III : Discussion on Human In Loop for Data Quality Measurements Conclusion KDD Tutorial / © 2020 IBM Corporation
  • 4. Need for Data Quality for Machine Learning KDD Tutorial / © 2020 IBM Corporation
  • 5. Data Preparation in Machine Learning KDD Tutorial / © 2020 IBM Corporation “Data collection and preparation are typically the most time-consuming activities in developing an AI-based application, much more so than selecting and tuning a model.” – MIT Sloan Survey https://sloanreview.mit.edu/projects/reshaping-business-with-artificial-intelligence/ “Data preparation accounts for about 80% of the work of data scientists” – Forbes https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#70d9599b6f63
  • 6. Challenges with Data Preparation KDD Tutorial / © 2020 IBM Corporation
  • 7. Challenges with Data Preparation KDD Tutorial / © 2020 IBM Corporation Iterative debugging which is cumbersome and time consuming
  • 8. Data Quality Analysis can help.. KDD Tutorial / © 2020 IBM Corporation ▪ Know the issues in the data beforehand, e.g., noise in labels, overlap between classes. ▪ Make informed choices for data preprocessing and model selection. ▪ Reduce turnaround time for data science projects.
  • 9. Different personas in enterprise setting.. KDD Tutorial / © 2020 IBM Corporation Source : Google Images
  • 10. To put it all together KDD Tutorial / © 2020 IBM Corporation Data Assessment and Readiness Module ▪ Need for algorithms that can assess training datasets ▪ Across modalities: structured, unstructured, time series, etc. ▪ Allow for complex interaction between the different personas and human-in-the-loop techniques ▪ Need for automation
  • 11. To summarize: KDD Tutorial / © 2020 IBM Corporation George Fuechsel, IBM 305 RAMAC technician A lot of progress has been made in the last several years on improving ML algorithms, including building automated machine learning toolkits (AutoML). However, the quality of an ML model is directly proportional to the quality of its data. Hence, there is a need for a systematic study of measuring the quality of data with respect to machine learning tasks.
  • 12. Data Quality Metrics for Structured Data KDD Tutorial / © 2020 IBM Corporation
  • 13. Data Quality Metrics ▪ We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 14. Data Cleaning ▪ What are some common data cleaning techniques used for machine learning? ▪ Do data cleaning techniques always help in building a machine learning pipeline? ▪ Joint cleaning and model building techniques KDD Tutorial / © 2020 IBM Corporation
  • 15. Common Data Cleaning Techniques Data Cleaning Issues KDD Tutorial / © 2020 IBM Corporation Quantitative Qualitative Integrity Constraints based errors (not covered) - Functional Dependency Constraints - Denial Constraints - Others .. - Missing Values - Outliers - Noise in the data - Noise in labels - Data Linters - …
  • 16. Is data cleaning always helpful for ML pipeline? ▪ Question: what is the impact of different cleaning methods on ML models? ▪ Experiment setup: ▪ Datasets: 13 real world datasets ▪ Five error types: outliers, duplicates, inconsistencies, mislabels and missing values ▪ Seven classification algorithms: Logistic Regression, Decision Tree, Random Forest, Adaboost, XGBoost, k- Nearest Neighbors, and Naïve Bayes KDD Tutorial / © 2020 IBM Corporation [LXJ19] Cleaning methods used for error types Image Source: Li et al, 2019. CleanML: A Benchmark for Joint Data Cleaning and Machine Learning [Experiments and Analysis]
  • 17. Insights: Impact of different cleaning techniques KDD Tutorial / © 2020 IBM Corporation [LXJ19] Impact, if cleaned* Negative Impacts* Inconsistencies Insignificant No Outliers Insignificant/ Positive Possible Duplicates Insignificant Yes Missing Values Positive If imputation is far away from ground truth Mislabels Positive If dataset accuracy < 50% *based on general patterns across the datasets and different train test settings
  • 18. In conclusion ▪ Data cleaning does not necessarily improve the quality of downstream ML models ▪ Impact depends on different factors: ▪ Cleaning algorithm and its set of parameters ▪ ML model dependent ▪ Order of cleaning operators ▪ Model selection and cleaning algorithm can increase robustness of impacts – no one solution! KDD Tutorial / © 2020 IBM Corporation [LXJ19] [LAU19]
  • 19. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 20. Class Imbalance Unequal distribution of classes within a dataset KDD Tutorial / © 2020 IBM Corporation Source: https://towardsdatascience.com/credit-card-fraud-detection-a1c7e1b75f59
  • 21. Why it happens? Expected in domains where data with one pattern is more common than data with other patterns. KDD Tutorial / © 2020 IBM Corporation Source: Krawczyk et al, 2016. Learning from imbalanced data: open challenges and future directions [Kra16]
  • 22. Why Imbalanced Classification is Hard? KDD Tutorial / © 2020 IBM Corporation Classifiers assume the data to be balanced Unequal cost of misclassification errors: false negatives are more important than false positives The minority class is more important for data mining but is given less priority by the learning algorithm Minority class instances can be detected as noise
  • 23. Evaluation Metrics for Imbalanced Datasets Accuracy Paradox KDD Tutorial / © 2020 IBM Corporation With a 90% majority / 10% minority split, a learning algorithm that always predicts the majority class achieves 90% accuracy. Reasonable? Prefer metrics such as F1 score, recall, AUC-PR, AUC-ROC, G-mean, etc. Source: https://en.wikipedia.org/wiki/Accuracy_paradox
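To make the accuracy paradox concrete, here is a minimal pure-Python sketch (the 90/10 split and the "always predict majority" classifier are hypothetical illustrations, not from the tutorial's experiments):

```python
# Accuracy paradox sketch: a degenerate classifier that always predicts
# the majority class scores 90% accuracy on a 90/10 dataset, yet has
# zero recall and zero F1 on the minority class.
y_true = [0] * 90 + [1] * 10          # 90% majority (0), 10% minority (1)
y_pred = [0] * 100                    # "always predict majority"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
precision = tp / (tp + fp) if tp + fp else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy, recall, f1)  # accuracy is 0.9 while recall and F1 are 0.0
```

The minority-class-aware metrics expose what accuracy hides, which is why F1, recall, and AUC-style scores are preferred on imbalanced data.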
  • 24. Factors affecting class imbalance ▪ Imbalance Ratio ▪ Overlap between classes ▪ Smaller sub-concepts ▪ Dataset Size ▪ Label Noise ▪ Combination ▪ … KDD Tutorial / © 2020 IBM Corporation 24
  • 25. Affecting Factor: Imbalance Ratio KDD Tutorial / © 2020 IBM Corporation As the imbalance ratio increases: the learning algorithm becomes more biased towards the majority class; minority samples are more likely to be considered noise; and in the confusion region, priority is given to the majority class. Is the imbalance ratio the only cause of performance degradation in learning from imbalanced data? NO Source: https://towardsdatascience.com/sampling-techniques-for-extremely-imbalanced-data-281cc01da0a8 [WP03] [GBSW04]
  • 26. Affecting Factor: Overlap KDD Tutorial / © 2020 IBM Corporation Priority given to majority class in the overlapping region Minority data points are considered as noise due to a smaller number of data points from minority class in the overlapping region Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 27. Affecting Factor: Smaller sub-concepts It is difficult to discern separate groupings with so few examples in minority class KDD Tutorial / © 2020 IBM Corporation Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 28. Affecting Factor: Dataset Size KDD Tutorial / © 2020 IBM Corporation As the dataset size decreases, the evaluation measures deteriorate and the underlying structure of the minority class distribution becomes difficult to understand Source: Ali et al, 2015. Classification with class imbalance problem: A Review [ALI15]
  • 29. Affecting Factor: Combined Effect KDD Tutorial / © 2020 IBM Corporation Combined effect of varying dataset size, overlap, and imbalance ratio Source: Denil et al, 2010. Overlap versus Imbalance [DT10]
  • 30. Modelling Strategies: Types KDD Tutorial / © 2020 IBM Corporation Our Focus Source: Paula et al, 2015. A Survey of Predictive Modelling under Imbalanced Distributions [BTR15]
  • 31. Resampling Techniques • Oversampling • Undersampling • Ensemble Based Techniques KDD Tutorial / © 2020 IBM Corporation Does an imbalance recovery method always help? Or is the impact of an imbalance recovery method the same on all datasets?
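As a concrete example of the first family above, here is a minimal sketch of naive random oversampling in pure Python (the function name and toy data are illustrative; production work would typically use a library such as imbalanced-learn, which also offers SMOTE and other variants):

```python
import random

def random_oversample(X, y):
    """Naive random oversampling: duplicate minority-class samples
    (drawn with replacement) until every class matches the majority
    class count. A sketch of the idea only, not a library API."""
    random.seed(0)  # deterministic for the demo
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [random.choice(rows) for _ in range(target - len(rows))]
        for xi in rows + extra:
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out

X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2                  # imbalance ratio 4:1
X_bal, y_bal = random_oversample(X, y)
print(y_bal.count(0), y_bal.count(1))  # classes are now balanced: 8 8
```

Note that duplicating minority points can encourage overfitting, which is one reason the slide asks whether recovery methods always help.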
  • 32. Bayes Impact index KDD Tutorial / © 2020 IBM Corporation Totally Overlapped: No Impact Separable: No Impact Partially Overlapped: High Impact Source: Yang et al, 2019. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem [LCT19]
  • 33. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 34. Label Noise KDD Tutorial / © 2020 IBM Corporation Given Label – Iris-setosa Correct Label – Iris-virginica (based on attribute analysis) ▪ Most large datasets, whether generated or annotated, contain some noisy labels. ▪ In this metric we discuss how one can identify these label errors and correct them to model the data better. There are at least 100,000 label issues in ImageNet! Source: https://l7.curtisnorthcutt.com/confident-learning
  • 35. Effects of Label Noise Possible Sources of Label Noise: ▪ Insufficient information provided to the labeler ▪ Errors in the labelling itself ▪ Subjectivity of the labelling task ▪ Communication/encoding problems Label noise can have several effects: ▪ Decrease in classification performance ▪ Pose a threat to tasks like feature selection ▪ In online settings, new labelled data may contradict the original labelled data KDD Tutorial / © 2020 IBM Corporation
  • 36. Label Noise Techniques Label Noise Algorithm Level Approaches Data Level Approaches ▪ Designing robust algorithms that are insensitive to noise ▪ Not directly extensible to other learning algorithms ▪ Requires changing an existing method, which is neither always possible nor easy to develop ▪ Filtering out noise before passing data to the underlying ML task ▪ Independent of the classification algorithm ▪ Helps in improving classification accuracy and reducing model complexity KDD Tutorial / © 2020 IBM Corporation
  • 37. Label Noise Techniques ‘ Label Noise Algorithm Level Approaches Data Level Approaches ▪ Learning with Noisy Labels (NIPS-2014) ▪ Robust Loss Functions under Label Noise for Deep Neural Networks (AAAI-2017) ▪ Probabilistic End-To-End Noise Correction for Learning With Noisy Labels (CVPR-2019) ▪ Can Gradient Clipping Mitigate Label Noise? (ICLR-2020) ▪ Identifying mislabelled training data (Journal of artificial intelligence research 1999) ▪ On the labeling correctness in computer vision datasets (IAL 2018) ▪ Finding label noise examples in large scale datasets (SMC 2017) ▪ Confident Learning: Estimating Uncertainty in Dataset Label (Arxiv -2019) KDD Tutorial / © 2020 IBM Corporation
  • 38. Filter Based Approaches ▪ On the labeling correctness in computer vision datasets [ARK18] ▪ Published in Interactive Adaptive Learning 2018 ▪ Train CNN ensembles to detect incorrect labels ▪ Voting schemes used: ▪ Majority Voting KDD Tutorial / © 2020 IBM Corporation ▪ Identifying mislabelled training data [BF99] ▪ Published in Journal of Artificial Intelligence Research, 1999 ▪ Train filters on parts of the training data in order to identify mislabeled examples in the remaining training data ▪ Voting schemes used: ▪ Majority Voting ▪ Consensus Voting ▪ Finding label noise examples in large scale datasets [EGH17] ▪ Published in International Conference on Systems, Man, and Cybernetics, 2017 ▪ Two-level approach ▪ Level 1 – Train an SVM on the unfiltered data. All support vectors are potential candidates for noisy points. ▪ Level 2 – Train a classifier on the original data without the support vectors. Samples for which the original and the predicted labels do not match are marked as label noise. Source: Brodley et al, 1999. Identifying mislabeled training data
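The voting idea behind these filters can be sketched in a few lines. This is a minimal illustration in the spirit of the ensemble filters above, not any paper's actual pipeline; the three prediction vectors stand in for hypothetical cross-validated ensemble members:

```python
def majority_vote_filter(labels, ensemble_preds):
    """Flag indices whose given label is contradicted by a majority of
    ensemble members (majority voting). Under consensus voting, a
    sample would be flagged only when *all* members disagree."""
    flagged = []
    n_models = len(ensemble_preds)
    for i, given in enumerate(labels):
        disagree = sum(1 for preds in ensemble_preds if preds[i] != given)
        if disagree > n_models / 2:
            flagged.append(i)
    return flagged

labels = [0, 1, 0, 1]
ensemble_preds = [        # rows: predictions of 3 hypothetical models
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 1],
]
print(majority_vote_filter(labels, ensemble_preds))  # [2]: sample 2 looks mislabeled
```

Majority voting flags more samples (higher recall of noise, more false alarms); consensus voting is the conservative end of the same spectrum.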
  • 39. Eliminating Class Noise in Large Datasets – ICML 2003 Criteria for a rule to be a GR (good rule): ▪ It should have classification precision on its own partition above a threshold ▪ It should have minimum coverage on its own partition ▪ Coverage – how many examples (noise or not) the rule classifies confidently KDD Tutorial / © 2020 IBM Corporation Source: Zhu et al, 2003. Eliminating Class Noise in Large Datasets [ZWC03]
  • 40. Confident Learning: Estimating Uncertainty in Dataset Label -2019 The Principles of Confident Learning ▪ Prune to search for label errors ▪ Count to train on clean data ▪ Rank which examples to use during training KDD Tutorial / © 2020 IBM Corporation In case of label collision Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 41. KDD Tutorial / © 2020 IBM Corporation Confident Learning: Estimating Uncertainty in Dataset Label -2019 Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 42. KDD Tutorial / © 2020 IBM Corporation Confident Learning: Estimating Uncertainty in Dataset Label -2019 Source: Northcutt et al, 2019. Confident Learning: Estimating Uncertainty in Dataset Label [NJC19]
  • 43. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 44. Data Valuation The value of a training datum is how much impact it has on the predictor's performance. KDD Tutorial / © 2020 IBM Corporation Prediction Model Valuation Model Training Data Validation Data Data Valuation High Valued Low Valued Source: Google Images
  • 45. KDD Tutorial / © 2020 IBM Corporation Is the impact of all the cat images from the training set the same? A: learning algorithm; V: validation set; x: training datum; P(x, A, V): performance of A on V using x; Val(x, A, V): impact of x on the performance of A on V. If P(image₁, A, V) == P(image₂, A, V), is Val(image₁, A, V) == Val(image₂, A, V)? Source: Google Images
  • 46. Scenarios in which datum has low value: ▪ Incorrect label ▪ Input is noisy or low quality ▪ Usefulness for target task KDD Tutorial / © 2020 IBM Corporation Label: Dog Label: Dog Label: Dog Label: Dog Validation Set Source: Google Images
  • 47. Application ▪ Identifying Data Quality ▪ High value data ➔ Significant contribution ▪ Low value data ➔ Noise or outliers or not useful for task ▪ Domain Adaptation ▪ Data Gathering KDD Tutorial / © 2020 IBM Corporation
  • 48. Leave One Out KDD Tutorial / © 2020 IBM Corporation A: learning algorithm; V: validation set; T: training set; x: training datum; P(T, A, V): performance of A on V when trained on T; Val(x, A, V): impact of x on the performance of A on V. Val(x, A, V) = P(T, A, V) − P(T − {x}, A, V): the value of training datum x is the performance of learning algorithm A on validation set V when trained on the full data T including x, minus the performance when trained on T excluding x. [Coo77] Source: Cook et al, 1977. Detection of influential observation in linear regression
  • 49. Example KDD Tutorial / © 2020 IBM Corporation 𝑉𝑎𝑙( , 𝐴, 𝑉) = 𝑃( , 𝐴, 𝑉) − 𝑃( , 𝐴, 𝑉) 0.032 = 0.80 – 0.768 Source: Google Images [Coo77]
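The leave-one-out definition above translates directly into code. A minimal sketch, where a toy 1-nearest-neighbour accuracy stands in for "train algorithm A and score it on V" (the data and helper names are illustrative, not from the tutorial):

```python
def loo_value(x_idx, train, valid, performance):
    """Leave-one-out value of training datum x:
    Val(x, A, V) = P(T, A, V) - P(T - {x}, A, V)."""
    full = performance(train, valid)
    without = performance(train[:x_idx] + train[x_idx + 1:], valid)
    return full - without

def knn1_accuracy(train, valid):
    # Toy stand-in for a learning algorithm: 1-NN on a 1-D feature.
    correct = 0
    for vx, vy in valid:
        nearest = min(train, key=lambda t: abs(t[0] - vx))
        correct += nearest[1] == vy
    return correct / len(valid)

train = [(0.0, 0), (1.0, 0), (9.0, 1), (5.0, 1)]   # (feature, label)
valid = [(0.5, 0), (8.0, 1), (4.5, 0)]
values = [loo_value(i, train, valid, knn1_accuracy) for i in range(len(train))]
print(values)  # the point (5.0, 1), which misleads the validation set, gets a negative value
```

A negative LOO value means the model performs better without the point, which is exactly the signal used to surface noisy or harmful training data.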
  • 50. Implications ➢ Doesn’t capture complex interactions between subsets of data KDD Tutorial / © 2020 IBM Corporation 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 𝑉𝑎𝑙( , 𝐴, 𝑉) = 0 Reasonable ? Source: Google Images [Coo77]
  • 51. Data Shapley KDD Tutorial / © 2020 IBM Corporation Valuation of a datum should come from the way it coalesces with other data in the dataset Valuation of is derived from its marginal contribution to all other subsets in the data [GZ19] Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning
  • 52. Shapley Values – Formulation KDD Tutorial / © 2020 IBM Corporation φ_i = C · Σ_{S ⊆ D∖{i}} [v(S ∪ {i}) − v(S)] / C(n−1, |S|), where φ_i is the Shapley value of datum i, C is a constant, v(S) is the value of subset S alone, v(S ∪ {i}) is the value of i together with S, C(n−1, |S|) is the binomial coefficient, and the sum iterates over all subsets of the data that exclude i. Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
  • 53. Practical Implications of Data Shapley KDD Tutorial / © 2020 IBM Corporation ▪ Computing the Shapley value of a datum requires 2^(|D|−1) summations ▪ In practice, we approximate Shapley values using Monte Carlo simulations ▪ The number of simulations required doesn’t scale with the amount of data Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
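The Monte Carlo approximation mentioned above can be sketched as follows: sample random permutations of the data and average each datum's marginal utility when it is added in permutation order. The toy `coverage_utility` below is a hypothetical stand-in for "performance of A trained on a subset", chosen so the result is easy to check; this is not the paper's truncated estimator:

```python
import random

def monte_carlo_shapley(data, utility, n_permutations=200):
    """Approximate Shapley values by averaging marginal contributions
    over random permutations (a sketch of permutation sampling)."""
    random.seed(0)  # deterministic for the demo
    values = [0.0] * len(data)
    for _ in range(n_permutations):
        order = list(range(len(data)))
        random.shuffle(order)
        subset, prev = [], utility([])
        for i in order:
            subset.append(data[i])
            cur = utility(subset)
            values[i] += cur - prev   # marginal contribution of datum i
            prev = cur
    return [v / n_permutations for v in values]

# Toy utility: fraction of the 3 distinct "useful" items covered, so the
# duplicated datum must share credit and earn a lower value.
def coverage_utility(subset):
    return len(set(subset)) / 3.0

data = ["a", "b", "c", "a"]           # "a" appears twice
vals = monte_carlo_shapley(data, coverage_utility)
print(vals)  # the two copies of "a" split credit; "b" and "c" score higher
```

The duplicate receiving lower value mirrors the slide's observation that noisier (or redundant) points receive smaller Shapley values.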
  • 54. Results using Data Shapley KDD Tutorial / © 2020 IBM Corporation The data points that have higher noise receive lesser Shapley values Source: Ghorbani et al, 2019. Data shapley: Equitable valuation of data for machine learning [GZ19]
  • 55. DVRL KDD Tutorial / © 2020 IBM Corporation Reward = Performance(selected data) – performance(complete data) Data valuation is a function from datum to its value Sample data based on valuation Build target model using selected data [YAP19] Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning
  • 56. KDD Tutorial / © 2020 IBM Corporation DVRL Results ▪ Removing high-valued points affects the model performance the most ▪ Removing low-valued points affects the model performance the least Source: Yoon et al, 2019. Data Valuation using Reinforcement Learning [YAP19]
  • 57. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 58. Data Homogeneity KDD Tutorial / © 2020 IBM Corporation Homogeneity Syntactic Distance matching Pattern matching Semantic Embedding matching Context matching We call data homogeneous if all the entries follow a single pattern
  • 59. Data Homogeneity KDD Tutorial / © 2020 IBM Corporation 12th Jan, 2020 Account Manager 12/01/2020 Acc. Manager 12th Jan, 2020 Account Manager 12/01/2020 Acc. Manager Homogeneous Homogeneous Inhomogeneous ➢ Two different date formats ➢ Two different values for the same category
  • 60. In-homogeneity affects ML pipeline KDD Tutorial / © 2020 IBM Corporation ▪ Adult Census dataset downloaded from Kaggle ▪ Task is to predict Income level (>50k/<=50k) given several attributes of a person Income level (Train) Income level (Test) <=50K <=50K. <=50K <=50K. >50K >50K. <=50K >50K. <=50K <=50K. <=50K <=50K. >50K <=50K. >50K >50K. 4 output neurons? Source: Image from Google Images
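The train/test label mismatch above ("<=50K" vs "<=50K.") is often fixable with a tiny canonicalisation step. A minimal sketch for this specific quirk (stripping whitespace and trailing periods is an assumption about the noise pattern, not a general-purpose cleaner):

```python
def normalize_label(label):
    """Canonicalise category strings so '<=50K' and '<=50K.' map to one
    class instead of inflating the label space to 4 output neurons."""
    return label.strip().rstrip(".")

raw = ["<=50K", "<=50K.", ">50K", ">50K."]
print(sorted({normalize_label(v) for v in raw}))  # ['<=50K', '>50K']: 2 classes, not 4
```

Applying the same normalisation to both train and test splits restores the intended binary task.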
  • 61. Inhomogeneity causes KDD Tutorial / © 2020 IBM Corporation When data is gathered by different people In the absence (or weak presence) of a data collection protocol When data is merged from different sources When data is stored in different formats (e.g. .csv, .xlsx) etc. Inhomogeneity
  • 62. Homogeneity score KDD Tutorial / © 2020 IBM Corporation Desiderata for measures that characterize homogeneity ▪ The fewer the clusters, the better the homogeneity ▪ The more skewed the distribution of clusters, the better the homogeneity Homogeneity( ) < Homogeneity( ) Homogeneity( ) < Homogeneity( ) [DKO+07] Source: Dai et al, 2006. Column heterogeneity as a measure of data quality
  • 63. Homogeneity score KDD Tutorial / © 2020 IBM Corporation Entropy: Entropy(p) = −Σ_x p(x) ln p(x) ▪ Satisfies both the desiderata ▪ Not bounded (takes values in ℝ₊) Simpson’s biodiversity index: P(C(x_i) ≠ C(x_j)) ▪ Satisfies both the desiderata ▪ Bounded in [0,1] Here p is the probability distribution of the entries among clusters, x_i and x_j are two entries, and C(·) denotes the cluster index [sim] Source: http://www.countrysideinfo.co.uk/simpsons.htm
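Both scores are a few lines of stdlib Python given the cluster sizes of a column. A minimal sketch (the date-string clusters are illustrative):

```python
import math
from collections import Counter

def entropy(cluster_sizes):
    """Shannon entropy of the cluster distribution: lower means more
    homogeneous; 0 when all entries fall into a single cluster."""
    n = sum(cluster_sizes)
    return -sum((c / n) * math.log(c / n) for c in cluster_sizes)

def simpson_index(cluster_sizes):
    """Probability that two randomly drawn entries fall in *different*
    clusters: bounded in [0, 1], 0 for a perfectly homogeneous column."""
    n = sum(cluster_sizes)
    return 1.0 - sum((c / n) ** 2 for c in cluster_sizes)

homogeneous = Counter(["12/01/2020"] * 10)
mixed = Counter(["12/01/2020"] * 5 + ["12th Jan, 2020"] * 5)
print(entropy(list(homogeneous.values())))   # 0.0: one cluster
print(entropy(list(mixed.values())))         # ln 2: two equal clusters
print(simpson_index(list(mixed.values())))   # 0.5
```

Note both measures satisfy the desiderata: merging clusters or skewing the distribution drives each score toward 0.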
  • 64. Syntactic Homogeneity KDD Tutorial / © 2020 IBM Corporation Distance based similarity. Edit distance (sequence-based distance measure): the minimum number of character insertions, deletions, and substitutions needed to transform x into y, computed with a dynamic-programming table; e.g., x = "dva", y = "dave": insert a (after d), then substitute the final a with e, so the distance is 2. Jaccard similarity (set-based distance measure) over character bigrams: J(x, y) = |B_x ∩ B_y| / |B_x ∪ B_y| ▪ E.g. x = dave, y = dav ▪ B_x = {#d, da, av, ve, e#}, B_y = {#d, da, av, v#} ▪ J(x, y) = 3/6 Source: Doan et al, 2012. Principles of Data Integration
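Both syntactic measures are short to implement. A minimal sketch reproducing the slide's examples (Levenshtein distance via dynamic programming, and Jaccard over '#'-padded character bigrams):

```python
def edit_distance(x, y):
    """Levenshtein distance via the classic dynamic-programming table,
    keeping only the previous row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cx != cy)))   # substitution
        prev = cur
    return prev[-1]

def bigram_jaccard(x, y):
    """Jaccard similarity over padded character bigrams, as on the slide."""
    bx = {("#" + x + "#")[i:i + 2] for i in range(len(x) + 1)}
    by = {("#" + y + "#")[i:i + 2] for i in range(len(y) + 1)}
    return len(bx & by) / len(bx | by)

print(edit_distance("dave", "dav"))   # 1: delete the trailing 'e'
print(bigram_jaccard("dave", "dav"))  # 0.5: |{#d, da, av}| / 6
```

Edit distance respects character order while the bigram set does not, which is why the two measures disagree on transposition-heavy strings.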
  • 65. KDD Tutorial / © 2020 IBM Corporation Syntactic Homogeneity Pattern based similarity (regex-based clustering) Ram Kumar. u{1}l{2} s{1} u{1}l{4} .{1} w s{1} w .{1} ROOT Lakshman Kumar. u{1}l{7} s{1} u{1}l{4} .{1} w s{1} w .{1} ROOT The larger the alphabet of the regex language, the more clusters are detected Source: Doan et al, 2012. Principles of Data Integration
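The signature idea above can be sketched by mapping each string to a run-length pattern over a small character-class alphabet (uppercase, lowercase, digit, space, punctuation). The helper names are hypothetical; real systems additionally build a hierarchy of coarser alphabets (e.g., a `w` word class, up to a ROOT that merges everything into one cluster):

```python
import itertools

def char_class(c):
    # Fine-grained alphabet: u = uppercase, l = lowercase, d = digit,
    # s = whitespace, . = everything else (punctuation).
    if c.isupper():
        return "u"
    if c.islower():
        return "l"
    if c.isdigit():
        return "d"
    if c.isspace():
        return "s"
    return "."

def pattern_signature(s):
    """Regex-style run-length signature, e.g. 'Ram Kumar.' ->
    'u{1}l{2}s{1}u{1}l{4}.{1}'. Entries sharing a signature cluster together."""
    return "".join("%s{%d}" % (k, len(list(g)))
                   for k, g in itertools.groupby(s, key=char_class))

names = ["Ram Kumar.", "Lakshman Kumar.", "Sita Devi."]
print({n: pattern_signature(n) for n in names})
```

With this fine alphabet the three names land in three clusters; collapsing `u`/`l` runs into a single word class would merge them, illustrating the slide's point that a larger alphabet yields more clusters.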
  • 66. Semantic Homogeneity KDD Tutorial / © 2020 IBM Corporation Embeddings based similarity ▪ Applicable when the entries in the data are meaningful English words ▪ Can use off-the-shelf embeddings like word2vec, GloVe etc. ▪ similarity(x, y) = cosine(φ(x), φ(y)), where φ(·) is the word embedding ▪ Embeddings capture semantic information as well [MCCD13] Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space
  • 67. KDD Tutorial / © 2020 IBM Corporation Semantic Homogeneity Context based similarity ▪ Distributional hypothesis: words that are used and occur in the same context tend to purport similar meanings ▪ In a structured dataset, word ≈ an entry in a row, context ≈ the remaining entries in the row ▪ We may not be able to use off-the-shelf embeddings ▪ Need to train dataset-specific embeddings ▪ Entries that have similar neighborhoods in their rows tend to show higher similarities ▪ Useful in cases where a high sim(employee, manager) is not desirable in a dataset Source: Mikolov et al, 2013. Efficient estimation of word representations in vector space [MCCD13]
  • 68. Data Quality Metrics We will cover the following topics: ▪ Data Cleaning ▪ Class Imbalance ▪ Label Noise ▪ Data Valuation ▪ Data Homogeneity ▪ Data Transformations KDD Tutorial / © 2020 IBM Corporation Source: https://www.analyticsinsight.net/data-literacy-helping-enterprises-lead-with-data-through-challenging-times/
  • 69. Data Transformation ▪ The goal is to allow the business user to transform provided data (heterogeneous) into user intended format (homogeneous), by showing a few examples of expected output for the given input samples. KDD Tutorial / © 2020 IBM Corporation
  • 70. Data Transformation Examples KDD Tutorial / © 2020 IBM Corporation
  • 71. Data Transformation Using PbE Systems Programming-by-Example systems capture user intent using sample input-output example pairs and learn transformation programs that convert inputs into their corresponding outputs. KDD Tutorial / © 2020 IBM Corporation User provided input- output example pairs Program space expressed using a Domain-specific Language (DSL) Decode/Search for Transformation Programs Transformation Program Fig 2. Generic process across all Programming-by-Example frameworks
  • 72. Domain-specific Language DSL of a PbE system consists of a finite set of atomic functions/operators that is used to solve domain-specific tasks and formally represents the transformation program for the user to interpret. KDD Tutorial / © 2020 IBM Corporation Source: Bunel et al, 2018. Karel DSL to guide an agent inside a grid world Source: Gulwani et al, 2011. Flash-fill DSL for string to string transformation
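A toy illustration of the PbE loop described above: a four-operator DSL and a brute-force enumerative search over operator compositions, returning the first program consistent with every input-output example. The DSL operators here are hypothetical stand-ins, not FlashFill's actual operator set:

```python
import itertools
import re

# A toy DSL of atomic string operators (illustrative, not FlashFill's).
DSL = {
    "lower":  str.lower,
    "upper":  str.upper,
    "digits": lambda s: "".join(re.findall(r"\d", s)),
    "first":  lambda s: s.split()[0] if s.split() else "",
}

def synthesize(examples, max_depth=2):
    """Enumerate compositions of DSL operators, shortest first, and
    return the first program consistent with all examples."""
    names = list(DSL)
    for depth in range(1, max_depth + 1):
        for prog in itertools.product(names, repeat=depth):
            def run(s, prog=prog):
                for op in prog:
                    s = DSL[op](s)
                return s
            if all(run(i) == o for i, o in examples):
                return prog
    return None

examples = [("Jacob Devlin", "JACOB"), ("Sumit Gulwani", "SUMIT")]
program = synthesize(examples)   # -> ("upper", "first")
```

Even in this toy setting the search space grows exponentially with program depth, which is the scalability problem the symbolic/statistical/hybrid systems on the following slides attack in different ways.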
  • 73. What Are The Different Types Of Programming-by-Example Frameworks? KDD Tutorial / © 2020 IBM Corporation Symbolic ❑ Designed to produce correct programs by construction using logical reasoning and a domain-specific language ❑ Produces the intended program with few input-output examples ❑ Requires significant engineering effort and an underlying program search method ❑ Examples: PROSE Statistical ❑ Use of neural networks to decode transformations directly from encoded input-output string pairs ❑ The generated program may not satisfy the provided specification ❑ The programs are generated using beam search ❑ Requires a lot of training data, generated randomly, which might result in poor generalizability on real-world tasks ❑ Examples: Robust-Fill, NPBE, Deep-Coder Hybrid ❑ Uses statistical methods to smartly select sub-programs for further exploration; requires additional effort to incorporate branching decisions ❑ No way to retract a wrong decision made during exploration; we might eliminate the correct program in the process ❑ Examples: NGDS, PL meets ML
  • 74. FlashMeta – A Framework for Inductive Program Synthesis (2015) KDD Tutorial / © 2020 IBM Corporation ▪ Consider the transformation “(425) 706-7709” --> “425”. ▪ The output string “425” can be represented in 4 possible ways as concatenations of the substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’. ▪ Leaf nodes e1, e2, e3, e12, e23 are VSAs for the output substrings ‘4’, ‘2’, ‘5’, ‘42’, ‘25’ respectively. Key aspects of symbolic systems: ▪ Independence of sub-problems ▪ Number of possible sub-problems is quadratic ▪ Enumerative search Drawback: takes a lot of time to look for correct programs in the transformation space Source: Polozov et al, 2015. FlashMeta – A Framework for Inductive Program Synthesis [PG15]
  • 75. Hybrid Systems KDD Tutorial / © 2020 IBM Corporation To overcome the issues of symbolic models, hybrid systems have been proposed which ▪ use statistical models to generate a probability distribution over DSL operators at each step of the partially-constructed AST ▪ use the most likely ones to guide deductive search for generating transformation programs. Program Search: the system uses deductive search to look for programs that satisfy the input-output transformation for all user-provided examples. Program Ranking: a ranking function is used to pick the top programs from the set of valid programs based on properties like generalizability, simplicity etc. Caveat: there is no way to go back when wrong decisions are made; we might end up with a sub-optimal program or eliminate valid programs completely
  • 76. Statistical System: RobustFill KDD Tutorial / © 2020 IBM Corporation ▪ A DL approach based on seq-to-seq encoder-decoder networks which performs as well as conventional non-DL PbE approaches like FlashFill, with the added motivation of making the learned model more robust to noise in the input data. Domain-Specific Language Source: Devlin et al, 2017. RobustFill [DUB+17]
  • 77. Statistical System: RobustFill KDD Tutorial / © 2020 IBM Corporation Key aspects: ▪ Use of neural networks with no symbolic guidance, unlike hybrid systems ▪ Use of beam search to decode/predict the top-k most likely transformation programs based on the joint probability of the decoded DSL operators ▪ Use of synthetic datasets for model training, which might lead to poor generalizability on real-world tasks ▪ Caveat: no guarantee that the synthesized program will compile or satisfy the user-provided specification [DUB+17]
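The beam-search decoding step can be sketched with hand-written next-token log-probabilities standing in for the decoder network's output distribution (the DSL tokens and probabilities below are made up for illustration):

```python
from math import log

# Toy next-token log-probabilities conditioned on the previous token,
# a stand-in for the decoder network's predicted distribution.
LOGPROB = {
    "<s>":    {"Concat": log(0.6), "Const": log(0.4)},
    "Concat": {"SubStr": log(0.7), "Const": log(0.3)},
    "Const":  {"<e>": log(1.0)},
    "SubStr": {"<e>": log(0.9), "Const": log(0.1)},
}

def beam_search(beam_width=2, max_len=4):
    beams = [(["<s>"], 0.0)]  # (token sequence, joint log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "<e>":
                candidates.append((seq, score))  # finished hypothesis
                continue
            for tok, lp in LOGPROB[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # keep only the k highest-scoring (partial) programs
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

Note that the top hypothesis here is not the greedy one: greedy decoding would commit to "Concat" at the first step, while beam search keeps the "Const" hypothesis alive and ranks it higher on joint probability.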
  • 78. Statistical System: KDD Tutorial / © 2020 IBM Corporation Solution: Execution-guided synthesis ▪ During beam search, execute the partially constructed transformation program on the go and eliminate candidates which produce syntax errors or empty outputs ▪ Restrict the DSL vocabulary at each step to enforce grammatical correctness ▪ Incorporate execution guidance during training and testing so that program correctness is guaranteed by construction. Source: Wang et al, 2018. Execution-Guided Neural Program Synthesis
  • 79. Data Quality Metrics for Unstructured Text Data KDD Tutorial / © 2020 IBM Corporation
  • 80. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 81. What is Unstructured Text? KDD Tutorial / © 2020 IBM Corporation Html/Xml Documents Tweets Ticket Data Email Product Reviews
  • 82. How to Assess Quality of Text? Different properties of text can be measured to quantify its quality! ▪ Lexical Properties - Vocabulary, Misspellings etc. ▪ Syntactic Properties - Grammatical Relations, Syntax Violations etc. ▪ Semantic Properties - Text Meaningfulness, Text Complexity etc. ▪ Discourse Properties - Readability, Text Coherence etc. ▪ Distributional Properties - Outliers, Topics, Embeddings KDD Tutorial / © 2020 IBM Corporation
  • 83. How to Assess Quality of Text? The different approaches can be broadly classified as ▪ Generic text quality ▪ Task specific KDD Tutorial / © 2020 IBM Corporation Generic Text Quality Approaches: ▪ Text Readability ▪ Text Coherence ▪ Text Formality ▪ Text Ambiguity ▪ Text Outliers ▪ Lexical Diversity ▪ Text Cleanliness ▪ Text Appropriateness ▪ Text Complexity ▪ Text Bias Task Specific Approaches: ▪ Text Classification ▪ Text Generation ▪ Semantic Similarity ▪ Dialog Systems ▪ Label Noise
  • 84. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 85. Automatic Text Scoring Using Neural Networks Objective: Automatically assign a score to a text given a marking scale. Approach: 1. Train a model to learn word embeddings that capture both context and word-specific scores 2. Train an LSTM-based network on the score-specific word embeddings to output the text score Task: Predict the scores of essays written by students from Grade 7 to Grade 10 KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 86. Automatic Text Scoring Using Neural Networks Score-Specific Word Embeddings: ▪ Learn both context and word specific scores during training. ▪ Context: ▪ For a sequence of words S = (w1,…, wt,…, wn), construct S´ = (w1,…, wc,…, wn | wc ∈ 𝕍, wt ≠ wc) ▪ The model learns word embeddings by ranking the activation of the true sequence S higher than the activation of its ‘noisy’ counterpart S´ through a hinge margin loss. ▪ Word Specific Scores: ▪ Apply linear regression on the sequence embedding to predict the essay score ▪ Minimize the mean squared error between the predicted ŷ and the actual essay score y KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 87. Automatic Text Scoring Using Neural Networks KDD Tutorial / © 2020 IBM Corporation Comparison: ▪ Standard Embeddings only consider context and hence place correct and incorrect spellings close to each other. ▪ Score-Specific Embeddings discriminate between spelling mistakes based on low and high essay scores, thus separating them in the embedding space while preserving the context [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 88. Automatic Text Scoring Using Neural Networks Explainability: ▪ Which words in the input text correspond to a high score vs a low score? ▪ Perform a single pass of an essay from left to right and let the LSTM make its score prediction ▪ Provide a pseudo-high score and a pseudo-low score as ground truth and measure the gradient magnitudes corresponding to each word KDD Tutorial / © 2020 IBM Corporation [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 89. Automatic Text Scoring Using Neural Networks KDD Tutorial / © 2020 IBM Corporation Visualizations: ▪ Positive feedback for correctly placed punctuation or long-distance dependencies (Sentence 6 “are … researching”) ▪ Negative feedback for POS mistakes (Sentence 2, “Being patience”) ▪ Incorrect negative feedback (Sentence 3, “reader satisfied”) [AYR16] Source : Alikaniotis et al, 2016. Automatic text scoring using neural networks
  • 90. Transparent text quality assessment with convolutional neural networks Objective: Automatically assign a score to the input Swedish text Approach: 1. Train a CNN with residual connections on character embeddings 2. Utilize a pseudo-probabilistic framework to incorporate axioms on varying quality text vs high quality text to train the model Task: 1. Predict the Essay scores and compare against human graders 2. Qualitatively Analyze text to identify low vs high scoring regions KDD Tutorial / © 2020 IBM Corporation Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks [ÖG17]
  • 91. Transparent text quality assessment with convolutional neural networks Training: ▪ For a sequence of characters S = (s1, s2,…, sn), construct the character embedding matrix of size 𝕍 × d ▪ Pass the character embedding matrix through the CNN ▪ The final quality score is the dot product of the output weight vector and the mean of the outputs at the final residual layer ▪ Utilize the pseudo-probabilistic framework P(a > b) = σ(q(a) − q(b)), where σ(x) is the logistic function and q(x) is the network score KDD Tutorial / © 2020 IBM Corporation [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
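The pairwise objective is tiny to write down; a sketch, with q(a) and q(b) as placeholder network scores rather than outputs of an actual CNN:

```python
from math import exp

def sigma(x):
    # logistic function
    return 1.0 / (1.0 + exp(-x))

def p_better(q_a, q_b):
    """P(a > b) = sigma(q(a) - q(b)) for network quality scores q."""
    return sigma(q_a - q_b)
```

Equal scores give P(a > b) = 0.5, and P(a > b) + P(b > a) = 1, which is what lets the axioms on the next slide (professional text preferred over blog text, etc.) be expressed as pairwise training targets.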
  • 92. Transparent text quality assessment with convolutional neural networks KDD Tutorial / © 2020 IBM Corporation Axioms: 1. All authors (professional or not) are consistent 2. Professional authors are better than blog authors 3. All professional authors are equal 4. Blog authors are not equal [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
  • 93. Transparent text quality assessment with convolutional neural networks Explainability: ▪ Which characters in the input text correspond to a high score vs a low score? ▪ Red encodes low scores, blue encodes high scores. Faded colors are used for scores near zero. ▪ News examples are scored higher than the blog examples ▪ Informal spellings, use of smileys leads to low scores in the blog text KDD Tutorial / © 2020 IBM Corporation [ÖG17] Source : Östling et al, 2017. Transparent text quality assessment with convolutional neural networks
  • 94. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 95. Outlier Detection An outlier in tabular or timeseries data can be interpreted as ▪ A deviating point ▪ Noise ▪ A rarely occurring point What can be an outlier in text data? ▪ A topically diverse sample? ▪ Gibberish text? ▪ Meaningless sentences? ▪ Samples from another language? KDD Tutorial / © 2020 IBM Corporation Image credits: Google Images
  • 96. Outlier Examples in Text Data ▪ Repetitive: greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop greatest show ever mad full stop ▪ Incomprehensible: lived let idea heck bear walk never heard whole years really funny beginning went hill quickly ▪ Incomplete: Suspenseful subtle much much disturbing KDD Tutorial / © 2020 IBM Corporation Outlier detection in text data: classical techniques, matrix factorization techniques, and DL techniques (self-attention techniques for text representation)
  • 97. Outlier Detection for Text Data ▪ Feature generation – simple bag-of-words approach ▪ Apply matrix factorization – decompose the given term-document matrix into a low-rank matrix L and an outlier matrix Z ▪ L is further expressed as a product of low-rank factors ▪ The ℓ2 norm of a particular column z_x of Z serves as the outlier score for document x KDD Tutorial / © 2020 IBM Corporation Source : Kannan et al, 2017. Outlier detection for text data [KWAP17]
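A drastically simplified, pure-Python stand-in for this idea: instead of the paper's low-rank-plus-outlier factorization, we take a rank-1 approximation of the bag-of-words matrix via power iteration and score each document by the norm of its residual row (all data below is toy):

```python
from collections import Counter
from math import sqrt

docs = [
    "good movie great acting",
    "great movie good story",
    "good story great movie",
    "qwk zzv qwk zzv",        # gibberish outlier
]

# Bag-of-words document-term matrix
vocab = sorted({w for d in docs for w in d.split()})
rows = [[Counter(d.split())[t] for t in vocab] for d in docs]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Dominant term direction v via power iteration on A^T A,
# giving the rank-1 approximation L = (A v) v^T.
v = [1.0] * len(vocab)
for _ in range(50):
    w = [sum(r[j] * dot(r, v) for r in rows) for j in range(len(vocab))]
    n = sqrt(dot(w, w))
    v = [x / n for x in w]

# Outlier score of each document = norm of its residual row of (A - L)
scores = []
for r in rows:
    proj = dot(r, v)
    resid = [a - proj * b for a, b in zip(r, v)]
    scores.append(sqrt(dot(resid, resid)))
```

The gibberish document lies outside the dominant topic direction, so almost none of it is captured by the low-rank part and its residual norm, i.e. its outlier score, is by far the largest.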
  • 98. Outlier Detection for Text Data ▪ For the experiments, outliers are sampled from a unique class of standard datasets ▪ Receiver operating characteristic curves are studied to assess the performance of the proposed approach ▪ The approach is useful for identifying outliers even from regular classes ▪ Patterns such as unusually short/long documents, unique words, repetitive vocabulary etc. were observed in the detected outliers KDD Tutorial / © 2020 IBM Corporation [KWAP17] Source : Kannan et al, 2017. Outlier detection for text data
  • 99. Unsupervised Anomaly Detection in Text Multi-Head Self-Attention: KDD Tutorial / © 2020 IBM Corporation Sentence Embedding Attention Matrix ▪ Proposes a novel one-class classification method which leverages pretrained word embeddings to perform anomaly detection on text ▪ Given a word embedding matrix H, multi-head self-attention is used to map sentences of variable length to a collection of fixed-size representations, representing a sentence with respect to multiple contexts Source : Ruff et al, 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text [RZV+19]
  • 100. Unsupervised Anomaly Detection on Text KDD Tutorial / © 2020 IBM Corporation CVDD Objective Context Vector Orthogonality Constraint ▪ These sentence representations are trained along with a collection of context vectors such that the context vectors and representations are similar, while keeping the context vectors diverse ▪ The greater the distance of m_k(H) from c_k, the more anomalous the sample w.r.t. context k Outlier Scoring [RZV+19] Source : Ruff et al, 2019. Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
  • 101. More Generic Text Quality Approaches Local Text Coherence: ▪ [MS18] Mesgar et al. "A neural local coherence model for text quality assessment." EMNLP, 2018. Text Formality: ▪ [JMAS19] Jain et al. "Unsupervised controllable text formalization." AAAI, 2019. Text Bias: ▪ [RDNMJ13] Recasens et al. "Linguistic models for analyzing and detecting biased language." ACL, 2013. KDD Tutorial / © 2020 IBM Corporation
  • 102. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 103. Identifying Non-Obvious Cases in Semantic Similarity Datasets Objective: ▪ Identify obvious and non-obvious text pairs in semantic similarity datasets Approach: 1. Compare the Jensen-Shannon divergence (JSD) of each text pair against the median JSD of the entire distribution 2. Classify the text pairs based on their divergence from the median Task: ▪ Analyze model performance on SemEval Community Question Answering for both obvious and non-obvious cases KDD Tutorial / © 2020 IBM Corporation Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets [PLN19]
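A sketch of the JSD computation over unigram distributions, with a hypothetical `is_obvious` helper implementing the median split (the helper's name and threshold handling are ours, not the paper's):

```python
from collections import Counter
from math import log2

def distribution(text):
    # unigram probability distribution over the tokens of a text
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    m = {w: 0.5 * (p.get(w, 0) + q.get(w, 0)) for w in set(p) | set(q)}
    def kl(a, b):
        return sum(pa * log2(pa / b[w]) for w, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def is_obvious(text1, text2, label, median_jsd):
    """Heuristic split: lexically similar positives and lexically
    divergent negatives are the 'obvious' cases."""
    d = jsd(distribution(text1), distribution(text2))
    return (label == 1 and d < median_jsd) or (label == 0 and d >= median_jsd)
```

Identical texts give a JSD of 0 and fully disjoint texts give 1 (in base 2), so the median of the pairwise JSDs is a natural dataset-specific split point.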
  • 104. Identifying Non-Obvious Cases in Semantic Similarity Datasets KDD Tutorial / © 2020 IBM Corporation Lexical divergence distribution by labels across datasets: ▪ The datasets differ with respect to the degree of lexical divergence they contain ▪ Pairs with negative labels tend to have higher JSD scores than pairs with positive labels ▪ Model performance (TPR/TNR/F1) should be reported separately for obvious and non-obvious cases, since models typically score lower on the non-obvious ones Examples of Obvious and Non-Obvious Semantic Text Pairs: [PLN19] Source : Peinelt et al 2019. Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
  • 105. Improving Neural Conversational Models with Entropy-Based Data Filtering Objective: ▪ Filter generic utterances from dialog datasets to improve model performance Approach: ▪ Devise a data filtering method that identifies open-ended inputs by calculating the entropy of the distribution over target utterances Task: ▪ Measure custom metric scores such as response length, KL divergence, coherence, BLEU score etc. on standard dialog datasets under source/target/both filtering mechanisms KDD Tutorial / © 2020 IBM Corporation Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering [CPR19]
  • 106. Improving Neural Conversational Models with Entropy-Based Data Filtering KDD Tutorial / © 2020 IBM Corporation Examples: 1. Source = “What did you do today?” • has high entropy, since it is paired with many different responses in the data 2. Source = “What color is the sky?” • has low entropy since it’s observed with few responses in the data 3. Target = “I don’t know” • has high entropy, since it is paired with many different inputs in the data 4. Target = “Try not to take on more than you can handle” • has low entropy, since it is paired with few inputs in the data [CPR19] Source: Csaky et al, 2019. Improving neural conversational models with entropy-based data filtering
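The entropy computation behind these examples is straightforward; a sketch over a toy set of source-target pairs mirroring the slide (the threshold value is ours, for illustration):

```python
from collections import Counter
from math import log2

pairs = [
    ("what did you do today", "went hiking"),
    ("what did you do today", "nothing much"),
    ("what did you do today", "watched a movie"),
    ("what color is the sky", "blue"),
    ("what color is the sky", "blue"),
    ("what color is the sky", "blue"),
]

def source_entropy(pairs):
    """Entropy of the target distribution observed for each source utterance."""
    by_source = {}
    for src, tgt in pairs:
        by_source.setdefault(src, []).append(tgt)
    ent = {}
    for src, targets in by_source.items():
        counts = Counter(targets)
        n = len(targets)
        ent[src] = -sum(c / n * log2(c / n) for c in counts.values())
    return ent

def filter_open_ended(pairs, threshold=1.0):
    # drop pairs whose source is open-ended (high target entropy)
    ent = source_entropy(pairs)
    return [(s, t) for s, t in pairs if ent[s] < threshold]
```

"what did you do today" has three distinct responses (entropy log2 3 ≈ 1.58) and gets filtered, while "what color is the sky" always maps to "blue" (entropy 0) and is kept.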
  • 107. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 108. Understanding the Difficulty of Text Classification Tasks Proposes an approach to design a data quality metric which explains the complexity of the given data for a classification task ▪ Considers various data characteristics to generate a 48-dim feature vector for each dataset ▪ Data characteristics include ▪ Class Diversity: count-based probability distribution of the classes in the dataset ▪ Class Imbalance: Σ_{c=1}^{C} | 1/C − n_c/T_data | ▪ Class Interference: similarities among samples belonging to different classes ▪ Data Complexity: linguistic properties of the data samples KDD Tutorial / © 2020 IBM Corporation Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks [CRZ18] The feature vector for a given dataset covers quality properties such as ▪ Class Diversity (2-dim) ▪ Shannon Class Diversity ▪ Shannon Class Equitability ▪ Class Imbalance (1-dim) ▪ Class Interference (24-dim) ▪ Hellinger Similarity ▪ Top N-Gram Interference ▪ Mutual Information ▪ Data Complexity (21-dim) ▪ Distinct n-gram : Total n-gram ▪ Inverse Flesch Reading Ease ▪ N-Gram and Character Diversity
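The class imbalance and class diversity features are easy to reproduce; a sketch using the formula above, with C classes, per-class counts n_c, and T as the dataset size:

```python
from collections import Counter
from math import log2

def class_imbalance(labels):
    """Sum over classes of |1/C - n_c/T|; 0 for perfectly balanced data."""
    counts = Counter(labels)
    C, T = len(counts), len(labels)
    return sum(abs(1 / C - n / T) for n in counts.values())

def shannon_class_diversity(labels):
    """Shannon entropy of the class distribution (in bits)."""
    counts = Counter(labels)
    T = len(labels)
    return -sum(n / T * log2(n / T) for n in counts.values())
```

A perfectly balanced binary dataset scores 0 imbalance and 1 bit of class diversity; a 9:1 split scores 0.8 imbalance, matching the intuition that skewed datasets are harder.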
  • 109. Understanding the Difficulty of Text Classification Tasks ▪ The authors propose using genetic algorithms to intelligently explore the 2^48 possible feature combinations ▪ The fitness function for the genetic algorithm was the Pearson correlation between the difficulty score and model performance on the test set ▪ 89 datasets were considered for evaluation, with 12 different types of models on each dataset ▪ The effectiveness of a given combination of metrics is measured by its correlation with the performance of various models on various datasets ▪ The stronger the negative correlation of a metric with model performance, the better the metric explains data complexity KDD Tutorial / © 2020 IBM Corporation [CRZ18] Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
  • 110. Understanding the Difficulty of Text Classification Tasks KDD Tutorial / © 2020 IBM Corporation Difficulty Measure D2 = Distinct Unigrams : Total Unigrams + Class Imbalance + Class Diversity + Maximum Unigram Hellinger Similarity + Unigram Mutual Info. Correlation = −0.8814 [CRZ18] Source : Collins et al, 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks
  • 111. Outline KDD Tutorial / © 2020 IBM Corporation ▪ Motivation ▪ What is Unstructured Text? ▪ How to Assess Text Quality? ▪ Prior-Art Approaches ▪ Generic Text Quality Approaches ▪ Text Quality Metrics for Outlier Detection ▪ Text Quality Metrics for Corpus Filtering ▪ Quality Metrics for Text Classification ▪ Future Directions ▪ Next Steps
  • 112. Next Steps KDD Tutorial / © 2020 IBM Corporation How to assess overall text quality? ▪ A framework that allows the user to assess text quality across various dimensions ▪ A standardized set of quality metrics, each outputting a score indicating low/high quality ▪ Insights into the specific samples in the dataset which contribute to a low/high score ▪ Specific recommendations for addressing poor text quality, with evidence of model performance improvement
  • 113. Next Steps KDD Tutorial / © 2020 IBM Corporation How to assess quality of NLP models independent of task? ▪ Held-out datasets are often not comprehensive ▪ Contain the same biases as the training data ▪ Real-world performance may be overestimated ▪ Difficult to figure out where model is failing and how to fix it [RWGS20] Source : Ribeiro et al, 2020. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  • 114. Human In Loop Techniques for Data Quality Assessments KDD Tutorial / © 2020 IBM Corporation
  • 115. Humans in the Loop ▪ Human intervention in data quality analysis ▪ Validation and feedback ▪ Possible enhancements KDD Tutorial / © 2020 IBM Corporation ▪ Data readiness assessment ▪ Metrics for structured data ▪ Metrics for unstructured data
  • 116. Human Input in the AI Lifecycle KDD Tutorial / © 2020 IBM Corporation
  • 117. Supporting Data Operations Keywords: ▪ Preparation ▪ Integration ▪ Cleaning ▪ Curation ▪ Transformation ▪ Manipulation Notable tools: ▪ Trifacta ~ Potter’s Wheel ▪ OpenRefine ▪ Wrangler ➢ How to validate that data is clean enough ➢ How to delegate sub-tasks between respective role players / data workers ➢ How to enable reuse of human input and codifying tacit knowledge ➢ How to provide cognitive and not just statistical support KDD Tutorial / © 2020 IBM Corporation [Ham13] [KPHH11] [RH01]
  • 118. Example: Label Noise Remediation Identify noisy samples • Check target columns/attributes to detect suspected noisy or anomalous labels • E.g. a target column with entries (A, B, Label): (Abc, def, 1), (Abc, def, 2), (Abc, def, 1), (Abc, def, 1) Visual preview of noisy data • Sampling to show a presentable number of records in batches > Show suspected noisy class labels and predicted labels in the preview, e.g. (A, B, Label, Predicted): (Abc, def, 1, 1), (Abc, def, 2, 1), (Abc, def, 1, 1), (Abc, def, 1, 1) Remediation input from HIL • Review • Correct • Approve > HIL reviews whether a row is noisy, verifies whether the predicted label is correct, and additionally marks the corresponding pertinent columns Document and preserve • Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge > In future, better prediction using pertinent columns and user actions
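The first step, flagging suspect labels among identical feature rows, can be sketched as a majority-vote check mirroring the (Abc, def) example above (the helper name and the use of majority vote as the "predicted" label are ours, for illustration):

```python
from collections import Counter

rows = [
    ("Abc", "def", 1),
    ("Abc", "def", 2),   # suspect: disagrees with its duplicates
    ("Abc", "def", 1),
    ("Abc", "def", 1),
]

def suspect_noisy(rows):
    """Flag rows whose label disagrees with the majority label of
    identical feature rows; propose the majority label as predicted."""
    groups = {}
    for *features, label in rows:
        groups.setdefault(tuple(features), Counter())[label] += 1
    flagged = []
    for i, (*features, label) in enumerate(rows):
        majority = groups[tuple(features)].most_common(1)[0][0]
        if label != majority:
            flagged.append((i, label, majority))
    return flagged
```

The flagged (row index, given label, predicted label) triples are exactly what the visual preview step would surface to the HIL reviewer for approval or correction.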
  • 119. Example: Data Homogeneity and Transformation Identify heterogeneous samples • Check target columns/attributes to detect variation in sample data • E.g. a target column with the entries: <=50K, <=50K., >50K, >50K. • E.g. a gender column with the entries: male, M, female, F Visual preview of problematic data • Sampling to show a presentable number • Give explanation • Viewing options to reduce human effort > Show unique class labels and ask for correction/verification > Show distinct formats and ask for correction/verification Transformation input from HIL • Review • Correct • Approve > HIL sees the anomaly in the ‘.’ appearing at the end and provides the corresponding transformation > Rectify heterogeneous data formats and specify a rule for the transformation Document and preserve • Learn operations for future recommendations to reduce HIL effort and enable reuse/transfer of knowledge > In future, suggest remedial actions > Suggest possible remediations or rules
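The remediations in this example boil down to small, HIL-confirmed normalization rules; a sketch (the rule table and function names are illustrative, of the kind a reviewer might approve and the system might preserve for reuse):

```python
# Hand-written normalization rules a HIL reviewer might confirm:
# strip the trailing '.' and map surface variants to one canonical form.
RULES = {
    "gender": {"male": "M", "m": "M", "female": "F", "f": "F"},
}

def normalize_label(value):
    # '<=50K.' -> '<=50K', '>50K.' -> '>50K'
    return value.rstrip(".")

def normalize_gender(value):
    key = value.strip().lower()
    return RULES["gender"].get(key, value)  # leave unknown values untouched
```

Once documented, such rules are exactly the "remediations or rules" the system can suggest automatically the next time the same heterogeneity pattern appears.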
  • 120. KDD Tutorial / © 2020 IBM Corporation Next Steps… Shared collaborative workspace to streamline, regulate and document human input/expertise; focus on reproducibility, traceability, and verifiability Easy-to-use specification interface: e.g. graphical, by example, or in natural language instead of complex grammars or regular expressions Visual preview: appropriate visualization of relevant data chunks to see the changes Minimum latency: the effect of a transformation should be quickly visible Clear lineage: ability to view what changes have been done previously and by whom; documentation of operations and observed effects Enable assessment of data operations: basic analysis using default ML models to show the impact of changes Data versioning: to enable experiments for trial and testing, e.g. the ability to undo changes Streamlined collaboration: task lists based on persona for a shared data-cleaning pipeline [MLW+19] [KHFW16] [ROE+19]
  • 121. Summary ▪ Addressing data quality issues before data enters an ML pipeline allows taking remedial actions that reduce model building effort and turn-around time. ▪ Measuring data quality in a systematic and objective manner, for example through standardized metrics, can improve its reliability and permit informed reuse. ▪ There are distinct metrics to quantify data issues and anomalies depending on whether the data is structured or unstructured. ▪ Various data workers or personas are involved in the data quality assessment pipeline depending on the level and type of human intervention required. ▪ Streamlining the collaboration between standardized metrics and human-in-the-loop inputs can serve as a powerful framework to integrate and optimize the data quality assessment process. KDD Tutorial / © 2020 IBM Corporation
  • 122. We invite you to join us on this agenda towards a data-centric approach to analyzing data quality. Contact Hima Patel with your ideas and enquiries. KDD Tutorial / © 2020 IBM Corporation
  • 123. References KDD Tutorial / © 2020 IBM Corporation [BF99] Carla E Brodley and Mark A Friedl. Identifying mislabeled training data. Journal of Artificial Intelligence Research, 11:131–167, 1999. [ZWC03] Xingquan Zhu, Xindong Wu, and Qijun Chen. Eliminating class noise in large datasets. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 920–927, 2003. [DLG17] Junhua Ding, XinChuan Li, and Venkat N Gudivada. Augmentation and evaluation of training data for deep learning. In 2017 IEEE International Conference on Big Data (Big Data), pages 2603–2611. IEEE, 2017. [ARK18] Mohammed Al-Rawi and Dimosthenis Karatzas. On the labeling correctness in computer vision datasets. In IAL@PKDD/ECML, 2018. [NJC19] Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. Confident learning: Estimating uncertainty in dataset labels. arXiv preprint arXiv:1911.00068, 2019. [Coo77] R Dennis Cook. Detection of influential observation in linear regression. Technometrics, 1977. [EGH17] Rajmadhan Ekambaram, Dmitry B Goldgof, and Lawrence O Hall. Finding label noise examples in large scale datasets. In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 2420–2424, 2017.
  • 124. References KDD Tutorial / © 2020 IBM Corporation [GZ19] Amirata Ghorbani and James Zou. Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868, 2019. [YAP19] Jinsung Yoon, Sercan O Arik, and Tomas Pfister. Data valuation using reinforcement learning. arXiv preprint arXiv:1909.11671, 2019. [DKO+07] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, and Suresh Venkatasubramanian. Column heterogeneity as a measure of data quality. 2007. [sim] http://www.countrysideinfo.co.uk/simpsons.htm. [DHI12] AnHai Doan, Alon Halevy, and Zachary Ives. Principles of data integration. Elsevier, 2012. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. [ALI15] Aida Ali, Siti Mariyam Hj. Shamsuddin, and Anca L. Ralescu. Classification with class imbalance problem: A review. SOCO, 2015.
  • 125. References KDD Tutorial / © 2020 IBM Corporation [PG15] Oleksandr Polozov and Sumit Gulwani. FlashMeta: a framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 107–126, 2015. [DUB+17] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 990–998, 2017. [KMP+18] Ashwin Kalyan, Abhishek Mohta, Oleksandr Polozov, Dhruv Batra, Prateek Jain, and Sumit Gulwani. Neural-guided deductive search for real-time program synthesis from examples. arXiv preprint arXiv:1804.01186, 2018. [AYR16] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. Automatic text scoring using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 715–725, Berlin, Germany, August 2016. Association for Computational Linguistics. [CPR19] Richard Csaky, Patrik Purgai, and Gabor Recski. Improving neural conversational models with entropy-based data filtering. arXiv preprint arXiv:1905.05471, 2019.
  • 126. References KDD Tutorial / © 2020 IBM Corporation [CRZ18] Edward Collins, Nikolai Rozanov, and Bingbing Zhang. Evolutionary data measures: Understanding the difficulty of text classification tasks. arXiv preprint arXiv:1811.01910, 2018. [JMAS19] Parag Jain, Abhijit Mishra, Amar Prakash Azad, and Karthik Sankaranarayanan. Unsupervised controllable text formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6554–6561, 2019. [KWAP17] Ramakrishnan Kannan, Hyenkyun Woo, Charu C Aggarwal, and Haesun Park. Outlier detection for text data. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 489–497. SIAM, 2017. [MS18] Mohsen Mesgar and Michael Strube. A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4328–4339, 2018. [ÖG17] Robert Östling and Gintarė Grigonytė. Transparent text quality assessment with convolutional neural networks. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 282–286, 2017.
  • 127. References
[PLN19] Nicole Peinelt, Maria Liakata, and Dong Nguyen. Aiming beyond the obvious: Identifying non-obvious cases in semantic similarity datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2792–2798, 2019.
[RDNMJ13] Marta Recasens, Cristian Danescu-Niculescu-Mizil, and Dan Jurafsky. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1650–1659, 2013.
[RWGS20] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. arXiv preprint arXiv:2005.04118, 2020.
[RZV+19] Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, and Marius Kloft. Self-attentive, multi-context one-class classification for unsupervised anomaly detection on text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4061–4071, 2019.
[RH01] Vijayshankar Raman and Joseph M Hellerstein. Potter's Wheel: an interactive data cleaning system. In VLDB, volume 1, pages 381–390, 2001.
[ROE+19] El Kindi Rezig, Mourad Ouzzani, Ahmed K Elmagarmid, Walid G Aref, and Michael Stonebraker. Towards an end-to-end human-centric data cleaning framework. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–7, 2019.
  • 128. References
[Ham13] Kelli Ham. OpenRefine (version 2.5). http://openrefine.org. Free, open-source tool for cleaning and transforming data. Journal of the Medical Library Association: JMLA, 101(3):233, 2013.
[KHFW16] Sanjay Krishnan, Daniel Haas, Michael J Franklin, and Eugene Wu. Towards reliable interactive data cleaning: A user survey and recommendations. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–5, 2016.
[KPHH11] Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer. Wrangler: Interactive visual specification of data transformation scripts. In ACM Human Factors in Computing Systems (CHI), 2011.
[MLW+19] Michael Muller, Ingrid Lange, Dakuo Wang, David Piorkowski, Jason Tsay, Q Vera Liao, Casey Dugan, and Thomas Erickson. How data science workers work with data: Discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2019.
[BTR15] Paula Branco, Luis Torgo, and Rita Ribeiro. A survey of predictive modelling under imbalanced distributions. arXiv preprint arXiv:1505.01658, 2015.
  • 129. References
[DT10] Misha Denil and Thomas Trappenberg. Overlap versus imbalance. In Canadian Conference on Artificial Intelligence, pages 220–231. Springer, 2010.
[GBSW04] Jerzy W Grzymala-Busse, Jerzy Stefanowski, and Szymon Wilk. A comparison of two approaches to data mining from imbalanced data. In International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pages 757–763. Springer, 2004.
[Jap03] Nathalie Japkowicz. Class imbalances: are we focusing on the right issue? 2003.
[JJ04] Taeho Jo and Nathalie Japkowicz. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1):40–49, 2004.
[Kra16] Bartosz Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 2016.
[LCT19] Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. Bayes imbalance impact index: A measure of class imbalanced data set for classification problem. IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • 130. References
[WP03] Gary M Weiss and Foster Provost. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research, 19:315–354, 2003.