Slides from a lecture-style tutorial on data quality for ML delivered at SIGKDD 2021.
The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Data remains susceptible to errors or irregularities that may be introduced during the collection, aggregation or annotation stages. This necessitates profiling and assessment of data to understand its suitability for machine learning tasks; failure to do so can result in inaccurate analytics and unreliable decisions. While researchers and practitioners have focused on improving the quality of models (through, for example, neural architecture search and automated feature selection), there have been limited efforts towards improving data quality.
Assessing the quality of the data across intelligently designed metrics, and developing corresponding transformation operations to address the quality gaps, reduces the effort a data scientist spends on iterative debugging of the ML pipeline to improve model performance. This tutorial highlights the importance of analysing data quality in terms of its value for machine learning applications. Surfacing data quality issues helps different personas, such as data stewards, data scientists, subject matter experts, and machine learning scientists, obtain relevant data insights and take remedial actions to rectify any issues. The tutorial surveys the important data quality approaches for structured, unstructured and spatio-temporal domains discussed in the literature, focusing on the intuition behind them, highlighting their strengths and similarities, and illustrating their applicability to real-world problems.
Outliers - Regression Task
1. There is one outlier far from the other points, though it only appears to slightly influence the line.
2. There is one outlier on the right, though it is quite close to the least squares line, which suggests it wasn't very influential.
3. There is one point far away from the cloud, and this outlier appears to pull the least squares line up on the right; notice how the line around the primary cloud doesn't appear to fit very well.
4. There is a primary cloud and then a small secondary cloud of four outliers. The secondary cloud appears to be influencing the line somewhat strongly, making the least squares line fit poorly almost everywhere. There might be an interesting explanation for the dual clouds, which is something that could be investigated.
5. There is no obvious trend in the main cloud of points, and the outlier on the right appears to largely control the slope of the least squares line.
6. There is one outlier far from the cloud; however, it falls quite close to the least squares line and does not appear to be very influential.
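To make the leverage effect described above concrete, here is a minimal sketch (the data points are synthetic, made up purely for illustration) that fits a least squares line with and without a single far-away outlier and compares the slopes:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "primary cloud" of points following a linear trend with noise.
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=50)

# One high-leverage outlier far to the right and well below the trend.
x_out = np.append(x, 25.0)
y_out = np.append(y, 5.0)

def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares fit."""
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept

print("without outlier:", fit_line(x, y))
print("with outlier:   ", fit_line(x_out, y_out))
# The single far-away point pulls the slope down noticeably, whereas an
# outlier lying close to the original line would barely change the fit.
```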
Spatio-temporal data in data science.
Applications of ST data, presented from a business point of view.
Introduce several outlier detection techniques for spatio-temporal data.
Outlier detection and removal is an important preprocessing step prior to building a machine learning model. Outliers are usually points in the data that do not follow the general trends and stand out when compared to other points. If such points are not removed from the dataset, they might hamper the ability of a machine learning model to capture the data properties in a generalized way.
It is relatively easy to understand the concept of an outlier in the case of tabular or time-series data. For example, by simply plotting a given tabular dataset, as shown in Fig1, one can observe that one point lies away from the general trend of the data and can be treated as an outlier. One can also find outliers by plotting the statistical nature of the data, such as the box plots in Fig2, and identifying points lying far away from the usual data distribution. Even in the case of time-series data, a simple plot as shown in Fig3 can help one discover outliers present in the data.
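As a quick illustration of the box-plot intuition, here is a minimal sketch (assuming the data is a numeric numpy array; the 1.5×IQR threshold is the usual box-plot whisker convention, not something specific to this tutorial) that flags points lying outside the whiskers:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], i.e. beyond the box-plot whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

data = np.array([9.8, 10.1, 10.4, 9.9, 10.0, 10.2, 25.0])  # 25.0 clearly stands out
print(data[iqr_outliers(data)])  # -> [25.]
```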
The same is not true for text data: given a text classification dataset, it is not clear what an outlier is or how to identify one. There are multiple ways to interpret an outlier in text data. Given a dataset, an outlier could be a data point that is topically different from the other data points (for example, a sports news article in a political news corpus), gibberish text (which can be found in product reviews, tweets, etc.), an incomplete sentence that does not convey any meaning, or a data point in a foreign language compared to the majority (French data points in an English corpus).
Here are a few samples from the popular IMDB sentiment analysis dataset. This dataset contains movie reviews scraped from IMDB, with a label associated with each review indicating whether the review is positive or negative.
A data sample can be anomalous due to various reasons such as
there is a lot of repetitive content in it, as shown in the first example
the sample is difficult to comprehend even for a human and assign a label, as shown in the second example
the sample is incomplete, as shown in the third example
From these examples, it can be observed that, leaving the model aside, it can sometimes be difficult even for a human to understand a sample and provide a label for it.
Having shared some examples, I would like to discuss two approaches for anomaly or outlier detection in text.
The first is a fairly classical approach based on matrix factorization, adapted to work on text data.
The second relies more on newer DL-based techniques, such as pretrained word embeddings and self-attention, to identify anomalies in the data.
"Outlier Detection for Text Data" was published at SDM (the SIAM International Conference on Data Mining) in 2017 and proposes matrix factorization techniques to detect outliers in text data.
Matrix factorization is predominantly used in recommender systems to decompose an interaction matrix into two lower-dimensional matrices.
The same technique is applied here to identify outliers in the given text data.
Firstly, a numeric representation of the data is required. A simple way of representing a document is in terms of the words present in it, which is called the Bag of Words approach.
Given a set of documents, we represent them as a term matrix A of size m x n, where m is the number of unique words across the given documents and n is the number of documents.
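A minimal sketch of building such a term-document matrix with scikit-learn's CountVectorizer (the toy documents are invented for illustration; note that CountVectorizer returns a documents x terms matrix, so we transpose to match the m x n terms x documents convention used here):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was great, truly great acting",
    "terrible plot and equally terrible acting",
    "the soccer match ended in a goalless draw",  # topically different from the rest
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # shape: (n documents, m terms)
A = X.T.toarray()                    # shape: (m terms, n documents)
print(A.shape)                       # (number of unique words, number of documents)
```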
Now, we use matrix factorization techniques to decompose the term matrix A into a low-rank matrix (L) and an outlier matrix (Z).
Further, the low-rank matrix L can be represented as the product of two matrices W and H. Intuitively, this corresponds to every document a_i being represented as a linear combination of r topics. In cases where this is not true, the document is an outlier, and those unrepresentable portions of the matrix are captured by the non-zero entries of the Z matrix.
In order to obtain the matrix Z, we solve an optimization problem in which we find values for the matrices W, H and Z that closely approximate A.
The L1,2-norm penalty on Z is the sum of the L2-norm outlier scores over all the documents. Therefore, the optimization problem essentially tries to find the best model, an important component of which is minimizing the sum of the outlier scores over all documents.
Once the problem is solved, each entry in Z corresponds to a term in a document. Since we are interested in the outlier behaviour of the entire document, the aggregate outlier behaviour of document x can be modelled with the L2-norm score of the corresponding column z_x.
For high-dimensional data such as text, sparse coefficients are required to obtain an interpretable low-rank matrix WH; hence an additional L1 penalty is imposed on the matrix H.
Due to the L1,2-norm, this optimization corresponds to the two-block non-smooth BCD (block coordinate descent) framework, where the problem is solved in two steps: in step 1, we freeze W and H and solve for Z, and in step 2, we freeze Z and solve for W and H according to the problem formulation in Equation 2.
Additionally, we partition the matrix Z into vector blocks z_i and construct Z as a set of vectors z_i. This imposes a semantic constraint that outliers in one document do not affect the other documents; moreover, when all the other blocks w_1, ..., w_r, h_1, ..., h_r are fixed, every vector z_i in Z can be solved to optimality in parallel.
As mentioned earlier, once the optimization problem converges, we can identify outlier documents by aggregating the outlier scores in each column of Z. The higher the aggregate score of a column, the higher the chance that the corresponding document is an outlier.
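The following is a minimal sketch of this idea, not the paper's exact block coordinate descent solver: it alternately fits a low-rank NMF approximation and assigns to Z the column-wise shrunk residual (mimicking the L1,2 penalty), then scores each document by the L2 norm of its column of Z. The rank and the shrinkage strength lam are illustrative parameters, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import NMF

def text_outlier_scores(A, rank=2, lam=1.0, n_iters=10):
    """Approximate A ~ W @ H + Z; columns of Z with large L2 norm mark outlier documents.

    A is the (terms x documents) count matrix. This is a simplified alternating
    scheme for illustration, not the exact solver from the SDM 2017 paper.
    """
    Z = np.zeros_like(A, dtype=float)
    for _ in range(n_iters):
        # Fix Z: fit a low-rank non-negative factorization of A - Z.
        model = NMF(n_components=rank, init="nndsvda", max_iter=500)
        W = model.fit_transform(np.clip(A - Z, 0.0, None))
        H = model.components_
        R = A - W @ H
        # Fix W, H: column-wise shrinkage of the residual
        # (the proximal step induced by the L1,2-norm penalty on Z).
        norms = np.linalg.norm(R, axis=0)
        scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
        Z = R * scale  # documents well explained by the r topics end up with z_i ~ 0
    return np.linalg.norm(Z, axis=0)  # per-document outlier scores

# Example with the toy term matrix A built above:
# scores = text_outlier_scores(A.astype(float))
# print(scores)  # the topically different document should receive the largest score
```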
Having seen how a traditional technique such as matrix factorization can be used to detect outliers in a dataset, we now move on to newer approaches that rely on DL and pretrained word embeddings to detect outliers in the data.
This is a recent paper published at a prominent NLP conference, ACL, in 2019. Unlike the previous approach, this paper relies on recent techniques such as pretrained word embeddings and deep learning architectures to identify outliers present in the data.
The paper proposes a one-class classification method which takes word embeddings as input and identifies whether a document is an anomaly or not.
Similar to the previous technique, the first step is to represent the text in a numerical format. For this, we rely on pretrained word embeddings. For those of you who are unfamiliar with them, a word embedding is a numerical representation of a word learnt over a huge text corpus such as news or Wikipedia. These embeddings come in various sizes, such as 50D, 100D and 300D, and have interesting properties: words with similar meanings lie nearby in the embedding space, and the embeddings capture interesting relationships among words, such as countries and capitals or countries and currencies.
Put simply, think of a word embedding as a lookup table or dictionary which, when queried with a word, provides a numeric representation of it.
Now, embeddings are obtained for each word in a sentence, and the next task is to represent every sentence with a fixed-size representation, irrespective of the number of words in it.
For this, the authors rely on a technique called multi-head self-attention, which maps variable-length sentences to a fixed-length representation and, additionally, gives multiple numeric representations of the same sentence, each capturing a different context. For example, the word "march" can refer to a month or to a march past; hence, based on the context, a different representation is required for a given sentence.
In the self-attention mechanism, given the word embedding matrix of a sentence, we compute the attention matrix as shown in Equation 1.
Once the attention matrix is computed, as shown in Equation 2, we multiply it with the word embedding matrix to get the sentence representations.
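A minimal numpy sketch of this attention-based pooling, assuming the common self-attentive formulation A = softmax(tanh(H W1) W2) and M = A^T H (the dimensions and random weights below are illustrative, not the paper's trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# H: pretrained word embeddings of one sentence, shape (num_words, embed_dim).
num_words, d, d_a, r = 7, 50, 32, 3   # r = number of attention heads / contexts
H = rng.normal(size=(num_words, d))

W1 = rng.normal(size=(d, d_a))
W2 = rng.normal(size=(d_a, r))

A = softmax(np.tanh(H @ W1) @ W2, axis=0)  # Eq. 1: per-head attention weights over words
M = A.T @ H                                # Eq. 2: r fixed-size sentence representations
print(M.shape)                             # (r, embed_dim), regardless of sentence length
```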
Once the sentence-level embeddings are obtained, the authors propose determining context vectors. These context vectors are expected to behave similarly to the sentence representations, with the additional constraint that different context vectors capture diverse contexts.
The context vectors are determined using Equation 1, which minimizes the cosine distance between the sentence embeddings M and the context vectors C; as mentioned, an orthogonality constraint is imposed on the context vectors so that they capture diverse contexts.
Once training converges, most of the context vectors have representations similar to the sentence embedding matrix M, while some do not. We can quantify the similarity between these two representations using a cosine distance function.
For a given context, the cosine distance is computed between the corresponding sentence representation and context vector. Then, in order to get a single score, the scores for all the contexts need to be aggregated: they can either be given equal weights and averaged, as shown in the equation, or be assigned different weights and combined as a weighted average.
The higher the score s(H) for a given sentence, the greater the chance that it is an outlier, since its representations lie far from the context vectors.
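A hedged sketch of the scoring step, assuming learnt context vectors C with the same shape as the sentence representations M from the previous sketch (both are random here, purely for illustration) and an unweighted average of the per-context cosine distances:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(1)
r, d = 3, 50
M = rng.normal(size=(r, d))  # context-specific sentence representations (from Eq. 2)
C = rng.normal(size=(r, d))  # context vectors learnt during training (random for illustration)

# Anomaly score of the sentence: average cosine distance between each
# context-specific representation m_k and its context vector c_k.
s_H = np.mean([cosine_distance(M[k], C[k]) for k in range(r)])
print(s_H)  # a higher score suggests the sentence is more likely an outlier
```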
So far, we have seen various quality metrics for quantifying the quality of given text data. How do we know what the right combination of metrics is for understanding the complexity of a dataset for a classification task? This paper attempts to solve exactly that problem.
The paper proposes an approach to design a data quality metric that explains the complexity of the given data for a classification task by considering various properties of the data. The authors consider four characteristics of the data, namely class diversity, class imbalance, class interference and data complexity.
The class diversity characteristic captures the count-based probability distribution of classes in the dataset, using measures such as Shannon class diversity and Shannon class equitability, which consider the class distribution and measure the diversity of the dataset.
The class imbalance characteristic measures the amount of imbalance present in the data and is computed using the formula shown.
The class interference characteristic measures the similarity among samples belonging to different classes.
Hellinger Similarity: measures similarity between two probability distributions
Top N-gram interference: Average Jaccard similarity between the set of the top 10 most frequent n-grams from each class.
Mutual Information: Average mutual information score between the set of the top 10 most frequent n-grams from each class.
The data complexity characteristic measures the complexity of the data based on linguistic properties.
Distinct n-gram to total n-gram ratio: the count of distinct n-grams in a dataset divided by the total number of n-grams.
Inverse Flesch Reading Ease: the Flesch Reading Ease score grades text from 100 to 0, with 100 indicating most readable and 0 indicating difficult to read; we take the reciprocal of this measure.
N-Gram and Character diversity - Using the Shannon Index and Equitability, we calculate the diversity and equitability of n-grams and characters
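To make a couple of these measures concrete, here is a minimal sketch (the label list and n-gram sets are toy examples, and the formulas follow standard definitions of Shannon diversity/equitability, absolute-deviation imbalance and Jaccard interference rather than necessarily the paper's exact normalizations):

```python
import math
from collections import Counter

def shannon_diversity_and_equitability(labels):
    """Shannon index H over the class distribution, and H / ln(k) as equitability."""
    counts = Counter(labels)
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    equitability = h / math.log(len(counts)) if len(counts) > 1 else 0.0
    return h, equitability

def class_imbalance(labels):
    """Total deviation of the class distribution from a perfectly balanced one."""
    counts = Counter(labels)
    n, k = sum(counts.values()), len(counts)
    return sum(abs(c / n - 1 / k) for c in counts.values())

def top_ngram_interference(top_ngrams_a, top_ngrams_b):
    """Jaccard similarity between the top n-gram sets of two classes."""
    a, b = set(top_ngrams_a), set(top_ngrams_b)
    return len(a & b) / len(a | b)

labels = ["pos"] * 80 + ["neg"] * 20
print(shannon_diversity_and_equitability(labels))
print(class_imbalance(labels))
print(top_ngram_interference({"not good", "bad movie"}, {"not good", "great film"}))
```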
The authors consider these properties of each data characteristic and propose a 48-dimensional feature vector to represent the complexity of the given data. On the right side you can see the different characteristics of the data and the properties within each characteristic; the number in parentheses denotes the number of dimensions assigned to each data characteristic.
Once the 48-dimensional feature vector is constructed for each dataset, there are 2^48 possible combinations of metrics that can be designed from this feature representation.
In order to intelligently traverse this search space and find the best metric, the authors propose using genetic algorithms. These algorithms rely on a fitness function to rank a combination. In the current setting, the authors use the Pearson correlation between the difficulty score obtained from a given combination and the accuracies of different models trained on the dataset. The stronger the negative correlation between the metric and model accuracy, the better the metric.
To identify the best metric, the authors use a large collection of 89 datasets and also consider 12 different models for each dataset.
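A minimal sketch of such a fitness function (the feature matrix, model accuracies and the binary selection mask below are made-up placeholders; scipy's pearsonr supplies the correlation):

```python
import numpy as np
from scipy.stats import pearsonr

def fitness(mask, features, accuracies):
    """Fitness of one metric combination: Pearson correlation between the combined
    difficulty score and model accuracy (more negative = better metric).

    mask       : boolean vector of length 48 selecting which metric dimensions to use
    features   : (num_datasets, 48) matrix of per-dataset metric values
    accuracies : (num_datasets,) mean model accuracy per dataset
    """
    difficulty = features[:, mask].sum(axis=1)  # simple combination: sum of selected metrics
    corr, _ = pearsonr(difficulty, accuracies)
    return corr

# Toy example: 10 datasets, 48 metric dimensions, random accuracies.
rng = np.random.default_rng(0)
features = rng.random((10, 48))
accuracies = rng.random(10)
mask = rng.random(48) < 0.5
print(fitness(mask, features, accuracies))  # a GA would evolve masks minimizing this value
```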
Based on the experiments, the authors present the best metric they identified for describing data complexity. On the right you can see this metric, which has a strong negative correlation of -0.88 with model accuracy.
For a qualitative analysis, we can look at the provided plot. On the X-axis is the difficulty of a dataset, measured using the identified metric D2, and on the Y-axis is the F1 score of the models.
It can be observed from the plot that as we move from left to right on the X-axis, model performance drops.
For a low data difficulty measure such as 2, model performance lies above 0.9, but for a higher data difficulty measure such as 5, model performance lies in the range 0.2-0.4, illustrating the effectiveness of the identified metric.