Every ML system needs large amounts of data to train, so changes in the data during model refresh or in production can degrade performance, sometimes significantly. Periodically checking the data stream itself for quality issues has therefore become an important task in the ML system lifecycle. Existing libraries, open-source tools, and full-fledged SaaS platforms can monitor data quality metrics, but the metrics they use are often too generic to be useful.
Simple data quality metrics can be developed individually and integrated with data quality tools or SaaS platforms to monitor them in production. In this talk, I will go through a couple of metrics for different types of data and use cases, and show how to use clustering and other unsupervised learning algorithms to build those metrics. At the end, I will also try to show a demo with integrations and how it can be run in production.
3. Data Quality - as a definition
Is the data healthy enough to be used?
● Is the data consistent enough to be used?
● Is the data accurate enough to be used?
● Is the data complete enough to be used?
● Is the data recent enough to be used?
Purpose of the data
● ML workload
● General product analysis
● Sales & Marketing analysis
● R&D
● AB Testing
4. Data Quality for Machine Learning
Data distribution shift
● Covariate shift
○ A covariate is an independent variable which can influence the outcome but which itself is not of
direct interest
○ Covariate shift occurs when the distribution of the independent variables differs between train and test data
■ It can happen due to sample selection biases
■ Upsampling or downsampling can also cause covariate shift
■ Model learning process, such as active learning, can also cause covariate shift
■ In production, covariate shift happens primarily due to change in environment
○ If it is known in advance how the real-world input distribution will differ from the training input distribution,
importance weighting can be used, but it is highly unlikely that the real-world data
distribution will be known in advance
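As a minimal sketch of the importance weighting mentioned above, assume a single categorical covariate whose production proportions are known in advance. Each training row is weighted by p_production(x) / p_train(x). The function name, the `device` column, and the 50/50 target split are hypothetical and only serve the illustration.

```python
from collections import Counter

def importance_weights(train_values, target_proportions):
    """Weight each training row by p_target(x) / p_train(x) for a categorical covariate x."""
    n = len(train_values)
    train_freq = {value: count / n for value, count in Counter(train_values).items()}
    # Values absent from the target distribution get weight 0.
    return [target_proportions.get(v, 0.0) / train_freq[v] for v in train_values]

# Hypothetical data: training oversamples "mobile", production is assumed to be 50/50.
train_device = ["mobile"] * 8 + ["desktop"] * 2
weights = importance_weights(train_device, {"mobile": 0.5, "desktop": 0.5})
```

The resulting weights could then be passed to any training routine that accepts per-sample weights, so that under-represented inputs count more during training.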
5. Data Quality for Machine Learning
Data distribution shift
● Label shift
○ The output distribution changes, but for a given output the input distribution remains the same
○ Because a change in the distribution of the independent variables also influences the dependent
variable, covariate shift can also lead to label shift
● Concept drift
○ The input distribution remains the same, but the conditional distribution of the output given an input changes:
same input, different output
○ It can be cyclic or seasonal
Feature Change
When new features are added, old features are removed, or the set of possible values for a feature changes
Label Schema Change
When the set of possible label values changes
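Feature change and label schema change can both be checked by comparing a reference schema against the current one. The sketch below assumes each schema is a plain mapping from column name to the set of values seen for it. The function name, the column names, and the values are hypothetical.

```python
def schema_changes(reference, current):
    """Compare two {column_name: set_of_values} mappings and report what changed."""
    ref_cols, cur_cols = set(reference), set(current)
    report = {
        "added_features": sorted(cur_cols - ref_cols),
        "removed_features": sorted(ref_cols - cur_cols),
        "changed_values": {},
    }
    for col in sorted(ref_cols & cur_cols):
        new = set(current[col]) - set(reference[col])
        missing = set(reference[col]) - set(current[col])
        if new or missing:
            report["changed_values"][col] = {"new": sorted(new), "missing": sorted(missing)}
    return report

# Hypothetical schemas: "plan" disappears, "device" appears, and the label gains a class.
reference = {"country": {"IN", "US"}, "plan": {"free", "pro"}, "label": {"NEGATIVE", "POSITIVE"}}
current = {"country": {"IN", "US", "DE"}, "device": {"ios"}, "label": {"NEGATIVE", "POSITIVE", "NEUTRAL"}}
report = schema_changes(reference, current)
```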
6. Data Quality Metrics
Summary Statistics // df.describe()
1. Mean
2. Median
3. Variance
4. Skewness
5. Min-Max Range
6. Percentage of Null
7. Percentage of 0
8. Standard Deviation
9. Percentage of Uniques
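The summary statistics listed above can be sketched for a single numeric column using only the standard library. This is an illustration under assumptions, not the talk's implementation: `None` marks a missing value, percentages are taken over all rows, and population variance and standard deviation are used (pandas `describe()` reports the sample standard deviation instead). The function and key names are hypothetical.

```python
import statistics

def profile_column(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    n = len(values)
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no non-null values")
    mean = statistics.fmean(present)
    std = statistics.pstdev(present)
    # Population skewness: third central moment divided by the cubed standard deviation.
    skew = sum((v - mean) ** 3 for v in present) / len(present) / std ** 3 if std > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(present),
        "variance": statistics.pvariance(present),
        "skewness": skew,
        "min_max_range": (min(present), max(present)),
        "pct_null": 100.0 * (n - len(present)) / n,
        "pct_zero": 100.0 * sum(1 for v in present if v == 0) / n,
        "std_dev": std,
        "pct_unique": 100.0 * len(set(present)) / n,
    }

# Hypothetical column with two missing values and two zeros.
column = [0, 1, 2, None, 5, 0, None, 4]
profile = profile_column(column)
```

Tracking such a profile per batch and comparing it against a reference batch is one simple way to turn these statistics into a monitoring signal.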
Advanced Metrics
1. Two sample hypothesis test
a. Determines whether the difference between two populations is statistically significant
b. Caveat: statistically significant doesn't mean practically important. A difference that is already
detectable in a small sample is more likely to be practically important as well.
c. Kolmogorov-Smirnov Test
i. A non-parametric statistical test of whether two samples come from the same distribution
d. Least-Squares Density Difference
i. Based on least-squares density difference estimation method
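To make the Kolmogorov-Smirnov test concrete, here is a minimal pure-Python sketch of the two-sample statistic D, the largest gap between the two empirical CDFs. It compares D against the standard large-sample critical value c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2). The function name and the data are hypothetical. In practice `scipy.stats.ks_2samp` is the usual choice and also returns a p-value.

```python
import bisect
import math

def ks_two_sample(reference, current, alpha=0.05):
    """Return (D, critical value, drift flag) for the two-sample KS test."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    # D is the largest absolute gap between the empirical CDFs, checked at every observed value.
    d = max(
        abs(bisect.bisect_right(ref, x) / n - bisect.bisect_right(cur, x) / m)
        for x in ref + cur
    )
    # Large-sample approximation of the critical value; rough for small samples.
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return d, critical, d > critical

# Hypothetical feature values: the current batch is shifted by 50 units.
reference_batch = list(range(100))
current_batch = [x + 50 for x in reference_batch]
d_stat, critical_value, drifted = ks_two_sample(reference_batch, current_batch)
same_d, _, same_drifted = ks_two_sample(reference_batch, reference_batch)
```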
7. Data Quality Metrics
Time-series specific
● Event Data loss
○ There are gaps in the time-series
● Value spikes
○ Sudden changes which are implausible for the domain
● Signal Noise
○ Inaccurate measurement
● Diverging sampling
○ Different sampling rate
● Inconsistent noise model
○ The level of noise changes in cyclic order
● Divergent despite correlation
○ Values which are correlated behave differently
● Heteroscedasticity
○ Sub-populations have different variability
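Two of the time-series checks above, event data loss and value spikes, can be sketched in a few lines. The sketch assumes sorted numeric timestamps with a known expected sampling interval, and it flags a spike when a jump is much larger than the median jump. The function names, thresholds, and data are hypothetical choices for illustration.

```python
import statistics

def find_gaps(timestamps, expected_step, tolerance=1.5):
    """Return (start, end) pairs where consecutive timestamps are too far apart."""
    return [
        (t0, t1)
        for t0, t1 in zip(timestamps, timestamps[1:])
        if t1 - t0 > tolerance * expected_step
    ]

def find_spikes(values, threshold=5.0):
    """Return indices whose jump from the previous value far exceeds the median jump."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    typical = statistics.median(diffs) or 1e-9  # avoid a zero baseline on flat series
    return [i + 1 for i, d in enumerate(diffs) if d > threshold * typical]

# Hypothetical sensor stream sampled every 60 seconds with one missing stretch.
timestamps = [0, 60, 120, 300, 360]
gaps = find_gaps(timestamps, expected_step=60)

# Hypothetical readings with one implausible value at index 3.
readings = [10.0, 10.2, 10.1, 55.0, 10.3, 10.2]
spikes = find_spikes(readings)
```

Note that `find_spikes` reports both the jump into the spike and the jump back out of it, so a single bad reading shows up as two consecutive indices.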
8. Machine Learning for Data Quality
Dimensionality Reduction
This method aims to reduce the number of input variables in a dataset by projecting the original
high-dimensional input data to a low-dimensional space
● Uniform Manifold Approximation and Projection (UMAP)
○ The main feature of this algorithm is its nonlinear representation of the data. Compared to other
dimensionality reduction algorithms, it scales well with both the dimensionality and the size of a dataset,
and its projection is fast
9. Machine Learning for Data Quality
Clustering
The goal of clustering is to detect distinct groups in an unlabeled dataset. Users are expected to define
the criteria for what counts as a correct cluster so that the clustering results meet their expectations.
● Density-based spatial clustering of applications with noise (DBSCAN)
○ Groups together all instances that are close to each other, based on a distance
measurement and a minimum number of instances, both specified in advance
● Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
○ Extends DBSCAN into a hierarchical clustering algorithm and then extracts a flat clustering based on the stability
of clusters.
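The DBSCAN idea described above can be sketched in pure Python. This is a simplified, quadratic-time illustration with Euclidean distance, not a production implementation; libraries such as scikit-learn provide an optimized `DBSCAN`. The parameter names `eps` and `min_samples` follow common usage, and the points are hypothetical.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Includes the point itself, so min_samples counts the point too.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels

# Hypothetical 2-D points: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

Points labelled -1 belong to no dense region, which is what makes DBSCAN useful as a data quality signal.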
Anomaly Detection
Anomaly detection does not stand on its own. It often goes together with dimensionality reduction
and clustering. Using dimensionality reduction as a pre-processing stage for anomaly detection,
the high-dimensional space is transformed into a lower-dimensional one. The dense regions that contain most data
points in this lower-dimensional space can then be identified and treated as normal. Data points
located far away from this “normal” region are outliers or anomalies.
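A minimal sketch of the last step, assuming the data has already been projected to a low-dimensional space (for example by UMAP): each point is scored by its mean distance to its k nearest neighbours as a rough inverse-density measure, and points far above the median score are flagged. The scoring rule, the `factor` threshold, and the embedded points are assumptions made for illustration, not the method used in the talk.

```python
import math
import statistics

def knn_outliers(points, k=3, factor=3.0):
    """Flag points whose mean distance to their k nearest neighbours is far above the median."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    cutoff = factor * statistics.median(scores)
    return [i for i, score in enumerate(scores) if score > cutoff]

# Hypothetical 2-D embedding: five points in a dense region and one far away.
embedded = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
anomalies = knn_outliers(embedded)
```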
10. References
1. https://github.com/saradindusengupta/GDG_Cloud_Community_Day_Aug07_2022
2. Data quality in time series data: An experience report - Gitzel R, 2005
3. Towards Automated Data Quality Management for Machine Learning - Rukat, Tammo et al, 2022
4. Dataset Shift in Machine Learning - Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence
5. https://eng.uber.com/monitoring-data-quality-at-scale/
6. https://towardsdatascience.com/automated-data-quality-testing-at-scale-with-sql-and-machine-learning-f3a68e79d8a8
7. What to Do about Missing Values in Time-Series Cross-Section Data - James Honaker, Gary King, 2010