Speaker
Saradindu
Sengupta
Community
Day 2022
August 7th, 2022
Conrad Bangalore
Senior ML Engineer @Nunam
where I build learning systems to forecast
the health and failure of Li-ion batteries.
Managing data quality in Machine Learning
Data Quality - as a definition
Is the data healthy enough to be used?
● Is the data consistent enough to be used?
● Is the data accurate enough to be used?
● Is the data complete enough to be used?
● Is the data recent enough to be used?
Purpose of the data
● ML workload
● General product analysis
● Sales & Marketing analysis
● R&D
● A/B Testing
Data Quality for Machine Learning
Data distribution shift
● Covariate shift
○ A covariate is an independent variable that can influence the outcome but is not itself of direct
interest
○ Occurs when the distribution of the independent variables differs between training and test data
■ It can happen due to sample selection biases
■ Upsampling or downsampling can also cause covariate shift
■ Model learning process, such as active learning, can also cause covariate shift
■ In production, covariate shift happens primarily due to change in environment
○ If it is known in advance how the real-world input distribution will differ from the training input
distribution, importance weighting can be used; in practice, however, the real-world distribution is
rarely known ahead of time
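Importance weighting needs an estimate of the density ratio p_prod(x) / p_train(x). One common trick, sketched below on hypothetical data, is to train a classifier to distinguish training samples from production samples and convert its probabilities into weights. This classifier-based estimator is an illustrative assumption, not something the slides prescribe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical 1-D feature: training data vs. a mean-shifted "production" sample
X_train = rng.normal(0.0, 1.0, size=(1000, 1))
X_prod = rng.normal(0.5, 1.0, size=(1000, 1))

# Train a classifier to separate the two samples; its odds approximate
# p_prod(x) / p_train(x), the importance weight for each training point.
X = np.vstack([X_train, X_prod])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])
clf = LogisticRegression().fit(X, y)

p = clf.predict_proba(X_train)[:, 1]  # P(sample came from prod | x)
weights = p / (1.0 - p)               # density-ratio estimate

# These can then be passed to the downstream model, e.g.
# model.fit(X_train, y_train, sample_weight=weights)
```

Training points that look more like production data get larger weights, so the loss emphasizes the region the model will actually see.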
Data Quality for Machine Learning
Data distribution shift
● Label shift
○ The output distribution changes, but for a given output the input distribution remains the same
○ Since covariate shift changes the distribution of the independent variables, which in turn influences
the dependent variable, covariate shift often induces label shift as well
● Concept drift
○ The input distribution remains the same, but the conditional distribution of the output given an
input changes: same input, different output
○ It can be cyclic or seasonal
Feature Change
When new features are added or old features are removed, the set of possible values for the features changes
Label Schema Change
When the set of possible label values changes
Data Quality Metrics
Summary Statistics // df.describe()
1. Mean
2. Median
3. Variance
4. Skewness
5. Min-Max Range
6. Percentage of Null
7. Percentage of 0
8. Standard Deviation
9. Percentage of Uniques
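The metrics that `df.describe()` does not cover can be computed directly with pandas; a minimal sketch on a hypothetical column:

```python
import pandas as pd

# Toy column with nulls, zeros, and repeated values (hypothetical data)
df = pd.DataFrame({"voltage": [3.7, 3.8, 0.0, None, 3.7, 4.2, 0.0, 3.9]})
col = df["voltage"]

metrics = {
    "mean": col.mean(),
    "median": col.median(),
    "variance": col.var(),
    "skewness": col.skew(),
    "min_max_range": col.max() - col.min(),
    "pct_null": col.isna().mean() * 100,
    "pct_zero": (col == 0).mean() * 100,   # NaN compares False, as intended
    "std": col.std(),
    "pct_unique": col.nunique() / len(col) * 100,
}
```

Tracking these per column over time is often enough to catch gross ingestion problems before any statistical test is needed.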
Advanced Metrics
1. Two sample hypothesis test
a. Determines whether the difference between two populations is statistically significant
b. Caveat: statistically significant does not mean practically important; an effect that is observable
even in a small sample is more likely to be both statistically significant and practically important
c. Kolmogorov-Smirnov Test
i. A non-parametric statistical test of whether two samples are drawn from the same distribution
d. Least-Squares Density Difference
i. Based on least-squares density difference estimation method
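The two-sample KS test is available as `scipy.stats.ks_2samp`; the sketch below, on synthetic data with a deliberately shifted mean, shows how it could flag a drifting feature (the 0.01 threshold is an illustrative assumption):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical feature: training window vs. a drifted production window
train_sample = rng.normal(0.0, 1.0, 2000)
prod_sample = rng.normal(0.3, 1.0, 2000)  # the mean has shifted

stat, p_value = ks_2samp(train_sample, prod_sample)

# Small p-value -> reject "same distribution"; flag the feature for review
drift_detected = p_value < 0.01
```

Note the caveat above: with samples this large, even a practically negligible shift would be flagged, so the effect size (`stat`) deserves as much attention as the p-value.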
Data Quality Metrics
Time-series specific
● Event Data loss
○ There are gaps in the time-series
● Value spikes
○ Sudden changes which are implausible for the domain
● Signal Noise
○ Inaccurate measurement
● Diverging sampling
○ Different sampling rate
● Inconsistent noise model
○ The noise level changes cyclically
● Divergent despite correlation
○ Values that are normally correlated behave differently
● Heteroscedasticity
○ Sub-populations with different variabilities
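The first two checks (event data loss and value spikes) can be sketched with pandas on a hypothetical voltage stream; the one-second sampling rate and the MAD-based spike threshold are illustrative assumptions:

```python
import pandas as pd

# Hypothetical battery-voltage stream sampled at 1 s, with a gap and a spike
idx = pd.to_datetime([
    "2022-08-07 10:00:00", "2022-08-07 10:00:01", "2022-08-07 10:00:02",
    "2022-08-07 10:00:10",  # 8 s jump -> event data loss
    "2022-08-07 10:00:11", "2022-08-07 10:00:12",
])
s = pd.Series([3.70, 3.71, 3.70, 3.72, 9.99, 3.71], index=idx)  # 9.99 V spike

# Event data loss: inter-sample gaps larger than the expected sampling period
expected = pd.Timedelta(seconds=1)
gap_mask = s.index.to_series().diff() > expected

# Value spikes: robust z-score using the median absolute deviation (MAD),
# which is not distorted by the spike itself the way the std would be
dev = (s - s.median()).abs()
mad = dev.median()
spike_mask = dev > 10 * mad
```

`gap_mask` marks the sample after each gap, and `spike_mask` marks the implausible 9.99 V reading.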
Machine Learning for Data Quality
Dimensionality Reduction
This method aims to reduce the number of input variables in a dataset by projecting the original
high-dimensional input data onto a low-dimensional space
● Uniform Manifold Approximation and Projection (UMAP)
○ The main feature of this algorithm is its nonlinear representation of the data. Compared to other
dimensionality reduction algorithms, it scales well with both the dimensionality and the size of a
dataset, and it projects quickly
Machine Learning for Data Quality
Clustering
The goal of clustering is to detect distinct groups in an unlabeled dataset, where the users are expected to determine
the criteria of what is a correct cluster so that clustering results meet their expectations.
● Density-based spatial clustering of applications with noise (DBSCAN)
○ Groups together instances that are close to each other, based on a distance measure and a
minimum number of instances specified in advance
● Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
○ Extends DBSCAN into a hierarchical algorithm and extracts a flat clustering based on the stability
of clusters
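A minimal DBSCAN sketch with scikit-learn on synthetic blobs plus a few planted outliers; the `eps` and `min_samples` values are illustrative assumptions for this data, not general defaults:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two synthetic dense groups plus three scattered outliers (hypothetical data)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)
outliers = np.array([[8.0, -8.0], [-8.0, 8.0], [10.0, 10.0]])
X = np.vstack([X, outliers])

# eps: neighborhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

# DBSCAN labels low-density points -1 ("noise"); everything else is a cluster id
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_points = (labels == -1).sum()
```

The `-1` noise label is what makes DBSCAN directly useful for data quality work: points it refuses to assign to any cluster are candidates for inspection.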
Anomaly Detection
Anomaly detection is rarely used in isolation; it often goes along with dimensionality reduction
and clustering algorithms. Using dimensionality reduction as a pre-stage, the high-dimensional space
can be transformed into a lower-dimensional one. The density of the majority of data points in this
lower-dimensional space can then be estimated and treated as "normal". Data points located far away
from the "normal" region are outliers or anomalies.
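The two-stage pipeline described above can be sketched with scikit-learn, using PCA for the reduction stage and Local Outlier Factor for the density stage. Both algorithm choices are assumed stand-ins for illustration; the slides do not prescribe specific methods:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data: 500 normal high-dimensional points plus 5 far-off anomalies
normal = rng.normal(0.0, 1.0, size=(500, 20))
anomalies = rng.normal(8.0, 1.0, size=(5, 20))
X = np.vstack([normal, anomalies])

# Stage 1: project to a lower-dimensional space
X_low = PCA(n_components=3).fit_transform(X)

# Stage 2: density-based outlier scoring in the reduced space;
# points whose local density is much lower than their neighbors' get -1
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_low)
```

Points labeled `-1` sit far from the dense "normal" region and are the ones to surface for manual review.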
References
1. https://github.com/saradindusengupta/GDG_Cloud_Community_Day_Aug07_2022
2. Data quality in time series data: An experience report - Gitzel R, 2005
3. Towards Automated Data Quality Management for Machine Learning - Rukat, Tammo et al, 2022
4. Dataset Shift in Machine Learning - Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton
Schwaighofer and Neil D. Lawrence
5. https://eng.uber.com/monitoring-data-quality-at-scale/
6. https://towardsdatascience.com/automated-data-quality-testing-at-scale-with-sql-and-machine-learning-f3a68e79d8a8
7. What to Do about Missing Values in Time-Series Cross-Section Data - James Honaker, Gary King,
2010
Thank You
/in/saradindusengupta @iamsaradindu /saradindusengupta
