Speaker
Saradindu Sengupta
Community Day 2022
August 7th, 2022
Conrad Bangalore
Senior ML Engineer @Nunam, where I work on building learning systems to forecast the health and failure of Li-ion batteries.
Managing data quality in Machine Learning
Data Quality - as a definition
Is the data healthy enough to be used?
● Is the data consistent enough to be used?
● Is the data accurate enough to be used?
● Is the data complete enough to be used?
● Is the data recent enough to be used?
Purpose of the data
● ML workload
● General product analysis
● Sales & Marketing analysis
● R&D
● A/B testing
Data Quality for Machine Learning
Data distribution shift
● Covariate shift
○ A covariate is an independent variable that can influence the outcome but is not itself of direct interest
○ Covariate shift occurs when the distribution of the independent variables, P(X), differs between training and test data while the relationship between input and output, P(Y|X), stays the same
■ It can be caused by sample selection bias
■ Upsampling or downsampling can also cause covariate shift
■ The model learning process, such as active learning, can also cause covariate shift
■ In production, covariate shift happens primarily because the environment changes
○ If it is known in advance how the real-world input distribution will differ from the training input distribution, importance weighting can be used. In practice, the real-world distribution is rarely known in advance
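A minimal sketch of importance weighting, assuming (hypothetically) that both the training input density and the real-world input density are known Gaussians. Each training sample is weighted by the ratio of the two densities, so that statistics computed on training data approximate the real-world distribution. The function names and the Gaussian parameters are illustrative assumptions, not taken from the talk.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def importance_weights(xs, train_pdf, target_pdf):
    """Weight w(x) = p_target(x) / p_train(x) for every training input."""
    return [target_pdf(x) / train_pdf(x) for x in xs]


# Assumption: training inputs follow N(0, 1), production inputs follow N(1, 1).
rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(20000)]

weights = importance_weights(
    train_x,
    train_pdf=lambda x: gaussian_pdf(x, 0.0, 1.0),
    target_pdf=lambda x: gaussian_pdf(x, 1.0, 1.0),
)

# Self-normalised weighted mean: estimates the production mean of x
# using only samples drawn from the training distribution.
weighted_mean = sum(w * x for w, x in zip(weights, train_x)) / sum(weights)
unweighted_mean = sum(train_x) / len(train_x)
```

The same weights could be passed as per-sample weights to a training loss. When the real-world density is not known, as the slide notes is the usual case, the ratio would have to be estimated instead.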
● Label shift
○ The output distribution P(Y) changes, but for a given output the input distribution P(X|Y) remains the same
○ Because a change in the distribution of the independent variables can also change the distribution of the dependent variable, covariate shift can bring label shift with it
● Concept drift
○ The input distribution remains the same, but the conditional distribution of the output given an input, P(Y|X), changes: same input, different output
○ It can be cyclic or seasonal
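The three kinds of shift can be told apart by which factor of the joint distribution changes. The block below is the standard decomposition, with X denoting the inputs and Y the outputs; it is a summary aid rather than something shown on the slides.

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y) \\
\text{Covariate shift:} &\quad P(X)\ \text{changes},\quad P(Y \mid X)\ \text{stays the same} \\
\text{Label shift:} &\quad P(Y)\ \text{changes},\quad P(X \mid Y)\ \text{stays the same} \\
\text{Concept drift:} &\quad P(Y \mid X)\ \text{changes},\quad P(X)\ \text{stays the same}
\end{align*}
```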
Feature Change
When new features are added, old features are removed, or the set of possible values of a feature changes
Label Schema Change
When the set of possible label values changes
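Feature changes and label schema changes can be caught with a plain comparison of two batches of records. The sketch below assumes each record is a dict and that the caller names the categorical columns to compare; the field names (`chemistry`, `label`, `firmware`) are made up for illustration.

```python
def schema_diff(reference_rows, current_rows, categorical_cols):
    """Report added/removed features and unseen categorical values."""
    ref_cols = set().union(*(row.keys() for row in reference_rows))
    cur_cols = set().union(*(row.keys() for row in current_rows))
    report = {
        "added_features": sorted(cur_cols - ref_cols),
        "removed_features": sorted(ref_cols - cur_cols),
        "new_values": {},
    }
    # Only compare value sets for columns present in both batches.
    for col in categorical_cols:
        if col not in ref_cols or col not in cur_cols:
            continue
        ref_vals = {row[col] for row in reference_rows if col in row}
        cur_vals = {row[col] for row in current_rows if col in row}
        unseen = cur_vals - ref_vals
        if unseen:
            report["new_values"][col] = sorted(unseen)
    return report


# Hypothetical training-time and production-time batches.
reference = [
    {"chemistry": "NMC", "label": "healthy"},
    {"chemistry": "LFP", "label": "failed"},
]
current = [
    {"chemistry": "NMC", "label": "degraded", "firmware": "v2"},
]

report = schema_diff(reference, current, categorical_cols=["chemistry", "label"])
```

Here the new `firmware` key is a feature change, and the unseen `degraded` label value is a label schema change.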
Data Quality Metrics
Summary Statistics // df.describe()
1. Mean
2. Median
3. Variance
4. Skewness
5. Min-Max Range
6. Percentage of Null
7. Percentage of 0
8. Standard Deviation
9. Percentage of Uniques
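In pandas, `df.describe()` reports count, mean, standard deviation, min, max and quartiles for numeric columns, so several of the listed metrics (skewness and the percentages of nulls, zeros and uniques) need separate calls. As a dependency-free sketch, the nine metrics can be computed for a single column as below. The conventions chosen here are assumptions: `None` marks a missing value, variance and skewness are population versions, and percentages are relative to the total row count.

```python
import statistics


def summary_stats(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    variance = statistics.pvariance(present)  # population variance
    std = variance ** 0.5
    # Population skewness: third central moment divided by std cubed.
    third_moment = sum((v - mean) ** 3 for v in present) / len(present)
    skewness = third_moment / std ** 3 if std > 0 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(present),
        "variance": variance,
        "skewness": skewness,
        "min": min(present),
        "max": max(present),
        "pct_null": 100 * (total - len(present)) / total,
        "pct_zero": 100 * sum(v == 0 for v in present) / total,
        "std": std,
        "pct_unique": 100 * len(set(present)) / total,
    }


stats = summary_stats([0, 0, 1, 2, None, 7])
```

Tracking these values per column over time, and comparing them with the values seen at training time, gives a first line of monitoring before any statistical test is run.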
Advanced Metrics
1. Two-sample hypothesis test
a. Determines whether the difference between two populations is statistically significant
b. Caveat: statistically significant does not mean practically important. As a heuristic, a difference that is already detectable in a small sample is more likely to matter in practice
c. Kolmogorov-Smirnov Test
i. A non-parametric test of whether two samples come from the same distribution
d. Least-Squares Density Difference
i. Based on the least-squares density-difference estimation method
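The two-sample Kolmogorov-Smirnov test can be sketched without dependencies: the statistic is the largest gap between the two empirical CDFs, and the p-value below uses the standard asymptotic Kolmogorov series, which is only an approximation for small samples. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used; the sample sizes and distributions here are assumptions for illustration.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Largest absolute difference between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lambda^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * total))


# Assumption: a training-time feature sample and a shifted production sample.
rng = random.Random(42)
train_sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
prod_sample = [rng.gauss(0.5, 1.0) for _ in range(1000)]

d_stat = ks_statistic(train_sample, prod_sample)
p_value = ks_pvalue(d_stat, len(train_sample), len(prod_sample))
drift_detected = p_value < 0.01
```

The test works on one feature at a time. With very large samples even tiny differences become significant, which is the caveat noted above.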
Time-series specific
● Event Data loss
○ There are gaps in the time-series
● Value spikes
○ Sudden changes which are implausible for the domain
● Signal Noise
○ Inaccurate measurement
● Diverging sampling
○ Different sampling rates
● Inconsistent noise model
○ The noise level changes cyclically
● Divergent despite correlation
○ Values that are correlated behave differently
● Heteroscedasticity
○ Sub-populations have different variability
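Three of the time-series checks listed above (data loss, value spikes, diverging sampling) can be sketched with plain list operations. The thresholds (`tolerance`, `max_step`) and the example readings are assumptions that would need to be set from domain knowledge.

```python
def find_gaps(timestamps, expected_step, tolerance=1.5):
    """Pairs of consecutive timestamps further apart than expected (data loss)."""
    return [
        (prev, nxt)
        for prev, nxt in zip(timestamps, timestamps[1:])
        if nxt - prev > expected_step * tolerance
    ]


def find_spikes(values, max_step):
    """Indices where the value jumps by more than is plausible for the domain."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > max_step
    ]


def sampling_steps(timestamps):
    """Distinct intervals between samples; more than one hints at diverging sampling."""
    return sorted({nxt - prev for prev, nxt in zip(timestamps, timestamps[1:])})


# Hypothetical cell-voltage readings, one per second, with a hole and a glitch.
timestamps = [0, 1, 2, 5, 6]
voltages = [3.7, 3.7, 3.6, 9.9, 3.6]

gaps = find_gaps(timestamps, expected_step=1)
spikes = find_spikes(voltages, max_step=1.0)
steps = sampling_steps(timestamps)
```

A single glitch is reported twice by `find_spikes`, once for the jump up and once for the jump back down, which helps to tell a transient spike from a level change.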
Machine Learning for Data Quality
Dimensionality Reduction
This method aims to reduce the number of input variables in a dataset by projecting the original high-dimensional input data to a low-dimensional space
● Uniform Manifold Approximation and Projection (UMAP)
○ The main feature of this algorithm is its nonlinear representation of the data. Compared to other dimensionality reduction algorithms, it scales well with the dimensionality and size of a dataset and projects quickly
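UMAP itself is nonlinear and is normally used through a dedicated library. To illustrate only the general idea of projecting high-dimensional data to a lower-dimensional space, the sketch below swaps in a much simpler linear technique: projection onto the first principal component (PCA), found by power iteration. This is a stand-in for illustration, not UMAP, and the example data are made up.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return column means and the unit direction of largest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # d x d covariance matrix of the centred data.
    cov = [[sum(r[a] * r[b] for r in centred) / n for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # Assumes the data are not constant and the start vector is not
    # orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, direction):
    """1-D coordinate of every row along the given direction."""
    return [
        sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
        for r in rows
    ]


# Three features, but all variation lies along the direction (1, 2, 0).
rows = [(float(x), 2.0 * x, 0.0) for x in range(10)]
means, direction = first_principal_component(rows)
projected = project(rows, means, direction)
```

The three-dimensional rows are reduced to one coordinate each without losing information, because the data vary along a single direction. A nonlinear method such as UMAP targets cases where no single linear direction captures the structure.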
Clustering
The goal of clustering is to detect distinct groups in an unlabeled dataset. Users are expected to define what counts as a correct cluster so that the clustering results meet their expectations.
● Density-based spatial clustering of applications with noise (DBSCAN)
○ Groups together instances that are close to each other, based on a distance measure and a minimum number of instances, both specified in advance
● Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
○ Turns DBSCAN into a hierarchical clustering algorithm and extracts a flat clustering based on the stability of clusters.
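The DBSCAN idea described above can be sketched in a few lines: a point with at least `min_samples` neighbours within distance `eps` (counting itself) starts or extends a cluster, and points reachable from no such core point are labelled noise. This is a brute-force illustration with made-up parameter values and points, not a replacement for a library implementation.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbours(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The isolated point at (50, 50) ends up labelled as noise, which is what makes density-based clustering useful for spotting suspicious records.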
Anomaly Detection
Anomaly detection is usually not applied on its own; it often goes along with dimensionality reduction and clustering. Using dimensionality reduction as a pre-stage, the high-dimensional space is transformed into a lower-dimensional one. The dense region that holds the majority of the data points in this lower-dimensional space can then be identified as normal. Data points located far away from this “normal” region are outliers or anomalies.
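One simple way to express "far away from the dense region" is the distance to the k-th nearest neighbour: points in dense areas have close neighbours, isolated points do not. The sketch below assumes the points have already been projected to a low-dimensional space; the scoring rule, `k` and `threshold` are illustrative choices rather than the method used in the talk.

```python
import math


def knn_distance_scores(points, k):
    """Distance from each point to its k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_anomalies(points, k, threshold):
    """Indices of points whose k-th neighbour is further away than threshold."""
    scores = knn_distance_scores(points, k)
    return [i for i, score in enumerate(scores) if score > threshold]


# Hypothetical 2-D embedding: a dense group and one isolated point.
embedded = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
anomalies = flag_anomalies(embedded, k=2, threshold=1.0)
```

The threshold could also be derived from the score distribution itself, for example a high percentile of the scores observed on trusted data.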
References
1. https://github.com/saradindusengupta/GDG_Cloud_Community_Day_Aug07_2022
2. Data quality in time series data: An experience report - Gitzel R, 2005
3. Towards Automated Data Quality Management for Machine Learning - Rukat, Tammo et al, 2022
4. Dataset Shift in Machine Learning - Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer and Neil D. Lawrence
5. https://eng.uber.com/monitoring-data-quality-at-scale/
6. https://towardsdatascience.com/automated-data-quality-testing-at-scale-with-sql-and-machine-learning-f3a68e79d8a8
7. What to Do about Missing Values in Time-Series Cross-Section Data - James Honaker, Gary King, 2010
Thank You
/in/saradindusengupta @iamsaradindu /saradindusengupta

GDG Cloud Community Day 2022 - Managing data quality in Machine Learning
