SlideShare a Scribd company logo
1 of 6
Download to read offline
AI for Data Quality:
Epic 3a. Unsupervised methods
Data Defects
●
Outlier data means this data is totally different from the others
●
The key idea of this approach is to automatically discover domain-specific
patterns and generate rules for outliers
●
There are many outlier detection approaches including probabilistic and
statistical models, linear correlation analysis, proximity-based detection,
and supervised outlier detection
●
An unsupervised machine learning approach called autoencoder will be used
to learn hidden patterns in the attributes of the unlabelled data records
Country City Postcode Count
USA Michigan 100
UK Birmingham DY4 200
USA Birmingham 35251 100
USA Birmingham B1 3
Clustering
●
Clustering is an unsupervised technique that has
been widely used to investigate properties of the
data through grouping similar data into several
categories.
●
The similarity of the records are measured using
distance functions, such as Euclidean and
Manhattan distances.
●
Distance-based clustering algorithms cannot
derive the complex non-linear relationships that
exist among attributes of the data.
PCA
●
Principal Component Analysis (PCA) is a
representation learning approach that investigates
the relationships among the data attributes.
●
With this approach the features of correlated
attributes are converted into a set of linearly
uncorrelated attributes called principal
components.
●
The PCA representation learning can only
investigate the linear relationships among the
attributes.
Autoencoder
●
Autoencoder is a type of neural network that efficiently models complex
associations among attributes of the data through the composition of
several layers of non-linearity
●
A trained autoencoder model with its parameters learned to best describe
the patterns in the input data records:
– 1. Reconstructs the data records using the patterns discovered during the training
– 2. Detects invalid records - records that do not conform to the patterns discovered by
the autoencoder
Proposed approach
●
Prepare AutoML pipelines
for Cluster, PCA, Decision
Tree and Autoencoder
models
●
Use them in an ensemble
model (MajorityVote or
SVM)
●
Generate AutoDQ rules and
present them for User
validation in natural
language ordered by some
metric/confidence level
“City Birmingham with postcode not
typical for UK West Midland should
be flagged as an error with proposed
correction for country column =
‘USA’.”

More Related Content

Similar to Unsupervised AI for Data Quality

Introduction to image processing and pattern recognition
Introduction to image processing and pattern recognitionIntroduction to image processing and pattern recognition
Introduction to image processing and pattern recognitionSaibee Alam
 
Tasks amenable to AI automation in data science _.pptx
Tasks amenable to AI automation in data science _.pptxTasks amenable to AI automation in data science _.pptx
Tasks amenable to AI automation in data science _.pptxMirzaJahanzeb5
 
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
C LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHmC LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHm
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHmIJCI JOURNAL
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESIRJET Journal
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedKrishnaram Kenthapadi
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTIJERA Editor
 
College_Tech-seminar_2024_Indrajith.pptx
College_Tech-seminar_2024_Indrajith.pptxCollege_Tech-seminar_2024_Indrajith.pptx
College_Tech-seminar_2024_Indrajith.pptxIndrajithN1
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptxDr.Shweta
 
(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptxFaiz430036
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineMichael Gerke
 
Study and development of methods and tools for testing, validation and verif...
 Study and development of methods and tools for testing, validation and verif... Study and development of methods and tools for testing, validation and verif...
Study and development of methods and tools for testing, validation and verif...Emilio Serrano
 
Machine learning
Machine learningMachine learning
Machine learninghplap
 
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...IRJET Journal
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Jayanti Pande
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxRakshaAgrawal21
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCRakshaAgrawal21
 

Similar to Unsupervised AI for Data Quality (20)

Introduction to image processing and pattern recognition
Introduction to image processing and pattern recognitionIntroduction to image processing and pattern recognition
Introduction to image processing and pattern recognition
 
Tasks amenable to AI automation in data science _.pptx
Tasks amenable to AI automation in data science _.pptxTasks amenable to AI automation in data science _.pptx
Tasks amenable to AI automation in data science _.pptx
 
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
C LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHmC LUSTERING  B ASED  A TTRIBUTE  S UBSET  S ELECTION  U SING  F AST  A LGORITHm
C LUSTERING B ASED A TTRIBUTE S UBSET S ELECTION U SING F AST A LGORITHm
 
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGESCASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
CASE STUDY: ADMISSION PREDICTION IN ENGINEERING AND TECHNOLOGY COLLEGES
 
Responsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons LearnedResponsible AI in Industry: Practical Challenges and Lessons Learned
Responsible AI in Industry: Practical Challenges and Lessons Learned
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOT
 
College_Tech-seminar_2024_Indrajith.pptx
College_Tech-seminar_2024_Indrajith.pptxCollege_Tech-seminar_2024_Indrajith.pptx
College_Tech-seminar_2024_Indrajith.pptx
 
Pathway and network analysis
Pathway and network analysisPathway and network analysis
Pathway and network analysis
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
 
(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx(Faiz) MachineLearning(ppt).pptx
(Faiz) MachineLearning(ppt).pptx
 
DATA MINING.pptx
DATA MINING.pptxDATA MINING.pptx
DATA MINING.pptx
 
Descriptive m0deling
Descriptive m0delingDescriptive m0deling
Descriptive m0deling
 
Guiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning PipelineGuiding through a typical Machine Learning Pipeline
Guiding through a typical Machine Learning Pipeline
 
Study and development of methods and tools for testing, validation and verif...
 Study and development of methods and tools for testing, validation and verif... Study and development of methods and tools for testing, validation and verif...
Study and development of methods and tools for testing, validation and verif...
 
Machine learning
Machine learningMachine learning
Machine learning
 
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
IRJET- Prediction of Crime Rate Analysis using Supervised Classification Mach...
 
C3 w5
C3 w5C3 w5
C3 w5
 
Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.Data Mining Module 2 Business Analytics.
Data Mining Module 2 Business Analytics.
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptx
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSC
 

More from Vera Ekimenko

Data Quality with AI
Data Quality with AIData Quality with AI
Data Quality with AIVera Ekimenko
 
Deep Reinforcement Learning for Portfolio Optimization
Deep Reinforcement Learning for Portfolio OptimizationDeep Reinforcement Learning for Portfolio Optimization
Deep Reinforcement Learning for Portfolio OptimizationVera Ekimenko
 
Artificial Intelligence for Data Quality
Artificial Intelligence for Data QualityArtificial Intelligence for Data Quality
Artificial Intelligence for Data QualityVera Ekimenko
 
Deep Learning Hackathon
Deep Learning HackathonDeep Learning Hackathon
Deep Learning HackathonVera Ekimenko
 
Cloudera migration oozie_hadoop_ci_cd_pipeline
Cloudera migration oozie_hadoop_ci_cd_pipelineCloudera migration oozie_hadoop_ci_cd_pipeline
Cloudera migration oozie_hadoop_ci_cd_pipelineVera Ekimenko
 
Artificial Intelligence Hackathon
Artificial Intelligence HackathonArtificial Intelligence Hackathon
Artificial Intelligence HackathonVera Ekimenko
 
KeyAchivementsMimecast
KeyAchivementsMimecastKeyAchivementsMimecast
KeyAchivementsMimecastVera Ekimenko
 
KeyAchivementsJustisPublishing
KeyAchivementsJustisPublishingKeyAchivementsJustisPublishing
KeyAchivementsJustisPublishingVera Ekimenko
 
HCM Access Insight Dashboard
HCM Access Insight DashboardHCM Access Insight Dashboard
HCM Access Insight DashboardVera Ekimenko
 

More from Vera Ekimenko (13)

Data Quality with AI
Data Quality with AIData Quality with AI
Data Quality with AI
 
AML Knowledge Graph
AML Knowledge GraphAML Knowledge Graph
AML Knowledge Graph
 
Deep Reinforcement Learning for Portfolio Optimization
Deep Reinforcement Learning for Portfolio OptimizationDeep Reinforcement Learning for Portfolio Optimization
Deep Reinforcement Learning for Portfolio Optimization
 
Artificial Intelligence for Data Quality
Artificial Intelligence for Data QualityArtificial Intelligence for Data Quality
Artificial Intelligence for Data Quality
 
Deep Learning Hackathon
Deep Learning HackathonDeep Learning Hackathon
Deep Learning Hackathon
 
Cloudera migration oozie_hadoop_ci_cd_pipeline
Cloudera migration oozie_hadoop_ci_cd_pipelineCloudera migration oozie_hadoop_ci_cd_pipeline
Cloudera migration oozie_hadoop_ci_cd_pipeline
 
Artificial Intelligence Hackathon
Artificial Intelligence HackathonArtificial Intelligence Hackathon
Artificial Intelligence Hackathon
 
CSharp
CSharpCSharp
CSharp
 
DWHRestructure
DWHRestructureDWHRestructure
DWHRestructure
 
KeyAchivementsMimecast
KeyAchivementsMimecastKeyAchivementsMimecast
KeyAchivementsMimecast
 
KeyAchivementsJustisPublishing
KeyAchivementsJustisPublishingKeyAchivementsJustisPublishing
KeyAchivementsJustisPublishing
 
buy_in
buy_inbuy_in
buy_in
 
HCM Access Insight Dashboard
HCM Access Insight DashboardHCM Access Insight Dashboard
HCM Access Insight Dashboard
 

Recently uploaded

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationBoston Institute of Analytics
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 

Recently uploaded (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 

Unsupervised AI for Data Quality

  • 1. AI for Data Quality: Epic 3a. Unsupervised methods
  • 2. Data Defects ● Outlier data means this data is totally different from the others ● The key idea of this approach is to automatically discover domain-specific patterns and generate rules for outliers ● There are many outlier detection approaches including probabilistic and statistical models, linear correlation analysis, proximity-based detection, and supervised outlier detection ● An unsupervised machine learning approach called autoencoder will be used to learn hidden patterns in the attributes of the unlabelled data records Country City Postcode Count USA Michigan 100 UK Birmingham DY4 200 USA Birmingham 35251 100 USA Birmingham B1 3
  • 3. Clustering ● Clustering is an unsupervised technique that has been widely used to investigate properties of the data through grouping similar data into several categories. ● The similarity of the records are measured using distance functions, such as Euclidean and Manhattan distances. ● Distance-based clustering algorithms cannot derive the complex non-linear relationships that exist among attributes of the data.
  • 4. PCA ● Principal Component Analysis (PCA) is a representation learning approach that investigates the relationships among the data attributes. ● With this approach the features of correlated attributes are converted into a set of linearly uncorrelated attributes called principal components. ● The PCA representation learning can only investigate the linear relationships among the attributes.
  • 5. Autoencoder ● Autoencoder is a type of neural network that efficiently models complex associations among attributes of the data through the composition of several layers of non-linearity ● A trained autoencoder model with its parameters learned to best describe the patterns in the input data records: – 1. Reconstructs the data records using the patterns discovered during the training – 2. Detects invalid records - records that do not conform to the patterns discovered by the autoencoder
  • 6. Proposed approach ● Prepare AutoML pipelines for Cluster, PCA, Decision Tree and Autoencoder models ● Use them in an ensemble model (MajorityVote or SVM) ● Generate AutoDQ rules and present them for User validation in natural language ordered by some metric/confidence level “City Birmingham with postcode not typical for UK West Midland should be flagged as an error with proposed correction for country column = ‘USA’.”