Unsupervised AI for Data Quality


This document discusses unsupervised machine learning approaches for detecting outliers and data defects, including clustering, PCA, and autoencoders. It proposes using an autoencoder model to learn hidden patterns in unlabeled data attributes and then detect invalid records that do not conform to the learned patterns. The approach involves preparing AutoML pipelines for clustering, PCA, decision trees, and autoencoders, then using them in an ensemble model to generate rules for detecting outliers and data quality issues, which would be presented to users for validation.


AI for Data Quality:
Epic 3a. Unsupervised methods

Data Defects
● An outlier is a record that differs markedly from the rest of the data.
● The key idea of this approach is to automatically discover domain-specific patterns and generate rules for detecting outliers.
● There are many outlier detection approaches, including probabilistic and statistical models, linear correlation analysis, proximity-based detection, and supervised outlier detection.
● An unsupervised machine learning approach called an autoencoder will be used to learn hidden patterns in the attributes of the unlabelled data records.
Country  City        Postcode  Count
USA      Michigan              100
UK       Birmingham  DY4       200
USA      Birmingham  35251     100
USA      Birmingham  B1        3
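
To make the idea of "discovering patterns and flagging records that do not fit" concrete, here is a minimal sketch using a simple frequency baseline on the table above. It is not the autoencoder approach the deck proposes. The function name `rare_combinations`, the 5% threshold, and the reading of the Michigan row as having an empty postcode are all assumptions made for illustration.

```python
from collections import defaultdict

# Rows from the example table: (country, city, postcode, count).
# The Michigan row is assumed to have an empty postcode.
ROWS = [
    ("USA", "Michigan", "", 100),
    ("UK", "Birmingham", "DY4", 200),
    ("USA", "Birmingham", "35251", 100),
    ("USA", "Birmingham", "B1", 3),
]


def rare_combinations(rows, min_share=0.05):
    """Flag postcodes that are rare within their (country, city) group.

    min_share is a hypothetical threshold, not a value from the slides.
    """
    totals = defaultdict(int)
    for country, city, _postcode, count in rows:
        totals[(country, city)] += count

    flagged = []
    for country, city, postcode, count in rows:
        share = count / totals[(country, city)]
        if share < min_share:
            flagged.append((country, city, postcode, round(share, 3)))
    return flagged


flagged = rare_combinations(ROWS)
print(flagged)
```

With this data only the USA / Birmingham / B1 combination falls below the threshold, because it accounts for 3 of the 103 records in its group.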

Clustering
● Clustering is an unsupervised technique that has been widely used to investigate properties of the data by grouping similar records into several categories.
● The similarity of the records is measured using distance functions, such as Euclidean and Manhattan distances.
● Distance-based clustering algorithms cannot derive the complex non-linear relationships that exist among attributes of the data.
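
The two distance functions named above can be sketched as follows, together with a simple proximity score (mean distance to the k nearest neighbours). This is a simplified stand-in rather than a full clustering algorithm. The sample points, the helper `knn_outlier_scores`, and the choice of k = 2 are assumptions for illustration only.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_outlier_scores(points, k=2, distance=euclidean):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(distance(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical records: four close together, one far away.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
scores = knn_outlier_scores(POINTS)
most_isolated = POINTS[scores.index(max(scores))]
print(most_isolated)
```

Swapping `distance=manhattan` changes the metric without touching the scoring logic. Either way the score depends only on pairwise distances, which is why such methods miss non-linear relationships among attributes.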

PCA
● Principal Component Analysis (PCA) is a representation learning approach that investigates the relationships among the data attributes.
● With this approach, correlated attributes are converted into a set of linearly uncorrelated attributes called principal components.
● PCA can only capture linear relationships among the attributes.
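
As a rough illustration of how PCA can expose records that break a linear relationship, the sketch below handles the two-attribute case only, where the first principal axis has a closed form. Each record is projected onto that axis, and the squared distance left over serves as a reconstruction error. The data and function names are assumptions, and a real pipeline would use a general-purpose PCA implementation.

```python
import math


def first_principal_axis(points):
    """Return the mean and the unit vector of the first principal component (2-D only)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form angle of the major axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def reconstruction_errors(points):
    """Squared distance between each point and its projection onto the first axis."""
    (mx, my), (ux, uy) = first_principal_axis(points)
    errors = []
    for x, y in points:
        dx, dy = x - mx, y - my
        t = dx * ux + dy * uy  # coordinate along the first principal component
        errors.append((dx - t * ux) ** 2 + (dy - t * uy) ** 2)
    return errors


# Hypothetical records: five follow y = 2x, the last one does not.
DATA = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10), (3, 1)]
errors = reconstruction_errors(DATA)
worst = DATA[errors.index(max(errors))]
print(worst)
```

The record that breaks the linear trend gets the largest error. A record that violates a non-linear relationship would not necessarily stand out, which is the limitation noted above.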

Autoencoder
● An autoencoder is a type of neural network that efficiently models complex associations among attributes of the data through the composition of several layers of non-linearity.
● A trained autoencoder model, with its parameters learned to best describe the patterns in the input data records:
– 1. Reconstructs the data records using the patterns discovered during training
– 2. Detects invalid records, that is, records that do not conform to the patterns discovered by the autoencoder
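
The two steps above can be sketched with a deliberately tiny autoencoder written in plain Python: two inputs, one tanh hidden unit, and two linear outputs, trained by full-batch gradient descent. This is a toy under stated assumptions, not the model the deck proposes. A realistic autoencoder would have several non-linear layers and would be built with a deep learning library. The training data, initial weights, learning rate, and epoch count are all hypothetical.

```python
import math


def initial_params():
    # Fixed starting weights so the sketch is deterministic (an assumption).
    return [0.5, 0.5], 0.0, [0.5, 0.5], [0.0, 0.0]


def forward(record, params):
    w1, b1, w2, b2 = params
    h = math.tanh(sum(w * x for w, x in zip(w1, record)) + b1)  # encoder
    out = [w * h + b for w, b in zip(w2, b2)]                    # decoder
    return h, out


def reconstruction_error(record, params):
    _, out = forward(record, params)
    return sum((o - x) ** 2 for o, x in zip(out, record))


def mean_loss(records, params):
    return sum(reconstruction_error(r, params) for r in records) / len(records)


def train(records, epochs=2000, lr=0.05):
    w1, b1, w2, b2 = initial_params()
    n = len(records)
    for _ in range(epochs):
        g_w1, g_b1, g_w2, g_b2 = [0.0, 0.0], 0.0, [0.0, 0.0], [0.0, 0.0]
        for r in records:
            h, out = forward(r, (w1, b1, w2, b2))
            d_out = [2 * (o - x) / n for o, x in zip(out, r)]
            d_pre = sum(d * w for d, w in zip(d_out, w2)) * (1 - h * h)
            for j in range(2):
                g_w2[j] += d_out[j] * h
                g_b2[j] += d_out[j]
                g_w1[j] += d_pre * r[j]
            g_b1 += d_pre
        w1 = [w - lr * g for w, g in zip(w1, g_w1)]
        b1 -= lr * g_b1
        w2 = [w - lr * g for w, g in zip(w2, g_w2)]
        b2 = [b - lr * g for b, g in zip(b2, g_b2)]
    return w1, b1, w2, b2


# Hypothetical valid records: both attributes always agree.
TRAIN = [(x / 10, x / 10) for x in range(-8, 9, 2)]
initial_loss = mean_loss(TRAIN, initial_params())
params = train(TRAIN)
final_loss = mean_loss(TRAIN, params)

# Step 1: reconstruct records. Step 2: flag those that reconstruct poorly.
train_errors = [reconstruction_error(r, params) for r in TRAIN]
outlier_error = reconstruction_error((0.8, -0.8), params)
print(final_loss, max(train_errors), outlier_error)
```

The record (0.8, -0.8) contradicts the pattern learned from the training data, so its reconstruction error is larger than that of any training record. In practice the threshold separating valid from invalid records would have to be chosen from the error distribution.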

Proposed approach
● Prepare AutoML pipelines for Clustering, PCA, Decision Tree and Autoencoder models.
● Use them in an ensemble model (MajorityVote or SVM).
● Generate AutoDQ rules and present them to the user for validation in natural language, ordered by a metric or confidence level. Example rule:
“City Birmingham with postcode not typical for UK West Midlands should be flagged as an error with proposed correction for country column = ‘USA’.”
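
A minimal sketch of the MajorityVote variant of the ensemble step, assuming each of the four models has already produced a boolean flag per record. The votes below are invented for illustration, and the confidence measure (share of detectors that agree) and the message wording are assumptions rather than the AutoDQ rule format.

```python
def majority_vote(flags):
    """Return (is_outlier, confidence), where confidence is the share of detectors that flagged."""
    share = sum(flags) / len(flags)
    return share > 0.5, share


# Hypothetical per-record flags in the order:
# clustering, PCA, decision tree, autoencoder.
VOTES = {
    ("USA", "Michigan", ""): [True, True, True, False],
    ("UK", "Birmingham", "DY4"): [False, False, False, False],
    ("USA", "Birmingham", "35251"): [False, True, False, False],
    ("USA", "Birmingham", "B1"): [True, True, True, True],
}


def build_rules(votes):
    """Turn majority-flagged records into messages, highest confidence first."""
    rules = []
    for (country, city, postcode), flags in votes.items():
        is_outlier, confidence = majority_vote(flags)
        if is_outlier:
            text = (
                f"Record with country '{country}', city '{city}' and "
                f"postcode '{postcode}' was flagged by "
                f"{sum(flags)} of {len(flags)} detectors."
            )
            rules.append((confidence, text))
    return sorted(rules, key=lambda rule: rule[0], reverse=True)


rules = build_rules(VOTES)
for confidence, text in rules:
    print(f"{confidence:.2f}  {text}")
```

Ordering by confidence means the user validates the most strongly supported rules first. An SVM-based combiner, the other option named above, would instead learn how much weight to give each detector.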




More from Vera Ekimenko
