AI Data Quality solution
1. Serverless platform
- Nuclio (open source)
- High Performance
- No Lock-Ins
- GPU Support
2. Data Quality AI services
- Data Profiling and Rules configuration
- Data Quality Rules generation and execution
- AI assisted Data Quality Reconciliation
- Golden Records
- Cross System Checks
3. Analytics and User Workspace
- Data Quality Dashboard with Pivot
- Daily/Weekly Root-Cause Analysis Dashboards with generated narrative (Stories)
- Data Quality AI-led User Workflows/Writeback
- 3D Graph visualisation of relations
Serverless Platform (1/2)
Why Nuclio (open source serverless platform):
- The only serverless framework with GPU support and fast file access
- High-performance parallel execution engine
- Running models as a function in a serving layer (instead of running them in a 3rd party container)
- Easy-to-use interface for controlling GPU resources per function
Serverless Platform (1/2)
AI Data Profiling (1/2)
Data Profiling can be run interactively by a user or in unsupervised mode to produce
suggestions and to generate Data Quality rules and execution settings (thresholds, patterns,
normality)
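As a rough illustration of how unsupervised profiling could turn into rule suggestions, the sketch below proposes rules from one column's observed values. The function name `suggest_rules`, the `max_categories` cut-off and the rule dictionary layout are assumptions made for this example; the slides do not describe the actual service.

```python
from typing import Any


def suggest_rules(column: str, values: list[Any], max_categories: int = 10) -> list[dict]:
    """Suggest simple data quality rules from one column's observed values.

    Hypothetical helper: the thresholds and rule names are illustrative only.
    """
    rules = []
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)

    # No nulls observed: propose a completeness rule.
    if values and len(non_null) == len(values):
        rules.append({"category": "Completeness", "column": column, "rule": "not_null"})

    if non_null and len(distinct) == len(non_null):
        # Every non-null value is distinct: propose a uniqueness rule.
        rules.append({"category": "Uniqueness", "column": column, "rule": "no_duplicates"})
    elif 0 < len(distinct) <= max_categories:
        # Few distinct values: propose a list of acceptable values.
        rules.append({
            "category": "Drift",
            "column": column,
            "rule": "acceptable_values",
            "params": sorted(distinct, key=str),
        })
    return rules
```

A user running profiling interactively would review such suggestions before they become active rules; in unsupervised mode they could be applied directly.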
AI Data Profiling (2/2)
AI Data Quality AI (1/6)
Why another Data Quality platform?
Non-AI Data Quality | ML Core Data Quality AI
Linear processing | Parallel and real-time / stream processing with serverless micro-service architecture
Regular algorithms | Superior parallel algorithms with complexity lower than N**2, as required for truly scalable applications
High-maintenance rules | The Data Quality team manages services / algorithms / bots, not data
Manual operations: updates and maintenance are sporadic, error-prone and resource-constrained | Any scalable system must perform the vast majority of its operations automatically. Only AI can scale to the level required by large enterprises.
AI Data Quality AI (2/6)
AI Data Quality AI services will use probabilistic algorithms applied
uniquely to every input data set:
- Rules identification and generation
- Hyper Fingerprinting
  - a unique profile signature based on column properties
- New data identification and automatic rules execution
- Numerical and string drift in the existing data
- Record Anomalies identification
- Fingerprint Maintenance/Evolution via DQ Knowledge Graph
How fingerprinting works
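The original diagram is not available here, so the following is only a minimal sketch of the idea of "a unique profile signature based on column properties". The chosen properties, the helper name `column_fingerprint` and the use of a SHA-256 hash are assumptions for illustration; the deck describes its own approach as probabilistic, which an exact hash is not.

```python
import hashlib
import json


def column_fingerprint(values: list) -> dict:
    """Build a small property profile for a column and hash it into a signature.

    Hypothetical sketch: a real fingerprint would use many more properties
    and tolerate small changes instead of requiring an exact match.
    """
    non_null = [v for v in values if v is not None]
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    props = {
        "basic_type": "numeric" if is_numeric else "string",
        "has_nulls": len(non_null) < len(values),
        "is_unique": len(set(non_null)) == len(non_null),
        "max_length": max((len(str(v)) for v in non_null), default=0),
    }
    # Serialise with sorted keys so the same properties always give the same hash.
    payload = json.dumps(props, sort_keys=True).encode("utf-8")
    return {"properties": props, "signature": hashlib.sha256(payload).hexdigest()[:16]}
```

Two columns with the same properties get the same signature, which is how newly arrived data could be matched to rules that already exist for a known column.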
AI Data Quality AI (3/6)
AI Data Quality AI (4/6)
Single Columns – Cardinalities
(1) Number of rows
(2) Number of null values
(3) Percentage of null values
(4) Number of distinct values; sometimes called “cardinality”
(5) Number of distinct values divided by the number of rows
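Metrics (1) to (5) can be computed in a single pass over a column. The sketch below assumes the column is a Python list with `None` standing for null; the function name and dictionary keys are illustrative.

```python
def cardinality_profile(values: list) -> dict:
    """Compute metrics (1)-(5) for one column; None represents a null value."""
    rows = len(values)
    nulls = sum(1 for v in values if v is None)
    distinct = len({v for v in values if v is not None})
    return {
        "rows": rows,                                          # (1)
        "nulls": nulls,                                        # (2)
        "null_pct": 100.0 * nulls / rows if rows else 0.0,     # (3)
        "distinct": distinct,                                  # (4)
        "uniqueness": distinct / rows if rows else 0.0,        # (5)
    }
```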
Single Columns - Value distributions
(6) Frequency histograms (equi-width, equi-depth)
(7) Minimum and maximum values in a numeric column
(8) Constancy: Frequency of most frequent value divided by number of rows
(9) Quartiles: 3 points that divide the (numeric) values into 4 equal groups
(10) Distribution of first digit in numeric values; to check Benford’s law
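A minimal sketch of metrics (7) to (10) for a numeric column, using only the Python standard library. The function name is hypothetical, nulls are assumed to have been removed already, and the histograms in (6) are left out for brevity.

```python
import statistics
from collections import Counter


def distribution_profile(values: list[float]) -> dict:
    """Compute min/max, constancy, quartiles and first-digit counts.

    Expects at least two non-null numeric values.
    """
    counts = Counter(values)
    # Scientific notation puts the leading significant digit first,
    # e.g. 1900 -> '1.900000000000000e+03'.
    first_digits = Counter(f"{abs(v):.15e}"[0] for v in values if v != 0)
    return {
        "min": min(values),                                      # (7)
        "max": max(values),                                      # (7)
        "constancy": counts.most_common(1)[0][1] / len(values),  # (8)
        "quartiles": statistics.quantiles(values, n=4),          # (9)
        "first_digit_counts": dict(sorted(first_digits.items())),  # (10)
    }
```

The first-digit counts can then be compared with the frequencies expected under Benford's law, where the digit 1 leads roughly 30% of the time.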
Single Columns - Patterns, data types, and domains
(11) Basic type (e.g., numeric, alphanumeric, date, time)
(12) DBMS-specific data type (e.g., varchar, timestamp)
(13) Measurement of value length (minimum, maximum, average, and median)
(14) Maximum number of digits in numeric values
(15) Maximum number of decimals in numeric values
(16) Histogram of value patterns (Aa9...)
(17) Generic semantic data type (e.g., code, date/time, quantity, identifier)
(18) Semantic domain (e.g., credit card, first name, city)
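The value-pattern histogram in (16) can be sketched by mapping every character to a class symbol: `A` for uppercase letters, `a` for lowercase letters, `9` for digits, and anything else kept as-is. The function names below are illustrative.

```python
from collections import Counter


def value_pattern(value: str) -> str:
    """Map a string to its pattern, e.g. 'Ab-123' -> 'Aa-999'."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces unchanged
    return "".join(out)


def pattern_histogram(values: list) -> Counter:
    """Count how often each pattern occurs, ignoring nulls."""
    return Counter(value_pattern(str(v)) for v in values if v is not None)
```

A dominant pattern such as `99999999` for a date column is the kind of evidence from which a conformity rule (for example a `yyyyMMdd` format) could be proposed.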
Dependencies
(19) Unique column combinations (key discovery)
(20) Relaxed unique column combinations
(21) Inclusion dependencies (foreign key discovery)
(22) Relaxed inclusion dependencies
(23) Functional dependencies
(24) Conditional functional dependencies
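To make (19) and (23) concrete, the sketch below checks whether a given column combination is unique and whether a functional dependency holds. It only validates one candidate; discovering all candidates efficiently is a harder search problem that this example does not attempt. Rows are assumed to be dictionaries, and the function names are hypothetical.

```python
def is_unique_combination(rows: list[dict], columns: list[str]) -> bool:
    """Return True if no two rows share the same values in `columns` (19)."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_functional_dependency(rows: list[dict], lhs: list[str], rhs: str) -> bool:
    """Return True if the `lhs` values always determine the `rhs` value (23)."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first rhs value seen for this key.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True
```

Both checks run in one pass using hash lookups, so their cost grows linearly with the number of rows.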
Advanced Multi-Column profiling
(25) Correlation analysis
(26) Association rule mining
(27) Cluster analysis
(28) Outlier detection
(29) Exact duplicate tuple detection
(30) Relaxed duplicate tuple detection
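Exact duplicate detection (29) is a simple case where comparing every pair of rows (N**2 comparisons) can be avoided: grouping rows by their hash needs only one pass. The sketch assumes rows are hashable tuples; the function name is illustrative. Relaxed duplicate detection (30) needs similarity matching and is not covered here.

```python
from collections import defaultdict


def find_exact_duplicates(rows: list[tuple]) -> dict[tuple, list[int]]:
    """Return each duplicated row together with the positions where it occurs."""
    positions = defaultdict(list)
    for index, row in enumerate(rows):
        positions[row].append(index)  # one hash lookup per row
    return {row: idx for row, idx in positions.items() if len(idx) > 1}
```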
AI Data Quality AI (5/6)
Reasonable values / normality
(31) Record Anomaly
(32) Record Count Anomaly
(33) Numerical Column Drift
(34) String Column Unique Value Drift
(35) String Column Value Drift
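The slides do not say how drift is measured, so the following is one simple possibility rather than the method used by the platform: numerical drift (33) as the shift of the current mean in units of the baseline standard deviation, and string unique-value drift (34) as the set of values not seen in the baseline. The function names and the default threshold of 3.0 are assumptions.

```python
import statistics


def numeric_drift(baseline: list[float], current: list[float], threshold: float = 3.0) -> dict:
    """Flag drift when the current mean moves more than `threshold`
    baseline standard deviations away from the baseline mean.

    Needs at least two baseline values and one current value.
    """
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.fmean(current) - base_mean)
    if base_std > 0:
        score = shift / base_std
    else:
        # Constant baseline: any change at all counts as unbounded drift.
        score = 0.0 if shift == 0 else float("inf")
    return {"score": score, "drifted": score > threshold}


def new_string_values(baseline: list[str], current: list[str]) -> set[str]:
    """Return values that appear in the current data but not in the baseline."""
    return set(current) - set(baseline)
```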
AI Data Quality AI (6/6)
Examples of rules generated unsupervised with AI Data Quality AI services
Category | Column / condition | Rule | Parameters
Uniqueness | Loan Number | Cannot be duplicate |
Completeness | Loan Closing Date | Cannot be Null |
Conformity | Loan Closing Date | Valid Format | yyyyMMdd
Validity | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 | THEN `Investor Type`=3 |
Drift | Income Documentation | Acceptable Values | 1,3,4,6
Timeliness | First_Payment_Date | Must be within 90 days of the Loan_Closing_Date |
Consistency | Difference Original_Credit_Score – Current-Credit_Score | Must be within range | Lower_Limit: -221.2, Upper_Limit: 207.2
Accuracy | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 AND `Investor Type`=3 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 260853
Accuracy | IF `Property State`=CT AND `Loan Source`=2 AND `Product Type`=6 AND `Investor Type`=7 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 1
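To show how rules like these might be executed against a record, here is a minimal sketch covering the Conformity, Timeliness and Validity examples. The record layout (a dictionary keyed by the column names used in the examples), the function names and the reading of "within 90 days" as an absolute difference are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


def parse_yyyymmdd(text: str) -> Optional[datetime]:
    """Parse a strict 8-digit yyyyMMdd date; return None if it does not conform."""
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return None


def check_loan(record: dict) -> list[str]:
    """Return the categories of the rules that this record fails."""
    failures = []
    closing = parse_yyyymmdd(record.get("Loan Closing Date") or "")
    if closing is None:
        failures.append("Conformity")

    first_payment = parse_yyyymmdd(record.get("First_Payment_Date") or "")
    # Assumption: "within 90 days" means at most 90 days apart in either direction.
    if closing and first_payment and abs((first_payment - closing).days) > 90:
        failures.append("Timeliness")

    # Conditional rule: IF state=GA AND source=4 AND product=1 THEN investor=3.
    if (record.get("Property State") == "GA"
            and record.get("Loan Source") == 4
            and record.get("Product Type") == 1
            and record.get("Investor Type") != 3):
        failures.append("Validity")
    return failures
```

A range rule such as the Accuracy examples would follow the same shape: test the condition, then compare the value against `Lower_Limit` and `Upper_Limit`.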
AI Cross System Checks (1/3)
AI Cross System Checks (2/3)
AI Cross System Checks (3/3)
AI Data Quality Rules Generation
AI Golden Records
Metrics
Category | Description
Column Profiling | What are the data’s physical characteristics? Across multiple tables?
Relationship | What relationships exist in the data set? Across multiple tables?
Redundancy | What data is redundant? Orphan analysis
Completeness | What data is missing or unusable?
Conformity | What data is stored in a non-standard format?
Consistency | What data gives conflicting information?
Accuracy | What data is incorrect or out of date?
Duplication | What data records are duplicated?
Integrity | What data is missing important relationship linkages?
Range | What scores, values, calculations are outside of range?
AI Data Quality Dashboard (1/4)
Data Quality Metrics
AI Data Quality Dashboard (1/3)
Kibana-based dashboards:
- easy to build/change but limited customisation
- drill down functionality
- can be displayed on office screens
[Figure: dashboard screenshot with callouts for Aggregation, Visualisation and Columns]
AI Data Quality Dashboard (1/3)
[Figure: Pivot Table (OLAP) with Filters, full customisation]
AI Data Quality Dashboard (2/3)
Drill-down functionality to get to the bottom of a problem via preset paths and ad hoc
investigation
AI Data Quality Dashboard (3/3)
- 3D Graph chart to visualize and explore the inter-relationship between records / columns
- Integral part of AI Data Quality Knowledge Base

Data Quality with AI

  • 1.  AI Data Quality solution  Nuclio (open source)  High Performance  No Lock-Ins  GPU Support  Data Profiling and Rules configuration  Data Quality Rules generation and execution  AI assisted Data Quality Reconciliation  Golden Records  Cross System Checks  Data Quality Dashboard with Pivot  Daily/Weekly Root-Cause Analysis Dashboards with generated narrative (Stories)  Data Quality AI-led User Workflows/Writeback  3D Graph visualisation of relations 2. Data Quality AI services 1. Serverless platform 3. Analytics and User Workspace
  • 3.  Why Nuclio (open source serverless platform):  The only serverless framework with GPU support and fast file access  High-performance parallel execution engine  Running models as a function in a serving layer (instead of running it in a 3rd party container )  Easy to use interface for controlling GPU resources per function Serverless Platform (1/2)
  • 4. AI Data Profiling (1/2) Data Profiling can be run interactively by the user or in unsupervised mode to produce suggestions and to generate Data Quality rules and execution settings (thresholds, patterns, normality)
  • 6. AI Data Quality AI (1/6) Why another Data Quality platform?
      Non-AI Data Quality | ML Core Data Quality AI
      Linear Processing | Parallel and real-time / stream processing with serverless micro-service architecture
      Regular Algorithms | Superior parallel algorithms with lower complexity than N**2 are required for truly scalable applications
      High Maintenance Rules | The Data Quality team will manage services / algorithms / bots, not data
      Manual Operations. Updates and maintenance are sporadic, error-prone, resource constrained | Any scalable system must perform the vast majority of its operations automatically. Only AI can scale to the level required by large enterprises.
  • 7. AI Data Quality AI (2/6) AI Data Quality AI services will use probabilistic algorithms applied uniquely to every input data set:
      - Rules identification and generation
      - Hyper Fingerprinting: a unique profile signature based on column properties
      - New data identification and automatic rules execution
      - Numerical and string drift in the existing data
      - Record Anomalies identification
      - Fingerprint Maintenance/Evolution via DQ Knowledge Graph
  • 8. How fingerprinting works AI Data Quality AI (3/6)
  • 9. AI Data Quality AI (4/6)
      Single Columns – Cardinalities
        (1) Number of rows
        (2) Number of null values
        (3) Percentage of null values
        (4) Number of distinct values; sometimes called “cardinality”
        (5) Number of distinct values divided by the number of rows
      Single Columns – Value distributions
        (6) Frequency histograms (equi-width, equi-depth)
        (7) Minimum and maximum values in a numeric column
        (8) Constancy: Frequency of most frequent value divided by number of rows
        (9) Quartiles: 3 points that divide the (numeric) values into 4 equal groups
        (10) Distribution of first digit in numeric values; to check Benford’s law
      Single Columns – Patterns, data types, and domains
        (11) Basic type (e.g., numeric, alphanumeric, date, time)
        (12) DBMS-specific data type (e.g., varchar, timestamp)
        (13) Measurement of value length (minimum, maximum, average, and median)
        (14) Maximum number of digits in numeric values
        (15) Maximum number of decimals in numeric values
        (16) Histogram of value patterns (Aa9...)
        (17) Generic semantic data type (e.g., code, date/time, quantity, identifier)
        (18) Semantic domain (e.g., credit card, first name, city)
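Several of the single-column metrics on slide 9 can be sketched with the Python standard library alone. This is an illustrative sketch, not the platform's implementation: the names `profile_column` and `value_pattern` are hypothetical, nulls are assumed to arrive as `None`, and the "Aa9" pattern encoding (uppercase to `A`, lowercase to `a`, digit to `9`) is one plausible reading of metric (16).

```python
from collections import Counter
import statistics


def value_pattern(value: str) -> str:
    """Map a string to its shape: uppercase -> A, lowercase -> a, digit -> 9."""
    shape = []
    for ch in value:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("9")
        else:
            shape.append(ch)  # keep punctuation and spaces as they are
    return "".join(shape)


def profile_column(values):
    """Compute metrics (1)-(5), (7)-(10) and (16) for one column (None = null)."""
    rows = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    profile = {
        "rows": rows,                                           # (1)
        "nulls": rows - len(non_null),                          # (2)
        "null_pct": 100.0 * (rows - len(non_null)) / rows if rows else 0.0,  # (3)
        "distinct": len(counts),                                # (4)
        "uniqueness": len(counts) / rows if rows else 0.0,      # (5)
        "constancy": counts.most_common(1)[0][1] / rows if counts else 0.0,  # (8)
    }
    numeric = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(non_null):
        profile["min"], profile["max"] = min(numeric), max(numeric)  # (7)
        if len(numeric) >= 2:
            profile["quartiles"] = statistics.quantiles(numeric, n=4)  # (9)
        # (10) leading non-zero digit, the input to a Benford's-law check
        profile["first_digits"] = Counter(
            str(abs(v)).lstrip("0.")[0] for v in numeric if v != 0
        )
    else:
        # (16) histogram of value patterns for non-numeric columns
        profile["patterns"] = Counter(value_pattern(str(v)) for v in non_null)
    return profile
```

For example, `profile_column(["AB12", "cd34", None, "AB12"])` reports 4 rows, 1 null, 2 distinct values and the pattern histogram `{"AA99": 2, "aa99": 1}`.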
  • 10. AI Data Quality AI (5/6)
      Dependencies
        (19) Unique column combinations (key discovery)
        (20) Relaxed unique column combinations
        (21) Inclusion dependencies (foreign key discovery)
        (22) Relaxed inclusion dependencies
        (23) Functional dependencies
        (24) Conditional functional dependencies
      Advanced Multi-Column profiling
        (25) Correlation analysis
        (26) Association rule mining
        (27) Cluster analysis
        (28) Outlier detection
        (29) Exact duplicate tuple detection
        (30) Relaxed duplicate tuple detection
      Reasonable values / normality
        (31) Record Anomaly
        (32) Record Count Anomaly
        (33) Numerical Column Drift
        (34) String column Unique Value Drift
        (35) String column value drift
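Two of the dependency checks on slide 10, unique column combinations (19) and functional dependencies (23), can be illustrated as follows. This is a brute-force sketch under stated assumptions: rows are plain dictionaries, the function names and the sample columns are hypothetical, and enumerating column combinations grows exponentially with `max_size`, so it only shows what the checks mean and is not one of the scalable algorithms the deck calls for.

```python
from itertools import combinations


def unique_column_combinations(rows, max_size=2):
    """Return minimal column sets (up to max_size) whose values are unique per row."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a key that was already found (not minimal).
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found


def holds_functional_dependency(rows, lhs, rhs):
    """Check lhs -> rhs: equal lhs values must always imply equal rhs values."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample data for illustration only.
loans = [
    {"loan_number": 1, "state": "GA", "region": "South"},
    {"loan_number": 2, "state": "GA", "region": "South"},
    {"loan_number": 3, "state": "CT", "region": "Northeast"},
]
keys = unique_column_combinations(loans)
state_determines_region = holds_functional_dependency(loans, ["state"], "region")
```

On this sample, `loan_number` is the only key found, and `state` determines `region`. The "relaxed" variants (20), (22) and (24) would tolerate a small share of violating rows instead of requiring none.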
  • 11. AI Data Quality AI (6/6) Examples of rules generated unsupervised with AI Data Quality AI services
      Category | Column / condition | Rule | Parameters
      Uniqueness | Loan Number | Cannot be duplicate |
      Completeness | Loan Closing Date | Cannot be Null |
      Conformity | Loan Closing Date | Valid Format | yyyyMMdd
      Validity | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 | THEN `Investor Type`=3 |
      Drift | Income Documentation | Acceptable Values | 1,3,4,6
      Timeliness | First_Payment_Date | Must be within 90 days of the Loan_Closing_Date |
      Consistency | Original_Credit_Score – Current-Credit_Score | Difference must be within limits | Lower_Limit: -221.2, Upper_Limit: 207.2
      Accuracy | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 AND `Investor Type`=3 | THEN `Unpaid Principal Balance` must be within range | Lower_Limit: 0, Upper_Limit: 260853
      Accuracy | IF `Property State`=CT AND `Loan Source`=2 AND `Product Type`=6 AND `Investor Type`=7 | THEN `Unpaid Principal Balance` must be within range | Lower_Limit: 0, Upper_Limit: 1
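To make the rule categories on slide 11 concrete, the sketch below evaluates four of them (Uniqueness, Completeness, Conformity, Timeliness) against a list of records. It is an illustration under assumptions, not the generated rules engine: the function names are hypothetical, records are assumed to be dictionaries keyed by the column names shown on the slide, and "within 90 days" is read as an absolute difference of at most 90 days in either direction.

```python
from collections import Counter
from datetime import datetime


def parse_yyyymmdd(text):
    """Return a date for a strict 8-digit yyyyMMdd string, otherwise None."""
    # strptime alone accepts some shorter inputs, so check the shape first.
    if not isinstance(text, str) or len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None


def check_loans(records):
    """Return (row index, rule category) pairs for every violation found."""
    violations = []
    loan_counts = Counter(r.get("Loan Number") for r in records)
    for i, r in enumerate(records):
        # Uniqueness: Loan Number cannot be duplicated.
        if loan_counts[r.get("Loan Number")] > 1:
            violations.append((i, "Uniqueness"))
        # Completeness: Loan Closing Date cannot be null.
        closing_raw = r.get("Loan Closing Date")
        if closing_raw is None:
            violations.append((i, "Completeness"))
            continue
        # Conformity: Loan Closing Date must be a valid yyyyMMdd value.
        closing = parse_yyyymmdd(closing_raw)
        if closing is None:
            violations.append((i, "Conformity"))
            continue
        # Timeliness: first payment within 90 days of closing (assumed absolute).
        first_payment = parse_yyyymmdd(r.get("First_Payment_Date"))
        if first_payment is not None and abs((first_payment - closing).days) > 90:
            violations.append((i, "Timeliness"))
    return violations
```

The conditional Validity and Accuracy rules would follow the same shape: test the IF condition on each record, then compare the target column against the expected value or range.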
  • 12. AI Cross System Checks (1/3)
  • 13. AI Cross System Checks (2/3)
  • 14. AI Cross System Checks (3/3)
  • 15. AI Data Quality Rules Generation
  • 17. AI Data Quality Dashboard (1/4) Data Quality Metrics
      Metrics Category | Description
      Column Profiling | What are the data’s physical characteristics? Across multiple tables?
      Relationship | What relationships exist in the data set? Across multiple tables?
      Redundancy | What data is redundant? (Orphan Analysis)
      Completeness | What data is missing or unusable?
      Conformity | What data is stored in a non-standard format?
      Consistency | What data gives conflicting information?
      Accuracy | What data is incorrect or out of date?
      Duplication | What data records are duplicated?
      Integrity | What data is missing important relationship linkages?
      Range | What scores, values, calculations are outside of range?
  • 18. AI Data Quality Dashboard (1/3) Kibana-based dashboards: - easy to build/change but limited customisation - drill-down functionality - can be displayed on office screens
  • 19. AI Data Quality Dashboard (1/3) Pivot Table (OLAP) (full customisation). Screenshot labels: Aggregation, Visualisation, Columns, Filters
  • 20. AI Data Quality Dashboard (2/3) Click! Click! Click! Drill-down functionality to get to the bottom of a problem via preset paths and ad hoc investigation
  • 21. AI Data Quality Dashboard (3/3) - 3D Graph chart to visualize and explore the inter-relationship between records / columns - Integral part of AI Data Quality Knowledge Base