AI Data Quality solution
1. Serverless platform
- Nuclio (open source)
- High Performance
- No Lock-Ins
- GPU Support
2. Data Quality AI services
- Data Profiling and Rules configuration
- Data Quality Rules generation and execution
- AI assisted Data Quality Reconciliation
- Golden Records
- Cross System Checks
3. Analytics and User Workspace
- Data Quality Dashboard with Pivot
- Daily/Weekly Root-Cause Analysis Dashboards with generated narrative (Stories)
- Data Quality AI-led User Workflows/Writeback
- 3D Graph visualisation of relations
Serverless Platform (1/2)
Why Nuclio (open source serverless platform):
- The only serverless framework with GPU support and fast file access
- High-performance parallel execution engine
- Running models as a function in a serving layer (instead of running them in a 3rd-party container)
- Easy-to-use interface for controlling GPU resources per function
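A minimal sketch of what a profiling step could look like as a serverless function. It assumes the `handler(context, event)` entry point that Nuclio uses for Python functions; the JSON payload shape (a list of records) and the null-rate calculation are hypothetical and are not taken from the slides.

```python
import json


def handler(context, event):
    """Nuclio-style entry point: report the null percentage per column."""
    body = event.body
    # Assumption: the body arrives as bytes or text holding a JSON list of
    # records; an already-parsed list is passed through unchanged.
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    records = json.loads(body) if isinstance(body, str) else body

    columns = sorted({key for record in records for key in record})
    total = len(records)
    null_pct = {}
    for col in columns:
        nulls = sum(1 for record in records if record.get(col) is None)
        null_pct[col] = 100.0 * nulls / total if total else 0.0

    context.logger.info("profiled %d records" % total)
    return json.dumps({"rows": total, "null_pct": null_pct})
```

Keeping the function stateless means the platform can run many copies in parallel, one per incoming batch.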
Serverless Platform (1/2)
AI Data Profiling (1/2)
Data Profiling can be run interactively by the user or in unsupervised mode to produce suggestions and to generate Data Quality rules and execution settings (thresholds, patterns, normality).
AI Data Profiling (2/2)
AI Data Quality AI (1/6)
Why another Data Quality platform?
Non-AI Data Quality | ML Core Data Quality AI
Linear Processing | Parallel and real-time / stream processing with serverless micro-service architecture
Regular Algorithms | Superior parallel algorithms with lower complexity than N**2 are required for truly scalable applications
High Maintenance Rules | Data Quality team will manage services / algorithms / bots, not data
Manual Operations. Updates and maintenance are sporadic, error-prone, resource constrained | Any scalable system must perform the vast majority of its operations automatically. Only AI can scale to the level required by large enterprises.
AI Data Quality AI (2/6)
AI Data Quality AI services will use probabilistic algorithms applied uniquely to every input data set:
- Rules identification and generation
- Hyper Fingerprinting
  - a unique profile signature based on column properties
- New data identification and automatic rules execution
- Numerical and string drift in the existing data
- Record Anomalies identification
- Fingerprint Maintenance/Evolution via DQ Knowledge Graph
How fingerprinting works
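The slides describe a fingerprint only as "a unique profile signature based on column properties". One way to picture that idea is to summarise a column into a few coarse properties and hash them. The choice of properties, the rounding, and the function names below are assumptions for illustration, not the platform's actual algorithm.

```python
import hashlib
import json


def column_properties(values):
    """Summarise a column into coarse, order-independent properties."""
    non_null = [v for v in values if v is not None]
    n = len(values)
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    lengths = [len(str(v)) for v in non_null]
    return {
        "basic_type": "numeric" if numeric else "string",
        # Ratios are rounded to one decimal so that small batch-to-batch
        # noise does not change the signature.
        "null_ratio": round((n - len(non_null)) / n, 1) if n else 0.0,
        "distinct_ratio": round(len(set(non_null)) / len(non_null), 1) if non_null else 0.0,
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
    }


def fingerprint(values):
    """Hash the property summary into a short, comparable signature."""
    payload = json.dumps(column_properties(values), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Under this sketch, a new column whose fingerprint matches a known one could inherit the rules already attached to it, which is one reading of "new data identification and automatic rules execution".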
AI Data Quality AI (3/6)
AI Data Quality AI (4/6)
Single Columns – Cardinalities
(1) Number of rows
(2) Number of null values
(3) Percentage of null values
(4) Number of distinct values; sometimes called “cardinality”
(5) Number of distinct values divided by the number of rows
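Metrics (1) to (5) can be computed in a single pass over a column. The sketch below assumes a column is a plain Python list with `None` for nulls and that nulls are excluded from the distinct count; both are illustrative choices.

```python
def cardinality_profile(values):
    """Compute single-column cardinality metrics (1) to (5)."""
    rows = len(values)
    nulls = sum(1 for v in values if v is None)
    # Assumption: null is not counted as a distinct value.
    distinct = len({v for v in values if v is not None})
    return {
        "rows": rows,
        "nulls": nulls,
        "null_pct": 100.0 * nulls / rows if rows else 0.0,
        "distinct": distinct,
        "uniqueness": distinct / rows if rows else 0.0,
    }
```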
Single Columns - Value distributions
(6) Frequency histograms (equi-width, equi-depth)
(7) Minimum and maximum values in a numeric column
(8) Constancy: Frequency of most frequent value divided by number of rows
(9) Quartiles: 3 points that divide the (numeric) values into 4 equal groups
(10) Distribution of first digit in numeric values; to check Benford’s law
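A rough sketch of a few of these distribution metrics using only the standard library. The function names are hypothetical, and the quartile method (`inclusive`) is one of several reasonable conventions. Benford's law predicts that leading digit d appears with probability log10(1 + 1/d).

```python
import math
import statistics
from collections import Counter


def equi_width_histogram(values, bins):
    """(6) Count values per bin, with all bins the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def constancy(values):
    """(8) Frequency of the most frequent value divided by the row count."""
    return Counter(values).most_common(1)[0][1] / len(values)


def quartiles(values):
    """(9) The three cut points that split the values into four groups."""
    return statistics.quantiles(values, n=4, method="inclusive")


def first_digit_distribution(values):
    """(10) Observed share of each leading digit 1..9, ignoring zeros."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_expected(digit):
    """Share of leading digit `digit` predicted by Benford's law."""
    return math.log10(1 + 1 / digit)
```

Comparing `first_digit_distribution` with `benford_expected` digit by digit gives a simple signal for columns such as amounts, where a large deviation may indicate fabricated or truncated values.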
Single Columns - Patterns, data types, and domains
(11) Basic type (e.g., numeric, alphanumeric, date, time)
(12) DBMS-specific data type (e.g., varchar, timestamp)
(13) Measurement of value length (minimum, maximum, average, and median)
(14) Maximum number of digits in numeric values
(15) Maximum number of decimals in numeric values
(16) Histogram of value patterns (Aa9...)
(17) Generic semantic data type (e.g., code, date/time, quantity, identifier)
(18) Semantic domain (e.g., credit card, first name, city)
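The "Aa9..." notation in (16) maps each character to a class: upper-case letter, lower-case letter, or digit. A minimal sketch of that mapping and of the length statistics in (13) follows; keeping punctuation unchanged is an assumption.

```python
import statistics
from collections import Counter


def value_pattern(value):
    """Map a value to its pattern: A = upper, a = lower, 9 = digit."""
    out = []
    for ch in str(value):
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # assumption: punctuation is kept as-is
    return "".join(out)


def pattern_histogram(values):
    """(16) Count how often each value pattern occurs, skipping nulls."""
    return Counter(value_pattern(v) for v in values if v is not None)


def length_stats(values):
    """(13) Minimum, maximum, average, and median value length."""
    lengths = [len(str(v)) for v in values if v is not None]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "avg": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
```

A dominant pattern such as `99999999` in a date column is the kind of evidence from which a conformity rule (for example, format yyyyMMdd) could be proposed.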
Dependencies
(19) Unique column combinations (key discovery)
(20) Relaxed unique column combinations
(21) Inclusion dependencies (foreign key discovery)
(22) Relaxed inclusion dependencies
(23) Functional dependencies
(24) Conditional functional dependencies
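Two of these dependency checks are easy to sketch for rows held as dictionaries. The violation ratio below is one common way to express a "relaxed" dependency: the share of rows that would have to be removed for the dependency to hold exactly. Function and column names are hypothetical.

```python
from collections import Counter, defaultdict


def is_unique_combination(rows, columns):
    """(19) True if no two rows share the same values in `columns`."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(set(keys)) == len(keys)


def fd_violation_ratio(rows, lhs, rhs):
    """(23) Share of rows to remove so that lhs -> rhs holds exactly.

    A ratio of 0.0 means the functional dependency holds; a small
    non-zero ratio can be read as a relaxed dependency.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[c] for c in lhs)][row[rhs]] += 1
    # Within each left-hand-side group, keep only the majority value.
    kept = sum(counter.most_common(1)[0][1] for counter in groups.values())
    return 1.0 - kept / len(rows)
```

Both functions group by hashing, so they run in roughly linear time in the number of rows instead of comparing every pair of rows.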
Advanced Multi-Column profiling
(25) Correlation analysis
(26) Association rule mining
(27) Cluster analysis
(28) Outlier detection
(29) Exact duplicate tuple detection
(30) Relaxed duplicate tuple detection
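Exact and relaxed duplicate detection, (29) and (30), can share one routine that buckets rows by a key. In this sketch "relaxed" only means ignoring case and extra whitespace; real relaxed matching would likely use fuzzier similarity, so treat the normalisation as an assumption.

```python
from collections import defaultdict


def normalise(value):
    """Lower-case a value and collapse runs of whitespace."""
    return " ".join(str(value).lower().split())


def duplicate_groups(rows, columns, relaxed=False):
    """Return lists of row indexes that share the same key on `columns`."""
    buckets = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(normalise(row[c]) if relaxed else row[c] for c in columns)
        buckets[key].append(index)
    return [indexes for indexes in buckets.values() if len(indexes) > 1]
```

Bucketing avoids the pairwise N**2 comparison that the deck calls out as a scalability limit.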
AI Data Quality AI (5/6)
Reasonable values / normality
(31) Record Anomaly
(32) Record Count Anomaly
(33) Numerical Column Drift
(34) String column Unique Value Drift
(35) String column value drift
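The slides do not say how anomalies and drift are scored. As a simple illustration, the sketch below uses basic statistics: a mean shift measured in baseline standard errors for (33), a set difference for (34), and a k-sigma band for (32). The thresholds and function names are assumptions.

```python
import math
import statistics


def numeric_drift_score(baseline, batch):
    """(33) Shift of the batch mean, in baseline standard errors."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    if sigma == 0:
        return 0.0 if statistics.mean(batch) == mu else math.inf
    return abs(statistics.mean(batch) - mu) / (sigma / math.sqrt(len(batch)))


def unseen_values(baseline, batch):
    """(34) Values in the new batch that never occurred in the baseline."""
    return sorted(set(batch) - set(baseline))


def record_count_anomaly(history, count, k=3.0):
    """(32) True if `count` lies more than k standard deviations from the
    historical mean of record counts."""
    mu = statistics.mean(history)
    sigma = statistics.pstdev(history)
    return abs(count - mu) > k * sigma
```

For example, with baseline acceptable values 1, 3, 4, 6 for a documentation code, a batch containing 5 would be reported by `unseen_values` as unique-value drift.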
Category | Column / condition | Rule | Parameters
Uniqueness | Loan Number | Cannot be duplicate |
Completeness | Loan Closing Date | Cannot be Null |
Conformity | Loan Closing Date | Valid Format | yyyyMMdd
Validity | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 | THEN `Investor Type`=3 |
Drift | Income Documentation | Acceptable Values | 1,3,4,6
Timeliness | First_Payment_Date | Must be within 90 days of the Loan_Closing_Date |
Consistency | Differences between Original_Credit_Score – Current-Credit_Score | Must fall within the limits | Lower_Limit: -221.2, Upper_Limit: 207.2
Accuracy | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 AND `Investor Type`=3 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 260853
Accuracy | IF `Property State`=CT AND `Loan Source`=2 AND `Product Type`=6 AND `Investor Type`=7 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 1
AI Data Quality AI (6/6)
Examples of rules generated unsupervised with AI Data Quality AI services
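To make the rule examples concrete, here is a sketch of how a few of them might be executed against loan records held as dictionaries. The record keys follow the column names in the examples, but the exact key spelling, the strict eight-digit date check, and the symmetric reading of "within 90 days" are assumptions.

```python
from collections import Counter
from datetime import datetime


def parse_yyyymmdd(text):
    """Return a datetime for a strict yyyyMMdd string, else None."""
    if not isinstance(text, str) or len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return None


def check_record(record):
    """Return the rule categories that a single record fails."""
    failures = []
    raw_closing = record.get("Loan Closing Date")
    closing = parse_yyyymmdd(raw_closing)
    if raw_closing is None:
        failures.append("Completeness")
    elif closing is None:
        failures.append("Conformity")

    first_payment = parse_yyyymmdd(record.get("First_Payment_Date"))
    if closing and first_payment and abs((first_payment - closing).days) > 90:
        failures.append("Timeliness")

    if (record.get("Property State") == "GA" and record.get("Loan Source") == 4
            and record.get("Product Type") == 1 and record.get("Investor Type") != 3):
        failures.append("Validity")
    return failures


def duplicate_loan_numbers(records):
    """Uniqueness: loan numbers that appear more than once."""
    counts = Counter(record.get("Loan Number") for record in records)
    return sorted(number for number, n in counts.items() if n > 1)
```

Range rules such as the Accuracy examples would follow the same shape: test the IF condition, then compare the target column against the lower and upper limits.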
AI Cross System Checks (1/3)
AI Cross System Checks (2/3)
AI Cross System Checks (3/3)
AI Data Quality Rules Generation
AI Golden Records
Metrics
Category | Description
Column Profiling | What are the data’s physical characteristics? Across multiple tables?
Relationship | What relationships exist in the data set? Across multiple tables?
Redundancy | What data is redundant? Orphan analysis
Completeness | What data is missing or unusable?
Conformity | What data is stored in a non-standard format?
Consistency | What data gives conflicting information?
Accuracy | What data is incorrect or out of date?
Duplication | What data records are duplicated?
Integrity | What data is missing important relationship linkages?
Range | What scores, values, calculations are outside of range?
AI Data Quality Dashboard (1/4)
Data Quality Metrics
AI Data Quality Dashboard (1/3)
Kibana-based dashboards:
- easy to build/change but limited customisation
- drill down functionality
- can be displayed on office screens
AI Data Quality Dashboard (1/3)
Pivot Table (OLAP) (full customisation)
[Screenshot callouts: Filters, Columns, Aggregation, Visualisation]
AI Data Quality Dashboard (2/3)
Drill-down functionality to get to the bottom of a problem via preset paths and ad hoc investigation
AI Data Quality Dashboard (3/3)
- 3D Graph chart to visualize and explore the inter-relationship between records / columns
- Integral part of AI Data Quality Knowledge Base





Data Quality with AI

  • 1.  AI Data Quality solution  Nuclio (open source)  High Performance  No Lock-Ins  GPU Support  Data Profiling and Rules configuration  Data Quality Rules generation and execution  AI assisted Data Quality Reconciliation  Golden Records  Cross System Checks  Data Quality Dashboard with Pivot  Daily/Weekly Root-Cause Analysis Dashboards with generated narrative (Stories)  Data Quality AI-led User Workflows/Writeback  3D Graph visualisation of relations 2. Data Quality AI services 1. Serverless platform 3. Analytics and User Workspace
  • 3.  Why Nuclio (open source serverless platform):  The only serverless framework with GPU support and fast file access  High-performance parallel execution engine  Running models as a function in a serving layer (instead of running it in a 3rd party container )  Easy to use interface for controlling GPU resources per function Serverless Platform (1/2)
  • 4. AI Data Profiling (1/2) Data Profiling can be run interactively by User and in unsupervised mode to produce suggestions, generate Data Quality rules and execution settings (thresholds, patters, normality)
  • 6. AI Data Quality AI (1/6) Why another Data Quality platform? Non-AI Data Quality ML Core Data Quality AI Linear Processing Parallel and real-time / stream processing with serverless micro-service architecture Regular Algorithms Superior parallel algorithms with lower complexity than N**2 is required for truly scalable applications High Maintenance Rules Data Quality team will manage services / algorithms / bots not Data Manual Operations. Updates and maintenance are sporadic, error-prone, resource constrained Any scalable system must perform the vast majority of its operations automatically. Only AI can scale to the level required by large enterprises.
  • 7. AI Data Quality AI (2/6) AI Data Quality AI services will use probabilistic algorithms applied uniquely to every input data set:  Rules identification and generation  Hyper Fingerprinting  a unique profile signature based on column properties  New data identification and automatic rules execution  Numerical and string drift in the existing data  Record Anomalies identification  Fingerprint Maintenance/Evolution via DQ Knowledge Graph
  • 8. AI Data Quality AI (3/6) How fingerprinting works
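The transcript does not preserve how the fingerprint is actually computed. As a rough illustration only, a "profile signature based on column properties" could be sketched by hashing a few coarse properties of a column. The function name `column_fingerprint` and the choice of properties are assumptions made for this example, not the deck's method.

```python
import hashlib
import json


def column_fingerprint(values):
    """Hypothetical fingerprint: SHA-256 over a few coarse column properties."""
    non_null = [v for v in values if v is not None]
    # Treat the column as numeric only if every non-null value is int or float.
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    lengths = [len(str(v)) for v in non_null]
    properties = {
        "basic_type": "numeric" if numeric else "alphanumeric",
        "has_nulls": len(non_null) < len(values),
        "is_unique": len(set(non_null)) == len(non_null),
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }
    # Sorted keys make the serialisation, and therefore the hash, deterministic.
    canonical = json.dumps(properties, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


known = column_fingerprint(["GA", "CT", "NY"])
incoming = column_fingerprint(["TX", "CA", "WA"])
print(known == incoming)  # same column properties, so the signatures match
```

Two columns with the same properties get the same signature even when their values differ, which is what would let newly arrived data be matched to an already profiled column and its rules.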
  • 9. AI Data Quality AI (4/6)
    Single Columns - Cardinalities
    (1) Number of rows
    (2) Number of null values
    (3) Percentage of null values
    (4) Number of distinct values; sometimes called "cardinality"
    (5) Number of distinct values divided by the number of rows
    Single Columns - Value distributions
    (6) Frequency histograms (equi-width, equi-depth)
    (7) Minimum and maximum values in a numeric column
    (8) Constancy: frequency of most frequent value divided by number of rows
    (9) Quartiles: 3 points that divide the (numeric) values into 4 equal groups
    (10) Distribution of first digit in numeric values; to check Benford's law
    Single Columns - Patterns, data types, and domains
    (11) Basic type (e.g., numeric, alphanumeric, date, time)
    (12) DBMS-specific data type (e.g., varchar, timestamp)
    (13) Measurement of value length (minimum, maximum, average, and median)
    (14) Maximum number of digits in numeric values
    (15) Maximum number of decimals in numeric values
    (16) Histogram of value patterns (Aa9...)
    (17) Generic semantic data type (e.g., code, date/time, quantity, identifier)
    (18) Semantic domain (e.g., credit card, first name, city)
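Several of the single-column metrics above are simple to compute. The following is a minimal sketch covering metrics (1) to (5), (8), (10) and (16); the function names are illustrative assumptions and nulls are assumed to be represented as `None`.

```python
from collections import Counter


def profile_column(values):
    """Cardinality metrics (1)-(5) and constancy (8) for one column."""
    rows = len(values)
    non_null = [v for v in values if v is not None]
    nulls = rows - len(non_null)
    counts = Counter(non_null)
    top_count = counts.most_common(1)[0][1] if counts else 0
    return {
        "rows": rows,
        "nulls": nulls,
        "null_pct": 100.0 * nulls / rows if rows else 0.0,
        "distinct": len(counts),
        "uniqueness": len(counts) / rows if rows else 0.0,
        "constancy": top_count / rows if rows else 0.0,
    }


def value_pattern(value):
    """Metric (16): uppercase letter -> A, lowercase letter -> a, digit -> 9."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # punctuation and spaces are kept as they are
    return "".join(out)


def first_digits(numbers):
    """Metric (10): count the leading non-zero digit of each number."""
    counts = Counter()
    for n in numbers:
        for ch in str(abs(n)):
            if ch in "123456789":
                counts[int(ch)] += 1
                break  # zero values have no leading digit and are skipped
    return counts


print(profile_column(["GA", "GA", "CT", None]))
print(Counter(value_pattern(v) for v in ["AB-12", "CD-34", "x9"]))
print(first_digits([120, 0.05, -37, 0, 19]))
```

The first-digit counts can then be compared with the distribution expected under Benford's law; that comparison is left out here.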
  • 10. AI Data Quality AI (5/6)
    Dependencies
    (19) Unique column combinations (key discovery)
    (20) Relaxed unique column combinations
    (21) Inclusion dependencies (foreign key discovery)
    (22) Relaxed inclusion dependencies
    (23) Functional dependencies
    (24) Conditional functional dependencies
    Advanced Multi-Column profiling
    (25) Correlation analysis
    (26) Association rule mining
    (27) Cluster analysis
    (28) Outlier detection
    (29) Exact duplicate tuple detection
    (30) Relaxed duplicate tuple detection
    Reasonable values / normality
    (31) Record Anomaly
    (32) Record Count Anomaly
    (33) Numerical Column Drift
    (34) String Column Unique Value Drift
    (35) String Column Value Drift
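Two of the dependency checks above, unique column combinations (19) and functional dependencies (23), can be illustrated with a brute-force check on a candidate. This is a sketch under the assumption that rows are dictionaries; the function names and sample columns are hypothetical, and real discovery algorithms search the candidate space far more efficiently.

```python
def is_unique_combination(rows, columns):
    """(19) True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def fd_violations(rows, lhs, rhs):
    """(23) For the candidate dependency lhs -> rhs, return every lhs key
    that maps to more than one rhs value. An empty result means it holds."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        mapping.setdefault(key, set()).add(row[rhs])
    return {key: vals for key, vals in mapping.items() if len(vals) > 1}


rows = [
    {"loan": 1, "zip": "30301", "state": "GA"},
    {"loan": 2, "zip": "30301", "state": "GA"},
    {"loan": 3, "zip": "06101", "state": "CT"},
    {"loan": 4, "zip": "06101", "state": "NY"},
]
print(is_unique_combination(rows, ["loan"]))  # True: loan is a key candidate
print(fd_violations(rows, ["zip"], "state"))  # zip 06101 maps to two states
```

A "relaxed" variant (20, 22) would tolerate a small share of violating rows instead of requiring none.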
  • 11. AI Data Quality AI (6/6) Examples of rules generated unsupervised with AI Data Quality AI services
    Category | Column / condition | Rule | Parameters
    Uniqueness | Loan Number | Cannot be duplicate |
    Completeness | Loan Closing Date | Cannot be Null |
    Conformity | Loan Closing Date | Valid Format | yyyyMMdd
    Validity | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 | THEN `Investor Type`=3 |
    Drift | Income Documentation | Acceptable Values | 1,3,4,6
    Timeliness | First_Payment_Date | Must be within 90 days of the Loan_Closing_Date |
    Consistency | Difference Original_Credit_Score - Current_Credit_Score | Must be within limits | Lower_Limit: -221.2, Upper_Limit: 207.2
    Accuracy | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 AND `Investor Type`=3 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 260853
    Accuracy | IF `Property State`=CT AND `Loan Source`=2 AND `Product Type`=6 AND `Investor Type`=7 | THEN `Unpaid Principal Balance` will have the following range | Lower_Limit: 0, Upper_Limit: 1
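To make the rule examples concrete, here is a minimal sketch of how a few of them (Uniqueness, Completeness, Conformity, Drift, Timeliness) might be evaluated against loan records. The dictionary keys, the function names, and the reading of "within 90 days" as an absolute difference are assumptions made for illustration; the deck does not show how its services execute rules.

```python
from collections import Counter
from datetime import datetime

ACCEPTABLE_INCOME_DOC = {1, 3, 4, 6}  # from the Drift rule


def parse_yyyymmdd(text):
    """Return a datetime for a strict 8-digit yyyyMMdd string, else None."""
    if not (isinstance(text, str) and len(text) == 8 and text.isdigit()):
        return None
    try:
        return datetime.strptime(text, "%Y%m%d")
    except ValueError:  # e.g. month 13 or day 32
        return None


def check_record(record):
    """Return the rule categories violated by a single loan record."""
    failures = []
    raw_closing = record.get("Loan_Closing_Date")
    closing = parse_yyyymmdd(raw_closing)
    if raw_closing is None:
        failures.append("Completeness")
    elif closing is None:
        failures.append("Conformity")
    if record.get("Income_Documentation") not in ACCEPTABLE_INCOME_DOC:
        failures.append("Drift")
    first_payment = parse_yyyymmdd(record.get("First_Payment_Date"))
    # Timeliness can only be judged when both dates parsed successfully.
    if closing and first_payment and abs((first_payment - closing).days) > 90:
        failures.append("Timeliness")
    return failures


def duplicate_loan_numbers(records):
    """Uniqueness: loan numbers that occur more than once in the data set."""
    counts = Counter(r.get("Loan_Number") for r in records)
    return sorted(n for n, c in counts.items() if c > 1)


good = {"Loan_Number": 1, "Loan_Closing_Date": "20200115",
        "First_Payment_Date": "20200301", "Income_Documentation": 3}
bad = {"Loan_Number": 1, "Loan_Closing_Date": "2020-01-15",
       "First_Payment_Date": "20200301", "Income_Documentation": 2}
print(check_record(good))  # []
print(check_record(bad))   # ['Conformity', 'Drift']
print(duplicate_loan_numbers([good, bad]))  # [1]
```

Record-level rules return a list per record, while Uniqueness needs the whole data set, which is why it is a separate function.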
  • 12. AI Cross System Checks (1/3)
  • 13. AI Cross System Checks (2/3)
  • 14. AI Cross System Checks (3/3)
  • 15. AI Data Quality Rules Generation
  • 17. AI Data Quality Dashboard (1/4) Data Quality Metrics
    Metrics Category | Description
    Column Profiling | What are the data's physical characteristics? Across multiple tables?
    Relationship | What relationships exist in the data set? Across multiple tables?
    Redundancy | What data is redundant? Orphan analysis
    Completeness | What data is missing or unusable?
    Conformity | What data is stored in a non-standard format?
    Consistency | What data gives conflicting information?
    Accuracy | What data is incorrect or out of date?
    Duplication | What data records are duplicated?
    Integrity | What data is missing important relationship linkages?
    Range | What scores, values, calculations are outside of range?
  • 18. AI Data Quality Dashboard (1/3) Kibana-based dashboards:
    - easy to build/change but limited customisation
    - drill-down functionality
    - can be displayed on office screens
  • 19. AI Data Quality Dashboard (1/3) Pivot Table (OLAP) (full customisation). Screenshot labels: Filters, Columns, Aggregation, Visualisation
  • 20. AI Data Quality Dashboard (2/3) Click! Click! Click! Drill-down functionality to get to the bottom of a problem via preset paths and ad hoc investigation
  • 21. AI Data Quality Dashboard (3/3) - 3D Graph chart to visualize and explore the inter-relationship between records / columns - Integral part of AI Data Quality Knowledge Base