Customer Service Analytics - Make Sense of All Your Data.pptx
Data Quality with AI
1. AI Data Quality solution
1. Serverless platform
- Nuclio (open source)
- High Performance
- No Lock-Ins
- GPU Support
2. Data Quality AI services
- Data Profiling and Rules configuration
- Data Quality Rules generation and execution
- AI assisted Data Quality
- Reconciliation
- Golden Records
- Cross System Checks
3. Analytics and User Workspace
- Data Quality Dashboard with Pivot
- Daily/Weekly Root-Cause Analysis
- Dashboards with generated narrative (Stories)
- Data Quality AI-led User Workflows/Writeback
- 3D Graph visualisation of relations
3. Why Nuclio (open source serverless platform):
The only serverless framework with GPU support and fast file access
High-performance parallel execution engine
Running models as a function in a serving layer (instead of running them in a 3rd party container)
Easy to use interface for controlling GPU resources per function
Serverless Platform (1/2)
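To make "running models as a function in a serving layer" concrete, here is a minimal sketch of a Python function in the shape Nuclio expects (`init_context` and `handler(context, event)`). The averaging "model", the `values` payload field, and the `make_local_context` stub are assumptions for illustration only; they are not part of the deck or of Nuclio.

```python
import json
from types import SimpleNamespace


def init_context(context):
    # Called once per worker: load the model here so every invocation reuses it.
    # The "model" is a hypothetical stand-in that averages its inputs.
    context.user_data.model = lambda values: sum(values) / len(values)


def handler(context, event):
    # The request body may arrive as raw bytes/text or as an already-parsed dict.
    body = event.body
    if isinstance(body, (bytes, str)):
        body = json.loads(body)
    score = context.user_data.model(body["values"])
    return context.Response(
        body=json.dumps({"score": score}),
        headers={},
        content_type="application/json",
        status_code=200,
    )


def make_local_context():
    # Hypothetical local stub for the context object injected at runtime,
    # so the handler can be exercised without a Nuclio deployment.
    return SimpleNamespace(
        user_data=SimpleNamespace(),
        Response=lambda **kwargs: SimpleNamespace(**kwargs),
    )
```

Loading the model in `init_context` rather than in `handler` keeps per-request latency low, which is the main point of serving a model as a function.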
4. AI Data Profiling (1/2)
Data Profiling can be run interactively by the User or in unsupervised mode to produce
suggestions and to generate Data Quality rules and execution settings (thresholds, patterns,
normality)
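As an illustration of how profiling could suggest a threshold setting, the sketch below derives a range rule from observed values using Tukey fences (quartiles plus or minus 1.5 times the interquartile range). The choice of Tukey fences, the function names, and the `Lower_Limit` / `Upper_Limit` keys are assumptions; the deck does not say which method the product uses.

```python
import statistics


def suggest_range_rule(values, k=1.5):
    # Tukey fences: [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a common convention,
    # assumed here rather than taken from the deck.
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return {"Lower_Limit": q1 - k * spread, "Upper_Limit": q3 + k * spread}


def violations(values, rule):
    # Values that fall outside the suggested range.
    return [v for v in values if not rule["Lower_Limit"] <= v <= rule["Upper_Limit"]]
```

In interactive mode a user could review and adjust the suggested limits; in unsupervised mode they would be applied as generated.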
6. AI Data Quality AI (1/6)
Why another Data Quality platform?
Non-AI Data Quality | ML Core Data Quality AI
Linear Processing | Parallel and real-time / stream processing with serverless micro-service architecture
Regular Algorithms | Superior parallel algorithms with complexity lower than N**2 are required for truly scalable applications
High Maintenance Rules | Data Quality team will manage services / algorithms / bots, not data
Manual Operations. Updates and maintenance are sporadic, error-prone, resource constrained | Any scalable system must perform the vast majority of its operations automatically. Only AI can scale to the level required by large enterprises.
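The complexity point can be illustrated with duplicate detection. Comparing every pair of records takes N**2 work, while grouping by hash takes a single pass. This is a generic sketch with hypothetical function names, not the platform's algorithm, and it assumes records are hashable (for example tuples).

```python
from collections import defaultdict


def duplicates_pairwise(records):
    # Baseline: compares every pair, so work grows as N**2.
    found = set()
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                found.add(records[i])
    return found


def duplicates_hashed(records):
    # Counts occurrences in one pass: O(N) on average.
    counts = defaultdict(int)
    for record in records:
        counts[record] += 1
    return {record for record, n in counts.items() if n > 1}
```

Both functions return the same set; only the hashed version stays practical as the number of records grows.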
7. AI Data Quality AI (2/6)
AI Data Quality AI services will use probabilistic algorithms applied
uniquely to every input data set:
Rules identification and generation
Hyper Fingerprinting: a unique profile signature based on column properties
New data identification and automatic rules execution
Numerical and string drift in the existing data
Record Anomalies identification
Fingerprint Maintenance/Evolution via DQ Knowledge Graph
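One way to picture a fingerprint "based on column properties" and a numerical drift check is sketched below. The chosen properties, the SHA-256 hash, and the 10% relative tolerance are all assumptions for illustration; the deck does not describe the actual Hyper Fingerprinting algorithm, which it says is probabilistic.

```python
import hashlib
import json
import statistics


def column_properties(values):
    # A small, hypothetical set of column properties.
    total = len(values) or 1
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    all_numeric = bool(non_null) and len(numeric) == len(non_null)
    return {
        "null_ratio": round(1 - len(non_null) / total, 3),
        "distinct_ratio": round(len(set(non_null)) / total, 3),
        "basic_type": "numeric" if all_numeric else "string",
        "mean": round(statistics.fmean(numeric), 3) if numeric else None,
    }


def fingerprint(properties):
    # Deterministic signature: hash of the key-sorted JSON form of the properties.
    payload = json.dumps(properties, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def numeric_drift(old, new, tolerance=0.1):
    # Flags drift when the mean moves by more than `tolerance` (relative change).
    if old["mean"] is None or new["mean"] is None:
        return old["basic_type"] != new["basic_type"]
    base = abs(old["mean"]) or 1.0
    return abs(new["mean"] - old["mean"]) / base > tolerance
```

An exact hash only detects that something changed; a tolerance-based comparison such as `numeric_drift` is what decides whether the change matters.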
9. AI Data Quality AI (4/6)
Single Columns - Cardinalities
(1) Number of rows
(2) Number of null values
(3) Percentage of null values
(4) Number of distinct values; sometimes called “cardinality”
(5) Number of distinct values divided by the number of rows
Single Columns - Value distributions
(6) Frequency histograms (equi-width, equi-depth)
(7) Minimum and maximum values in a numeric column
(8) Constancy: Frequency of most frequent value divided by number of rows
(9) Quartiles: 3 points that divide the (numeric) values into 4 equal groups
(10) Distribution of first digit in numeric values; to check Benford’s law
Single Columns - Patterns, data types, and domains
(11) Basic type (e.g., numeric, alphanumeric, date, time)
(12) DBMS-specific data type (e.g., varchar, timestamp)
(13) Measurement of value length (minimum, maximum, average, and median)
(14) Maximum number of digits in numeric values
(15) Maximum number of decimals in numeric values
(16) Histogram of value patterns (Aa9...)
(17) Generic semantic data type (e.g., code, date/time, quantity, identifier)
(18) Semantic domain (e.g., credit card, first name, city)
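Several of the single-column metrics listed above can be computed with the Python standard library alone. The sketch below covers items (1) to (5), (7) to (10) and (16). The function names and the dictionary layout are hypothetical, and nulls are assumed to be represented as `None`.

```python
import re
import statistics
from collections import Counter


def value_pattern(value):
    # (16) Uppercase letters become "A", lowercase "a", digits "9".
    text = re.sub(r"[A-Z]", "A", str(value))
    text = re.sub(r"[a-z]", "a", text)
    return re.sub(r"[0-9]", "9", text)


def first_digit(number):
    # First significant digit of a non-zero number, e.g. 0.042 -> 4.
    return int(f"{abs(number):.15e}"[0])


def profile_column(values):
    rows = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    nulls = rows - len(non_null)
    profile = {
        "rows": rows,                                            # (1)
        "nulls": nulls,                                          # (2)
        "null_pct": 100 * nulls / rows if rows else 0.0,         # (3)
        "distinct": len(counts),                                 # (4)
        "uniqueness": len(counts) / rows if rows else 0.0,       # (5)
        "constancy": counts.most_common(1)[0][1] / rows if counts else 0.0,  # (8)
        "patterns": Counter(value_pattern(v) for v in non_null),  # (16)
    }
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    if len(numeric) >= 2:
        profile["min"], profile["max"] = min(numeric), max(numeric)   # (7)
        profile["quartiles"] = statistics.quantiles(numeric, n=4)     # (9)
        # (10) Digit counts to compare against Benford's law.
        profile["first_digits"] = Counter(first_digit(v) for v in numeric if v != 0)
    return profile
```

The semantic items (17) and (18) are not covered here, since they need reference dictionaries or trained classifiers rather than simple counting.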
11. AI Data Quality AI (6/6)
Examples of rules generated unsupervised with AI Data Quality AI services
Category | Column / condition | Rule Parameters
Uniqueness | Loan Number | Cannot be duplicate
Completeness | Loan Closing Date | Cannot be Null
Conformity | Loan Closing Date | Valid Format yyyyMMdd
Validity | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 | THEN `Investor Type`=3
Drift | Income Documentation | Acceptable Values 1,3,4,6
Timeliness | First_Payment_Date | Must be within 90 days of the Loan_Closing_Date
Consistency | Difference Original_Credit_Score - Current_Credit_Score | Lower_Limit: -221.2 Upper_Limit: 207.2
Accuracy | IF `Property State`=GA AND `Loan Source`=4 AND `Product Type`=1 AND `Investor Type`=3 | THEN `Unpaid Principal Balance` will have the following range: Lower_Limit: 0 Upper_Limit: 260853
Accuracy | IF `Property State`=CT AND `Loan Source`=2 AND `Product Type`=6 AND `Investor Type`=7 | THEN `Unpaid Principal Balance` will have the following range: Lower_Limit: 0 Upper_Limit: 1
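Some of the rule examples above (Uniqueness, Completeness, Conformity, Timeliness and the first Validity rule) could be checked as in the sketch below. The dictionary-per-record layout, the function names, and the reading of "within 90 days" as an absolute difference are assumptions for illustration.

```python
from datetime import datetime


def parse_yyyymmdd(text):
    # Returns a date for a strict 8-digit yyyyMMdd string, otherwise None.
    if not (isinstance(text, str) and len(text) == 8 and text.isdigit()):
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None


def check_record(record):
    # Returns the rule categories that this record violates.
    failures = []
    closing = parse_yyyymmdd(record.get("Loan Closing Date"))
    if record.get("Loan Closing Date") is None:
        failures.append("Completeness")
    elif closing is None:
        failures.append("Conformity")
    first_payment = parse_yyyymmdd(record.get("First_Payment_Date"))
    # Assumption: "within 90 days" means an absolute gap of at most 90 days.
    if closing and first_payment and abs((first_payment - closing).days) > 90:
        failures.append("Timeliness")
    condition = (
        record.get("Property State") == "GA"
        and record.get("Loan Source") == 4
        and record.get("Product Type") == 1
    )
    if condition and record.get("Investor Type") != 3:
        failures.append("Validity")
    return failures


def duplicate_loan_numbers(records):
    # Uniqueness: loan numbers that occur more than once.
    seen, duplicates = set(), set()
    for record in records:
        number = record.get("Loan Number")
        if number in seen:
            duplicates.add(number)
        seen.add(number)
    return duplicates
```

Uniqueness is evaluated across the whole data set, while the other rules are evaluated per record, which is why they are split into two functions.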
17. Metrics
Category | Description
Column Profiling | What are the data’s physical characteristics? Across multiple tables?
Relationship | What relationships exist in the data set? Across multiple tables?
Redundancy | What data is redundant? Orphan Analysis
Completeness | What data is missing or unusable?
Conformity | What data is stored in a non-standard format?
Consistency | What data gives conflicting information?
Accuracy | What data is incorrect or out of date?
Duplication | What data records are duplicated?
Integrity | What data is missing important relationship linkages?
Range | What scores, values, calculations are outside of range?
AI Data Quality Dashboard (1/4)
Data Quality Metrics
18. AI Data Quality Dashboard (1/3)
Kibana-based dashboards:
- easy to build/change but limited customisation
- drill down functionality
- can be displayed on office screens
20. AI Data Quality Dashboard (2/3)
Drill-down functionality to get to the bottom of a problem via preset paths and ad hoc
investigation
21. AI Data Quality Dashboard (3/3)
- 3D Graph chart to visualize and explore the inter-relationship between records /
columns
- Integral part of AI Data Quality Knowledge Base