Data Quality
Automatic enforcement in real-time with machine learning.
Max Martynov, VP of Technology
Introducing Grid Dynamics technology services
Digital transformation Big data, real time analytics, ML & AI
Microservices replatforming DevOps & cloud enablement
Open Source Cloud-ready Scalable Automated
12 years of
experience in digital
transformation.
Retailer X, Daily Sales – Executive Summary
[Chart: Revenue, $M and Gross Profit, $M per day, 7/1/19 to 7/14/19, on a 0 to 20 axis; weekends marked.]
[Diagram: the data landscape grows in stages, from databases feeding an EDW, to a data lake and EDW fed by databases and files, to a cloud data lake and EDW fed by databases, files, message queues, cloud services, applications and APIs.]
1. Trust is hard to build and easy to lose.
2. Distrust in data slows down decisions.
3. Slow decisions prevent agility.
Data corruption reasons
1. Code
2. Data Sources
3. Infrastructure
Traditional approach to testing
[Diagram: in a test environment, a test suite runs the ETL code on a fixed input and compares the actual output with the expected output.]
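The traditional approach can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: `etl_transform`, the order fields and the fixture data are all hypothetical.

```python
def etl_transform(rows):
    """Hypothetical ETL step: total revenue per store, skipping cancelled orders."""
    totals = {}
    for row in rows:
        if row["status"] == "cancelled":
            continue
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals


def run_test(transform, test_input, expected):
    """Run the ETL code on a fixed input and compare actual with expected."""
    actual = transform(test_input)
    return actual == expected, actual


# Fixed input and expected output prepared by hand for the test environment.
TEST_INPUT = [
    {"store": "A", "amount": 10.0, "status": "paid"},
    {"store": "A", "amount": 5.0, "status": "cancelled"},
    {"store": "B", "amount": 7.5, "status": "paid"},
]
EXPECTED = {"A": 10.0, "B": 7.5}

passed, actual = run_test(etl_transform, TEST_INPUT, EXPECTED)
print("passed" if passed else f"failed: {actual}")
```

A test like this exercises only the code. Corruption that originates in production data sources or infrastructure never reaches a fixed test input, so it would not be caught here.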
Data quality goals
1. Detect data corruption.
2. Prevent it from spreading.
3. Alert the support team.
Real-time data quality enforcement
[Diagram: in the production data lake, sources (DBs, files, MQ, apps, APIs) feed the data processing jobs. Data quality jobs run alongside the data processing jobs. When a check fails, the data quality job either alerts and stops the pipeline, or alerts and continues.]
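One way to picture the two failure modes ("alert & stop pipeline" and "alert & continue") is a small quality gate that runs after a processing step. The sketch below is an assumption about how such a gate could look; `Check`, `OnFailure`, `PipelineStopped` and `run_quality_job` are hypothetical names, not part of any tool named in the talk.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class OnFailure(Enum):
    STOP = "alert & stop pipeline"
    CONTINUE = "alert & continue"


@dataclass
class Check:
    name: str
    predicate: Callable[[List[dict]], bool]  # True means the batch looks healthy
    on_failure: OnFailure


class PipelineStopped(Exception):
    """Raised when a critical check fails, so downstream jobs do not run."""


def run_quality_job(batch, checks, alert):
    """Run every check on a batch; alert on each failure, stop on critical ones."""
    for check in checks:
        if check.predicate(batch):
            continue
        alert(f"{check.name} failed: {check.on_failure.value}")
        if check.on_failure is OnFailure.STOP:
            raise PipelineStopped(check.name)
    return batch


alerts = []  # stands in for a real alerting channel
checks = [
    Check("non_empty", lambda b: len(b) > 0, OnFailure.STOP),
    Check("has_price", lambda b: all(r.get("price") is not None for r in b),
          OnFailure.CONTINUE),
]

# A batch with a missing price raises an alert but is passed on.
good = run_quality_job([{"price": 1.0}, {"price": None}], checks, alerts.append)

# An empty batch raises an alert and stops the pipeline.
try:
    run_quality_job([], checks, alerts.append)
    stopped = False
except PipelineStopped:
    stopped = True
```

Which checks stop the pipeline and which only alert is a policy decision for each dataset; the split shown here is arbitrary.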
[Diagram: data flows from the data source into the data lake through the main data processing pipeline; the data quality checks attach a confidence to the data.]
1. Compare with SoR
2. Validate business rules
3. Data profiling and anomaly detection
1. Control divergence from SoR
[Diagram: the dataset imported into the data lake is compared with the data in the data source (SoR, the system of record).]
1. Validate correctness of import.
2. Prevent stale data.
3. Prevent corruption accumulation in stream processing use cases.
4. Check data before it gets in the lake.
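A comparison between the SoR and the data lake could be sketched as a keyed diff of row fingerprints. This is an illustrative assumption rather than the method used in the talk; the function names, the `id` key and the sample rows are hypothetical, and a real job would compare large datasets in a distributed engine rather than in memory.

```python
import hashlib


def row_fingerprint(row):
    """Stable hash of a row's fields, independent of key order."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_with_sor(sor_rows, lake_rows, key="id"):
    """Report rows that are missing, unexpected, or different in the lake."""
    sor = {r[key]: row_fingerprint(r) for r in sor_rows}
    lake = {r[key]: row_fingerprint(r) for r in lake_rows}
    return {
        "missing_in_lake": sorted(sor.keys() - lake.keys()),
        "unexpected_in_lake": sorted(lake.keys() - sor.keys()),
        "mismatched": sorted(k for k in sor.keys() & lake.keys()
                             if sor[k] != lake[k]),
    }


sor_rows = [
    {"id": 1, "sku": "A", "qty": 5},
    {"id": 2, "sku": "B", "qty": 3},
    {"id": 3, "sku": "C", "qty": 9},
]
lake_rows = [
    {"id": 1, "sku": "A", "qty": 5},
    {"id": 2, "sku": "B", "qty": 4},  # value drifted from the SoR
    {"id": 4, "sku": "D", "qty": 1},  # not present in the SoR
]

report = compare_with_sor(sor_rows, lake_rows)
print(report)
```

Rows missing from the lake point to an incomplete or stale import, while mismatched rows point to corruption introduced during or after the import.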
2. Validate business rules
[Diagram: a dataset in the data lake is checked for nulls and data ranges.]
1. Enforce schema.
2. Check for nulls.
3. Validate data ranges.
4. Specify and enforce data invariants.
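The four kinds of rules can be illustrated with a small row validator. The schema, ranges and the `total = quantity * unit_price` invariant below are invented for the example; real rules would come from the business owners of each dataset.

```python
# Hypothetical schema and ranges for an orders dataset.
SCHEMA = {"order_id": int, "quantity": int, "unit_price": float, "total": float}
RANGES = {"quantity": (1, 1000), "unit_price": (0.01, 10000.0)}


def validate_row(row):
    """Return a list of rule violations for one row (empty list means valid)."""
    errors = []
    # Schema and null checks.
    for field, expected_type in SCHEMA.items():
        if field not in row:
            errors.append(f"{field}: missing")
        elif row[field] is None:
            errors.append(f"{field}: null")
        elif not isinstance(row[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    if errors:
        return errors  # range and invariant checks need well-typed values
    # Range checks.
    for field, (low, high) in RANGES.items():
        if not low <= row[field] <= high:
            errors.append(f"{field}: out of range")
    # Invariant: total must equal quantity * unit_price (within a cent).
    if abs(row["total"] - row["quantity"] * row["unit_price"]) > 0.01:
        errors.append("total: invariant violated")
    return errors


rows = [
    {"order_id": 1, "quantity": 2, "unit_price": 9.99, "total": 19.98},
    {"order_id": 2, "quantity": None, "unit_price": 5.0, "total": 5.0},
    {"order_id": 3, "quantity": 0, "unit_price": 5.0, "total": 0.0},
    {"order_id": 4, "quantity": 2, "unit_price": 5.0, "total": 12.0},
]
results = [validate_row(r) for r in rows]
for row, errs in zip(rows, results):
    print(row["order_id"], errs or "ok")
```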
3. Anomaly detection
[Diagram: a dataset in the data lake goes through data profiling and anomaly detection.]
1. Fully automatic data quality enforcement.
2. Collect data profile, metrics and statistics.
3. Train ML models.
4. Find anomalies in data.
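The profile-then-detect loop can be sketched as follows. Note the substitution: instead of a trained ML model, this example uses a plain z-score against historical metric values, which is only a simple statistical stand-in. The function names, the threshold of 3 standard deviations and the sample history are all assumptions made for illustration.

```python
from statistics import mean, stdev


def profile(rows, field):
    """Collect a small data profile for one batch: row count and null rate."""
    nulls = sum(1 for r in rows if r.get(field) is None)
    return {"count": len(rows), "null_rate": nulls / len(rows) if rows else 0.0}


def is_anomaly(history, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Invented history of daily row counts for one dataset.
daily_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]

batch_profile = profile([{"price": 1.0}, {"price": None}], "price")
normal_day = is_anomaly(daily_counts, 1008)
broken_day = is_anomaly(daily_counts, 400)
print(batch_profile, normal_day, broken_day)
```

The same check can be applied to any profiled metric, such as null rate or distinct count, by keeping one history per metric and dataset.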
Demo setup
[Diagram: Catalog, Inventory and Orders datasets in the data lake are checked by the data quality component, which produces a data profile and feeds reporting & alerting.]
Live demo
Anomaly detection example
Anomaly detection example: zooming in
Capabilities for enterprise data quality and governance
• Enables widespread adoption.
• Enforces enterprise-level controls and data usage policies.
• Increases consistency and confidence in decision making.
• Decreases the risk of regulatory fines.
• Improves data security.
• Facilitates accountability for information quality.
• Minimizes or eliminates duplication of effort.
Data Governance Platform
[Diagram: platform components and capabilities.]
Components: Data Catalog, Dataset Profile, Lineage Dashboard, Data Glossary, Data Quality, Access and Security.
Capabilities: Metadata Management, Full-text Search, Data Quality Status, Schema / Summary, Data Profiling, Mapping to Glossary, Change Log, Dependency Detection, Consumers, Flow Visualization, Glossary Portal, Knowledge Base, Fields Fingerprinting, Business Rules, Anomaly Detection, Alerting, Access Rules, Compliance Policies, Policy Engine.
www.griddynamics.com
Thank you!
Demo screenshots
Dataproc cluster
Airflow pipeline
Griffin measures
Anomaly detection: normal data
Anomaly detection: anomaly
Anomaly detection: return to normal
Anomaly detection: historical view
Anomaly detection (counts): anomaly & return
Uniqueness: normal data
Uniqueness: anomaly
Uniqueness: return to normal
Uniqueness: historical view
Nulls: normal data
Nulls: anomaly
Nulls: return to normal
Nulls: historical view
Ranges: historical view
Completeness: anomalies

Dynamic Talks: "Implementing data quality automation with open source stack" -Max Martynov