SlideShare a Scribd company logo
Data Quality with or
without Apache Spark
and its ecosystem
Serge Smertin
Sr. Resident Solutions Architect at
Databricks
▪ Intro
▪ Dimensions
▪ Frameworks
▪ TLDR
▪ Outro
About me
▪ Worked in all stages of data
lifecycle for the past 14 years
▪ Built data science platforms from
scratch
▪ Tracked cyber criminals through
massively scaled data forensics
▪ Built anti-PII analysis measures
for payments industry
▪ Bringing Databricks strategic
customers to next level as
full-time job now
Colleen Graham
“Performance Management Driving BI Spending”, InformationWeek, February 14, 2006
https://www.informationweek.com/performance-management-driving-bi-spending/d/d-id/10405
52
Data quality requires certain
level of sophistication within a
company to even understand
that it’s a problem.
Data
Catalogs
Data
Profiling
ETL
Quality Checks
Metrics repository
Alerting
Noise filtering
Dashboards Oncall
Data
Catalogs
Data
Profiling
ETL
Metrics repository
Alerting
Noise filtering
Dashboards Oncall
Completeness
Consistency
Uniqueness
Timeliness
Relevance
Accuracy
Validity
Quality Checks
Record
level
Database
level
- Stream-friendly
- Quarantine invalid data
- Debug and re-process
- Make sure to (re-)watch
“Make reliable ETL easy
on Delta Lake” talk
- Batch-friendly
- See health of the entire pipeline
- Detect processing anomalies
- Reconciliation testing
- Mutual information analysis
- This talk
Data owners and
Subject Matter
Experts define
ideal shape of the
data
May not fully cover
all aspects, when
number of
datasets is bigger
that SME team
Often is the only way
for larger orgs,
where expertise still
has to be developed
internally
May lead to
incomplete data
coverage and
missed signals about
problems in data
pipelines
Exploration
Expertise
Semi-supervised
code generation
based on data
profiling results
May overfit
alerting with rules
that are too strict
by default,
resulting in more
noise than signal
Automation
Few solutions exist in the open-source
community either in the form of libraries or
complete stand-alone platforms, which can be
used to assure a certain data quality, especially
when continuous imports happen.
“1” if check(s)
succeeded for a given
row. Result is
averaged.
Streaming friendly.
Success
Keys
Check compares
incoming batch
with existing
dataset - e.g.
unique keys
Domain
Keys
Materialised
synthetic
aggregations - e.g.
is this batch |2σ|
records different
than previous?
Dataset
Metrics
Repeat computation
in a separate,
simplified pipeline
and validate results -
e.g. double-entry
bookkeeping
Reconciliation
Tests
If you “build your own everything” - consider
embedding Deequ.It has has constraint
suggestion among advanced enterprise
features like data profiling and anomaly
detection out of the box, though documentation
is not that extensive. And you may want to fork it
internally.
Deequ code
generation
from pydeequ.suggestions import *
suggestionResult = (
ConstraintSuggestionRunner(spark)
.onData(spark.table('demo'))
.addConstraintRule(DEFAULT())
.run())
print('from pydeequ.checks import *')
print('check = (Check(spark, CheckLevel.Warning, "Generated check")')
for suggestion in suggestionResult['constraint_suggestions']:
if 'Fractional' in suggestion['suggesting_rule']:
continue
print(f' {suggestion["code_for_constraint"]}')
print(')')
from pydeequ.checks import *
check = (Check(spark, CheckLevel.Warning,
"Generated check")
.isComplete("b")
.isNonNegative("b")
.isComplete("a")
.isNonNegative("a")
.isUnique("a")
.hasCompleteness("c", lambda x: x >= 0.32,
"It should be above 0.32!"))
Great Expectations is less enterprise'y data
validation platform written in Python, that
focuses on supporting Apache Spark among
other data sources, like Postgres, Pandas,
BigQuery, and so on.
Pandas Profiling
▪ Exploratory Data Analysis
simplified by generating HTML
report
▪ Native bi-directional
integration with Great
Expectations
▪ great_expectations
profile DATASOURCE
▪ (pandas_profiling
.ProfileReport(pandas_df)
.to_expectation_suite())
https://pandas-profiling.github.io/pandas-profiling/
Apache Griffin may be the most
enterprise-oriented solution with user interface
available, given the fact it being Apache
top-level project and backed up by eBay since
2016, but it is not as easily embeddable into
existing applications, because it requires
standalone deployment along with JSON DSL
definitions for rules.
Completeness
SELECT AVG(IF(c IS NOT NULL, 1, 0)) AS isComplete FROM demo
Deequ
PySpark
Great Expectations
SQL
Uniqueness
SELECT (COUNT(DISTINCT c) / COUNT(1)) AS isUnique FROM demo
Deequ
Great Expectations
PySpark
SQL
Validity
SELECT AVG(IF(a < b, 1, 0)) AS isValid FROM demo
Deequ
Great Expectations
PySpark
SQL
Timeliness
SELECT NOW() - MAX(rawEventTime) AS delay
FROM processed_events
raw events processed
events
Honorable Mentions
• https://github.com/FRosner/drunken-data-quality
• https://github.com/databrickslabs/dataframe-rules-engine
Make sure to (re-)watch
“Make reliable ETL easy
on Delta Lake” talk
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

What's hot

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
Mark Kromer
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
Manish Gupta
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
Alexey Grishchenko
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
James Serra
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
Databricks
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
Dmitry Anoshin
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
Jeffrey T. Pollock
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
Ilham31574
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
 

What's hot (20)

Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Technical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdfTechnical Deck Delta Live Tables.pdf
Technical Deck Delta Live Tables.pdf
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 

Similar to Data Quality With or Without Apache Spark and Its Ecosystem

AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
Intel® Software
 
Testing Big Data solutions fast and furiously
Testing Big Data solutions fast and furiouslyTesting Big Data solutions fast and furiously
Testing Big Data solutions fast and furiously
Katherine Golovinova
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Christopher Gutknecht
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
Pavel Hardak
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Databricks
 
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QAFest
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
Neo4j
 
Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017
William Lee
 
Scalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDBScalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDB
ScyllaDB
 
Building functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortalBuilding functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortal
Dmitriy Gumeniuk
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
Josh Patterson
 
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction ProfilerSplunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
Lars Albertsson
 
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Sergii Khomenko
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
Miguel González-Fierro
 
Maximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and PracticesMaximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and Practices
Jeff Bertman
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
Neo4j
 
How to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipelineHow to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipeline
CloverDX
 

Similar to Data Quality With or Without Apache Spark and Its Ecosystem (20)

AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019AIDC NY: BODO AI Presentation - 09.19.2019
AIDC NY: BODO AI Presentation - 09.19.2019
 
Testing Big Data solutions fast and furiously
Testing Big Data solutions fast and furiouslyTesting Big Data solutions fast and furiously
Testing Big Data solutions fast and furiously
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
 
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiouslyQA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
QA Fest 2019. Дмитрий Собко. Testing Big Data solutions fast and furiously
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017Prediction io 架構與整合 -DataCon.TW-2017
Prediction io 架構與整合 -DataCon.TW-2017
 
Scalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDBScalable and Resilient Security Ratings Platform with ScyllaDB
Scalable and Resilient Security Ratings Platform with ScyllaDB
 
Building functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortalBuilding functional Quality Gates with ReportPortal
Building functional Quality Gates with ReportPortal
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction ProfilerSplunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
Splunk conf2014 - Dashboard Fun - Creating an Interactive Transaction Profiler
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
Building Ranking Infrastructure: Data-Driven, Lean, Flexible - Sergii Khomenk...
 
Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...Running Intelligent Applications inside a Database: Deep Learning with Python...
Running Intelligent Applications inside a Database: Deep Learning with Python...
 
Maximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and PracticesMaximize Big Data ROI via Best of Breed Patterns and Practices
Maximize Big Data ROI via Best of Breed Patterns and Practices
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
How to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipelineHow to build an automated customer data onboarding pipeline
How to build an automated customer data onboarding pipeline
 

More from Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 

More from Databricks (20)

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 

Recently uploaded

The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
sameer shah
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Kaxil Naik
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
VyNguyen709676
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
ihavuls
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
nyfuhyz
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
nuttdpt
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
Timothy Spann
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
y3i0qsdzb
 

Recently uploaded (20)

The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
 
writing report business partner b1+ .pdf
writing report business partner b1+ .pdfwriting report business partner b1+ .pdf
writing report business partner b1+ .pdf
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
一比一原版(UMN文凭证书)明尼苏达大学毕业证如何办理
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
一比一原版(UCSB文凭证书)圣芭芭拉分校毕业证如何办理
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
一比一原版巴斯大学毕业证(Bath毕业证书)学历如何办理
 

Data Quality With or Without Apache Spark and Its Ecosystem

  • 1. Data Quality with or without Apache Spark and its ecosystem Serge Smertin Sr. Resident Solutions Architect at Databricks
  • 2. ▪ Intro ▪ Dimensions ▪ Frameworks ▪ TLDR ▪ Outro
  • 3. About me ▪ Worked in all stages of data lifecycle for the past 14 years ▪ Built data science platforms from scratch ▪ Tracked cyber criminals through massively scaled data forensics ▪ Built anti-PII analysis measures for payments industry ▪ Bringing Databricks strategic customers to next level as full-time job now
  • 4. Colleen Graham “Performance Management Driving BI Spending”, InformationWeek, February 14, 2006 https://www.informationweek.com/performance-management-driving-bi-spending/d/d-id/10405 52 Data quality requires certain level of sophistication within a company to even understand that it’s a problem.
  • 6. Data Catalogs Data Profiling ETL Metrics repository Alerting Noise filtering Dashboards Oncall Completeness Consistency Uniqueness Timeliness Relevance Accuracy Validity Quality Checks
  • 7. Record level Database level - Stream-friendly - Quarantine invalid data - Debug and re-process - Make sure to (re-)watch “Make reliable ETL easy on Delta Lake” talk - Batch-friendly - See health of the entire pipeline - Detect processing anomalies - Reconciliation testing - Mutual information analysis - This talk
  • 8. Data owners and Subject Matter Experts define ideal shape of the data May not fully cover all aspects, when number of datasets is bigger that SME team Often is the only way for larger orgs, where expertise still has to be developed internally May lead to incomplete data coverage and missed signals about problems in data pipelines Exploration Expertise Semi-supervised code generation based on data profiling results May overfit alerting with rules that are too strict by default, resulting in more noise than signal Automation
  • 9. Few solutions exist in the open-source community either in the form of libraries or complete stand-alone platforms, which can be used to assure a certain data quality, especially when continuous imports happen.
  • 10. “1” if check(s) succeeded for a given row. Result is averaged. Streaming friendly. Success Keys Check compares incoming batch with existing dataset - e.g. unique keys Domain Keys Materialised synthetic aggregations - e.g. is this batch |2σ| records different than previous? Dataset Metrics Repeat computation in a separate, simplified pipeline and validate results - e.g. double-entry bookkeeping Reconciliation Tests
  • 11. If you “build your own everything” - consider embedding Deequ.It has has constraint suggestion among advanced enterprise features like data profiling and anomaly detection out of the box, though documentation is not that extensive. And you may want to fork it internally.
  • 12. Deequ code generation from pydeequ.suggestions import * suggestionResult = ( ConstraintSuggestionRunner(spark) .onData(spark.table('demo')) .addConstraintRule(DEFAULT()) .run()) print('from pydeequ.checks import *') print('check = (Check(spark, CheckLevel.Warning, "Generated check")') for suggestion in suggestionResult['constraint_suggestions']: if 'Fractional' in suggestion['suggesting_rule']: continue print(f' {suggestion["code_for_constraint"]}') print(')') from pydeequ.checks import * check = (Check(spark, CheckLevel.Warning, "Generated check") .isComplete("b") .isNonNegative("b") .isComplete("a") .isNonNegative("a") .isUnique("a") .hasCompleteness("c", lambda x: x >= 0.32, "It should be above 0.32!"))
  • 13. Great Expectations is less enterprise'y data validation platform written in Python, that focuses on supporting Apache Spark among other data sources, like Postgres, Pandas, BigQuery, and so on.
  • 14. Pandas Profiling ▪ Exploratory Data Analysis simplified by generating HTML report ▪ Native bi-directional integration with Great Expectations ▪ great_expectations profile DATASOURCE ▪ (pandas_profiling .ProfileReport(pandas_df) .to_expectation_suite()) https://pandas-profiling.github.io/pandas-profiling/
  • 15. Apache Griffin may be the most enterprise-oriented solution with user interface available, given the fact it being Apache top-level project and backed up by eBay since 2016, but it is not as easily embeddable into existing applications, because it requires standalone deployment along with JSON DSL definitions for rules.
  • 16.
  • 17. Completeness SELECT AVG(IF(c IS NOT NULL, 1, 0)) AS isComplete FROM demo Deequ PySpark Great Expectations SQL
  • 18. Uniqueness SELECT (COUNT(DISTINCT c) / COUNT(1)) AS isUnique FROM demo Deequ Great Expectations PySpark SQL
  • 19. Validity SELECT AVG(IF(a < b, 1, 0)) AS isValid FROM demo Deequ Great Expectations PySpark SQL
  • 20. Timeliness SELECT NOW() - MAX(rawEventTime) AS delay FROM processed_events raw events processed events
  • 21. Honorable Mentions • https://github.com/FRosner/drunken-data-quality • https://github.com/databrickslabs/dataframe-rules-engine Make sure to (re-)watch “Make reliable ETL easy on Delta Lake” talk
  • 22. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.