Data testing
Gleb Mezhanskiy
CEO & Co-founder @ Datafold
Ex-Lyft Data PM
Agenda
1. Principles of data testing
2. Embedding testing in production & development workflows
3. Types of tests, pros/cons & tools
3 principles of effective data testing
1. Embed testing in existing workflows
2. Automate everything
3. Cut the noise
Data testing in production
Goal: catch issues as early and upstream as possible
Run ETL batch -> Run tests -> Tests pass?
> YES: Publish new data
> NO: Notify owners -> Investigate / Fix
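The production flow on this slide can be sketched as a small gate around the publish step. This is a minimal illustration under stated assumptions, not any orchestrator's API: `run_batch_with_tests`, `publish`, and `notify_owners` are hypothetical names, and the two lambda tests are stand-ins for real checks.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Test = Callable[[List[Row]], bool]


def run_batch_with_tests(
    batch: List[Row],
    tests: Dict[str, Test],
    publish: Callable[[List[Row]], None],
    notify_owners: Callable[[List[str]], None],
) -> bool:
    """Publish the batch only if every test passes; otherwise notify owners."""
    failed = [name for name, test in tests.items() if not test(batch)]
    if failed:
        notify_owners(failed)  # NO branch: owners investigate / fix
        return False
    publish(batch)  # YES branch: new data becomes visible downstream
    return True


# Hypothetical usage with in-memory stand-ins for the warehouse and alerting.
published: List[Row] = []
alerts: List[str] = []

tests = {
    "non_empty": lambda rows: len(rows) > 0,
    "revenue_non_negative": lambda rows: all(r["revenue"] >= 0 for r in rows),
}

good_batch = [{"user_id": 1, "revenue": 10.0}]
bad_batch = [{"user_id": 2, "revenue": -5.0}]

ok_good = run_batch_with_tests(good_batch, tests, published.extend, alerts.extend)
ok_bad = run_batch_with_tests(bad_batch, tests, published.extend, alerts.extend)
```

The point of the design is that the bad batch never reaches `published`: the failure is caught upstream and only the names of the failing tests are sent to owners.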
What to test in production
> Assertions: checking hard rules
> Metric monitoring: detecting anomalies in metrics
When to use assertions
Value-level checks
> assert `email` is of xxx@yyy.com format
Integrity checks
> assert `user_id` is unique and not-null
Balance checks
> assert SUM(source.revenue) = SUM(target.revenue)
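The three assertion types on this slide can be sketched in plain Python. This is only an illustration, assuming rows are held as dictionaries; the function names are hypothetical, the email pattern is deliberately loose, and the balance check uses a small tolerance (an assumption added here) because floating-point sums rarely match exactly.

```python
import re
from typing import Any, Dict, List

Row = Dict[str, Any]

# Loose xxx@yyy.com shape check, not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_email_format(rows: List[Row]) -> bool:
    """Value-level check: every `email` is a string shaped like xxx@yyy.com."""
    return all(
        isinstance(r.get("email"), str) and EMAIL_RE.match(r["email"]) is not None
        for r in rows
    )


def check_unique_not_null(rows: List[Row], column: str) -> bool:
    """Integrity check: `column` has no nulls and no duplicates."""
    values = [r.get(column) for r in rows]
    return None not in values and len(values) == len(set(values))


def check_balance(source: List[Row], target: List[Row], column: str, tol: float = 1e-9) -> bool:
    """Balance check: SUM(source.column) equals SUM(target.column) within `tol`."""
    total_source = sum(r[column] for r in source)
    total_target = sum(r[column] for r in target)
    return abs(total_source - total_target) <= tol


source = [
    {"user_id": 1, "email": "a@example.com", "revenue": 10.0},
    {"user_id": 2, "email": "b@example.com", "revenue": 5.5},
]
target = [dict(r) for r in source]  # a faithful copy of the source
broken = [
    {"user_id": 1, "email": "not-an-email", "revenue": 10.0},
    {"user_id": 1, "email": "c@example.com", "revenue": 1.0},
]
```

In practice the same rules would usually be declared in one of the tools listed on the next slide rather than hand-written.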
Tools for running assertions
Embedded in ETL tools
> dbt for SQL
> Dagster for general ETL
Standalone
> great_expectations for SQL
> deequ for Spark
When to use metric monitoring
Hard rules don’t work for metrics because of natural variance, trend and seasonality!
Answer: apply a bit of ML!
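To make the contrast with hard rules concrete, here is a small sketch of a seasonality-aware check. It is a simple statistical stand-in (median and median absolute deviation over the same weekday in previous weeks), not the method used by Prophet or Datafold; the function name, the 7-day period, and the threshold of 4 are all assumptions chosen for illustration.

```python
from statistics import median
from typing import List


def is_anomalous(
    history: List[float], latest: float, period: int = 7, threshold: float = 4.0
) -> bool:
    """Compare `latest` with past values at the same point in the seasonal cycle.

    `history` is ordered oldest to newest; `latest` is the value right after it.
    """
    # Values exactly 1, 2, 3, ... periods before `latest` (e.g. the same weekday).
    same_phase = history[-period::-period]
    if len(same_phase) < 3:
        return False  # not enough history to judge
    center = median(same_phase)
    mad = median(abs(v - center) for v in same_phase)
    # 1.4826 rescales MAD to be comparable with a standard deviation.
    scale = max(1.4826 * mad, 1e-9)
    return abs(latest - center) / scale > threshold


# Four weeks of a made-up daily metric: weekend dip plus a mild upward trend.
weekly_shape = [100.0, 102.0, 101.0, 103.0, 105.0, 60.0, 55.0]
history = [base + 2.0 * week for week in range(4) for base in weekly_shape]

monday_on_trend = is_anomalous(history, 108.0)       # next Monday, follows the trend
monday_collapsed = is_anomalous(history, 60.0)       # next Monday, weekend-like value
saturday_normal = is_anomalous(history[:-2], 67.0)   # history ends Friday; next is Saturday
```

A fixed rule such as "metric must be at least 100" would fire every weekend, while this check accepts 67 on a Saturday and flags 60 on a Monday. For stronger trends or multiple seasonalities, a forecasting model is the more appropriate tool.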
Tools for metric monitoring
> Prophet by Facebook
> Datafold Alerting
Data testing in development
Goal: do no harm – prevent breaking things that work
Build & Backfill -> Run tests -> Tests pass?
> YES: Code review -> Deploy
> NO: Investigate / Fix
How to test in development
> Assertions: checking hard rules, just like in production!
> Data diff: visualizing changes in data
Data diff = git diff for data
Compares values between Production and Development
…and distributions
Remember: automate!
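A data diff can be illustrated with a few lines of Python. This is a toy sketch, assuming both versions of a table fit in memory as lists of dictionaries and share a primary key; `data_diff` and `distribution` are hypothetical helpers, not the API of any tool named in this deck.

```python
from collections import Counter
from typing import Any, Dict, List

Row = Dict[str, Any]


def data_diff(prod: List[Row], dev: List[Row], key: str) -> Dict[str, Any]:
    """Row-level diff: keys only on one side, and changed columns per shared key."""
    prod_by_key = {r[key]: r for r in prod}
    dev_by_key = {r[key]: r for r in dev}
    changed: Dict[Any, List[str]] = {}
    for k in sorted(prod_by_key.keys() & dev_by_key.keys()):
        p, d = prod_by_key[k], dev_by_key[k]
        cols = sorted(c for c in p.keys() | d.keys() if p.get(c) != d.get(c))
        if cols:
            changed[k] = cols
    return {
        "only_in_prod": sorted(prod_by_key.keys() - dev_by_key.keys()),
        "only_in_dev": sorted(dev_by_key.keys() - prod_by_key.keys()),
        "changed": changed,
    }


def distribution(rows: List[Row], column: str) -> Counter:
    """Value counts for one column, for comparing distributions across versions."""
    return Counter(r[column] for r in rows)


prod = [
    {"user_id": 1, "plan": "free", "revenue": 0.0},
    {"user_id": 2, "plan": "pro", "revenue": 20.0},
    {"user_id": 3, "plan": "pro", "revenue": 20.0},
]
dev = [
    {"user_id": 1, "plan": "free", "revenue": 0.0},
    {"user_id": 2, "plan": "pro", "revenue": 25.0},
    {"user_id": 4, "plan": "free", "revenue": 0.0},
]

diff = data_diff(prod, dev, key="user_id")
prod_plans = distribution(prod, "plan")
dev_plans = distribution(dev, "plan")
```

Here the diff reports that user 3 disappeared, user 4 appeared, and user 2's `revenue` changed, while the plan counts shift from one free and two pro to two free and one pro. Running such a comparison on every commit or PR is what the "automate" reminder refers to.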
Diff tools
> dbt-audit-helper
> BigDiffy by Spotify
> Datafold Diff
Bottom line
                               Development              Production
What changes in between tests  Source code              Data
Goal                           Prevent regressions      Learn about issues ASAP
Frequency                      On every commit / PR     On every new data batch
Trigger                        GitHub/GitLab + CI       ETL orchestrators
Methods                        Assertions, Data diff    Assertions, Metric monitoring
