DataEngConf SF16 - Data Asserts: Defensive Data Science

Hakka Labs

Talk by Tommy Guy, Microsoft. To hear about future conferences go to http://dataengconf.com

Data Asserts
Defensive Data Science
Tommy Guy
Microsoft

Our pipeline:
DATA!!!
Insight! Direction! Strategy!

Our pipeline in reality: bugs tend to compound
DATA!!!

How do Engineers Manage Complexity?
• Encapsulate: create functions, classes, and subsystems with clear APIs. This helps isolate complexity.
• Integration tests: ensure that the components interact correctly. This helps identify breaking changes.

Data introduces a few complications
• Pipelines take on many upstream dependencies.
• Researcher use cases are frequently unknown to, and unanticipated by, data providers.
• Pushing requirements upstream to all producers is Sisyphean.

We are not talking about data pipeline tests
The data pipeline teams ask:
Are all rows that are produced stored?
• Counter fields to ensure no dropped rows
• Sentinel events to measure join fidelity
Are availability SLAs being met?
• Progressive server-client merging
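
For contrast with the semantic checks that follow, here is a minimal sketch of how a counter field might reveal dropped rows. It assumes each producer stamps its rows with an increasing integer counter; the function name `missing_counters` and the sample values are hypothetical and are not taken from the talk.

```python
def missing_counters(received):
    """Return the counter values that never arrived, given the counters seen.

    Only gaps between the smallest and largest received counter are visible;
    rows dropped before the first or after the last one cannot be detected.
    """
    if not received:
        return []
    seen = set(received)
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


# Hypothetical counters read back from storage for one producer.
missing = missing_counters([1, 2, 3, 5, 6, 9])
print(missing)
```

A check like this says whether rows were lost. It says nothing about whether a field means what the analyst thinks it means, which is the subject of the remaining slides.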

Data Scientists Require Semantic Correctness
Does this field mean what I think it does?

How do Data Scientists identify potential errors?
Some follow-on fact is absurd…
… which leads to investigation …
… which finds a broader problem.
Example:
If [potential conclusion], then we must have 3 billion OneDrive users…
… because my user table doesn’t have a primary key …
… so I should aggregate by user.
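
The missing primary key in this example can be caught before it inflates a user count. The sketch below is an illustration only: the `users` rows and the column name `user_id` are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return {key_value: count} for every key value that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical user table in which one user appears twice.
users = [
    {"user_id": "u1", "plan": "free"},
    {"user_id": "u2", "plan": "paid"},
    {"user_id": "u1", "plan": "paid"},
]

dupes = duplicate_keys(users, "user_id")
# Aggregating by user gives the count that the row count overstates.
distinct_users = len({row["user_id"] for row in users})
print(dupes, distinct_users)
```

If `dupes` is non-empty, the table has no usable primary key on that column, and any count of rows will overstate the number of users.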

What are your Assumptions?
If I conclude “Users who upload files to OneDrive are XXX% more likely
to buy Office if they also sent mail through Mobile Outlook”, I’m
making many silent assumptions:
Field | Assumptions
User Id
• Logged and PII-encrypted similarly in Outlook and OneDrive
• Correctly logging timestamp for Office purchase
• User Id isn’t empty or missing
OneDrive activity
• Wasn’t automated traffic [identified by a certain flag].
Email Activity
• Mobile client identifiers are correct.
All
• Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.
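
One way to stop these assumptions from being silent is to state each one as a named assertion that runs before the analysis. This is a minimal sketch and not the speaker's implementation: `data_assert`, `DataAssertError`, and the field names `user_id` and `is_automated` are all hypothetical.

```python
class DataAssertError(AssertionError):
    """Raised when an assumption about the input data does not hold."""


def data_assert(condition, assumption):
    """Fail loudly, naming the assumption that was violated."""
    if not condition:
        raise DataAssertError(f"Assumption violated: {assumption}")


def check_onedrive_events(events):
    """Check two of the listed assumptions, then hand the events back unchanged."""
    data_assert(
        all(e.get("user_id") for e in events),
        "User Id isn't empty or missing",
    )
    data_assert(
        not any(e.get("is_automated") for e in events),
        "OneDrive activity wasn't automated traffic",
    )
    return events


clean = [{"user_id": "u1", "is_automated": False}]
dirty = [{"user_id": "", "is_automated": False}]

check_onedrive_events(clean)  # passes silently
try:
    check_onedrive_events(dirty)
    failure = None
except DataAssertError as err:
    failure = str(err)
print(failure)
```

Because each failure message names the assumption, a broken run points at the belief that needs revisiting, not just at a stack trace.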

What are your Sanity Checks?
• If a column “OfficeId” is really a user id, it has certain known properties:
Assumption | Why does it matter?
Never null/empty | Null or empty values cause job-breaking data skew.
Users are 1:* with Tenants | Logical constraint: a violation is a sign you are missing something.
Very high cardinality | If this isn’t true, it’s unlikely that it’s a user id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex | Sanity check: if this isn’t true, it’s unlikely that it’s a user id.
• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
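
Checks like these are cheap enough to re-run on every load, not only when the pipeline is first set up. The sketch below covers four of the listed properties under stated assumptions: the id pattern `u-\d{4}`, the cardinality threshold, and the sample values are invented for illustration. The user/tenant constraint is left out because its form depends on a schema the slides do not show.

```python
import re


def check_user_id_column(values, event_ids, pattern, min_distinct):
    """Return a dict mapping each assumption to True (holds) or False (violated)."""
    compiled = re.compile(pattern)
    present = [v for v in values if v]  # drops None and empty strings
    distinct = set(present)
    return {
        "never null/empty": len(present) == len(values),
        "very high cardinality": len(distinct) >= min_distinct,
        "all event rows join": all(e in distinct for e in event_ids),
        "matches regex": all(compiled.fullmatch(v) for v in present),
    }


# Hypothetical "OfficeId" column and the ids referenced by event rows.
office_ids = ["u-0001", "u-0002", "u-0003", ""]
event_ids = ["u-0001", "u-0003", "u-0009"]

report = check_user_id_column(
    office_ids, event_ids, pattern=r"u-\d{4}", min_distinct=3
)
failed = sorted(name for name, ok in report.items() if not ok)
print(failed)
```

Returning a report instead of raising on the first failure lets one run surface every violated assumption at once.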

Data Asserts: Defensive Data Science
These should match!

Data Asserts in Production: A Few Observations
• Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.
• Data Asserts is the backbone of data provenance. A conclusion drawn from data can link directly to all of the assumptions we made about the input.
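
The provenance idea can be sketched as a conclusion object that carries the result of every assertion it depended on. This is an illustration under assumptions, not the system described in the talk: the `Conclusion` and `AssertResult` classes and the dataset names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssertResult:
    assumption: str
    dataset: str
    passed: bool


@dataclass
class Conclusion:
    statement: str
    evidence: List[AssertResult] = field(default_factory=list)

    def record(self, assumption: str, dataset: str, passed: bool) -> None:
        """Attach the outcome of one data assert to this conclusion."""
        self.evidence.append(AssertResult(assumption, dataset, passed))

    def is_supported(self) -> bool:
        """True only if at least one assert was recorded and all of them passed."""
        return bool(self.evidence) and all(r.passed for r in self.evidence)

    def failed_assumptions(self) -> List[str]:
        return [r.assumption for r in self.evidence if not r.passed]


conclusion = Conclusion(
    "OneDrive uploaders who also use Mobile Outlook buy Office more often"
)
conclusion.record("User Id isn't empty or missing", "onedrive_events", True)
conclusion.record("Mobile client identifiers are correct", "outlook_events", False)
print(conclusion.is_supported(), conclusion.failed_assumptions())
```

Anyone reading the conclusion later can see which assumptions were tested, against which inputs, and which ones failed.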

2. Observation: Complexity In Pipeline

14. Data Asserts: Maintain Quality

15. Data Asserts: Clear Trust Boundaries
