
DataEngConf SF16 - Data Asserts: Defensive Data Science


Talk by Tommy Guy, Microsoft.



  1. Data Asserts: Defensive Data Science (Tommy Guy, Microsoft)
  2. Observation: Complexity in the Pipeline
  3. Our pipeline: DATA!!! → Insight! → Direction! → Strategy!
  4. Our pipeline in reality: bugs tend to compound.
  5. How do Engineers Manage Complexity? Encapsulate: create functions/classes/subsystems with clear APIs; this helps isolate complexity. Integration tests: ensure that the components interact correctly; this helps identify breaking changes.
  6. Data introduces a few complications: pipelines take many upstream dependencies; researcher use cases are frequently unknown and unanticipated by data providers; and pushing requirements upstream to all producers is Sisyphean.
  7. We are not talking about data pipeline tests. The data pipeline teams already ask: Are all rows that are produced actually stored? (Counter fields ensure no dropped rows; sentinel events measure join fidelity.) Are availability SLAs being met? (Progressive server-client merging.)
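The sentinel-event style of join-fidelity check mentioned above can be sketched in a few lines. The sentinel ids and record sets below are invented for illustration; a real pipeline would inject its own known sentinel events on both sides of the join.

```python
# Minimal sketch: sentinel events to measure join fidelity.
# The pipeline plants known sentinel ids on both sides of a join;
# if the join loses any of them, rows are being dropped somewhere.
SENTINELS = {"sentinel-1", "sentinel-2", "sentinel-3"}

def join_fidelity(left_ids, right_ids):
    """Fraction of planted sentinel ids that survive an inner join."""
    joined = set(left_ids) & set(right_ids) & SENTINELS
    return len(joined) / len(SENTINELS)

left = {"u1", "sentinel-1", "sentinel-2", "sentinel-3"}
right = {"u1", "u2", "sentinel-1", "sentinel-2"}   # sentinel-3 was dropped
fidelity = join_fidelity(left, right)
assert fidelity == 2 / 3  # one of three sentinels lost in the join
```

A fidelity below 1.0 is the pipeline team's signal to investigate row loss, independent of any semantic checks the data scientist runs later.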
  8. Data Scientists Require Semantic Correctness: does this field mean what I think it does?
  9. How do Data Scientists identify potential errors?
  10. How do Data Scientists identify potential errors? Some follow-on fact is absurd, which leads to investigation, which finds a broader problem. Example: "If [potential conclusion], then we must have 3 billion OneDrive users… because my user table doesn’t have a primary key… so I should aggregate by user."
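The "aggregate by user" fix from the example can be shown with toy data (the ids are invented): without a primary key, the user table can hold duplicate rows, so counting rows wildly overstates the user population.

```python
# Sketch of the absurdity in the slide: a user table with no primary key
# contains duplicate rows, so a naive row count inflates the user count.
user_rows = ["u-001", "u-001", "u-002", "u-001", "u-003"]  # duplicated ids

naive_user_count = len(user_rows)         # counts duplicate rows
deduped_user_count = len(set(user_rows))  # aggregate by user first

assert naive_user_count == 5
assert deduped_user_count == 3  # the sane figure after aggregating by user
```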
  11. What are your Assumptions? If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:
      • User Id: logged and PII-encrypted similarly in Outlook and OneDrive; the timestamp for the Office purchase is correctly logged; User Id isn’t empty or missing.
      • OneDrive activity: wasn’t automated traffic [identified by a certain flag].
      • Email activity: mobile client identifiers are correct.
      • All fields: any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners.
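These silent assumptions translate naturally into runnable asserts. A minimal sketch, with invented field names (`user_id`, `purchase_ts`, `is_automated`) standing in for the talk's real schema:

```python
# Sketch: make the slide's silent assumptions explicit as asserts over a
# batch of rows. Field names are illustrative, not the actual schema.
rows = [
    {"user_id": "u-001", "purchase_ts": "2016-04-01T09:30:00Z", "is_automated": False},
    {"user_id": "u-002", "purchase_ts": "2016-04-02T11:05:00Z", "is_automated": False},
]

def assert_assumptions(rows):
    for row in rows:
        # User Id isn't empty or missing.
        assert row.get("user_id"), f"empty or missing user_id: {row!r}"
        # A timestamp was logged for the Office purchase.
        assert row.get("purchase_ts"), f"missing purchase timestamp: {row!r}"
        # OneDrive activity wasn't automated traffic.
        assert not row["is_automated"], f"automated traffic leaked in: {row!r}"

assert_assumptions(rows)  # passes silently when every assumption holds
```

Running the checks on every pipeline execution, rather than once at setup time, is what turns these from one-off sanity checks into data asserts.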
  12. What are your Sanity Checks? If a column “OfficeId” is really a user id, it has certain known properties:
      • Never null/empty: nulls cause job-breaking data skew issues.
      • Users are 1:* with Tenants: a logical constraint; a violation is a sign you are missing something.
      • Very high cardinality: if this isn’t true, it’s unlikely that it’s a user id.
      • All rows in event data join to it: otherwise, your data is incomplete.
      • Matches a certain regex: a sanity check; if this isn’t true, it’s unlikely that it’s a user id.
      Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.
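A minimal sketch of these sanity checks as code; the id-format regex and the cardinality threshold are assumptions for illustration, not the talk's actual values:

```python
import re

# Sketch of the slide's sanity checks over a hypothetical user-id column.
def check_user_id_column(ids, event_ids,
                         id_pattern=r"^u-\d{3}$",      # invented id format
                         min_cardinality_ratio=0.5):    # invented threshold
    # Never null/empty.
    assert all(i for i in ids), "null or empty id"
    # Very high cardinality: mostly-distinct values, unlike e.g. a tenant id.
    assert len(set(ids)) / len(ids) >= min_cardinality_ratio, \
        "cardinality too low for a user id"
    # Matches a certain regex.
    assert all(re.match(id_pattern, i) for i in ids), "id fails format regex"
    # All rows in event data join to it.
    assert set(event_ids) <= set(ids), "event rows with no matching id"

ids = ["u-001", "u-002", "u-003", "u-004"]
event_ids = ["u-001", "u-003"]
check_user_id_column(ids, event_ids)  # all four checks pass on this toy data
```

Wiring a function like this into the pipeline itself addresses the observation on the slide: the checks get re-run on every execution instead of only at setup time.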
  13. Data Asserts: Defensive Data Science
  14. Data Asserts: Maintain Quality
  15. Data Asserts: Clear Trust Boundaries
  16. Data Asserts: Defensive Data Science (diagram annotation: “These should match!”)
  17. Data Asserts in Production: A Few Observations
      • Most of the analysis-impacting assertion failures we’ve seen were actually errors in our assumptions, not errors in the pipeline.
      • Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used across pipelines.
      • Data asserts are the backbone of data provenance: a data conclusion can directly link to all of the assumptions we made about the input.