Correctness in Data Science - Data Science Pop-up Seattle

Presented by Benjamin S. Skrainka, Principal Data Scientist and Lead Instructor at Galvanize, Inc. For several decades, he has built practical solutions to relevant problems using the best statistical and engineering tools. His expertise spans several problem domains, including sequencing DNA, estimating demand for differentiated products, measuring advertising efficacy, and forecasting for capacity planning. Ben earned an AB in Physics from Princeton University and a PhD in Economics from University College London.

Published in: Data & Analytics
Correctness in Data Science - Data Science Pop-up Seattle

  1. #datapopupseattle
     Correctness in Data Science
     Benjamin Skrainka, Principal Data Scientist + Lead Instructor, Galvanize
     bskrainka galvanize
  2. #datapopupseattle
     UNSTRUCTURED Data Science POP-UP in Seattle
     www.dominodatalab.com
     Produced by Domino Data Lab
     Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.
  3. Correctness in Data Science
     Benjamin S. Skrainka
     October 7, 2015
  4. The correctness problem
     A lot of (data) science is unscientific:
       • “My code runs, so the answer must be correct”
       • “It passed Explain Plan, so the answer is correct”
       • “This model is too complex to have a design document”
       • “It is impossible to unit test scientific code”
       • “The lift from the direct mail campaign is 10%”
  5. Correctness matters
     Bad (data) science:
       • Costs real money and can kill people
       • Will eventually damage your reputation and career
       • Could expose you to litigation
       • An issue of basic integrity and sleeping at night
  6. Objectives
     Today’s goals:
       • Introduce VV&UQ framework to evaluate correctness of scientific models
       • Survey good habits to improve quality of your work
  7. Verification, Validation, & Uncertainty Quantification
  8. Introduction to VV&UQ
     Verification, Validation, & Uncertainty Quantification provides an epistemological framework to evaluate correctness of scientific models:
       • Evidence of correctness should accompany any prediction
       • In absence of evidence, assume predictions are wrong
       • Popper: can only disprove or fail to disprove a model
       • VV&UQ is inductive whereas science is deductive
     Reference: Verification and Validation in Scientific Computing by Oberkampf & Roy
  9. Definitions of VV&UQ
     Definitions of terms (Oberkampf & Roy):
       • Verification:
           • “solving equations right”
           • I.e., code implements the model correctly
       • Validation:
           • “solving right equations”
           • I.e., model has high fidelity to reality
       • Definitions of VV&UQ will vary depending on source ...
     ⇒ Most organizations do not even practice verification...
  10. Definition of UQ
      Definition of Uncertainty Quantification (Oberkampf & Roy):
      Process of identifying, characterizing, and quantifying those factors in an analysis which could affect accuracy of computational results
        • Do your assumptions hold? When do they fail?
        • Does your model apply to the data/situation?
        • Where does your model break down? What are its limits?
  11. Verification of code
      Does your code implement the model correctly?
        • Unit test everything you can:
            • Scientific code can be unit tested
            • Test special cases
            • Test on cases with analytic solutions
            • Test on synthetic data
        • Unit test framework will set up and tear down fixtures
        • Should be able to recover parameters from Monte Carlo data
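A minimal sketch of the "recover parameters from Monte Carlo data" check, using only the standard library. The linear model, the "true" parameter values, and the helper names `fit_line` and `simulate` are assumptions chosen for illustration and do not come from the talk. The idea is to simulate data from known parameters, run the estimator, and confirm the estimates land near the truth; a noise-free run doubles as a case with an analytic solution.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for the model y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(intercept, slope, n_obs, noise_sd, seed):
    """Draw synthetic (x, y) data from a line with known parameters."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n_obs)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


# Hypothetical "true" parameters, chosen only for this test.
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0

# Monte Carlo data: the estimates should be close to the truth.
xs, ys = simulate(TRUE_INTERCEPT, TRUE_SLOPE, n_obs=5000, noise_sd=0.5, seed=0)
a_hat, b_hat = fit_line(xs, ys)

# Special case with an analytic solution: no noise, so recovery is exact
# up to floating-point error.
xs0, ys0 = simulate(TRUE_INTERCEPT, TRUE_SLOPE, n_obs=50, noise_sd=0.0, seed=1)
a_exact, b_exact = fit_line(xs0, ys0)
```

In a test suite, the two cases would become separate test methods with tolerances chosen from the expected sampling error.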
  12. Verification of SQL
      Passing Explain Plan doesn’t mean your SQL is correct:
        • Garbage in, garbage out
        • Check a simple case you can compute by hand
        • Check join plan is correct
        • Check aggregate statistics
        • Check answer is compatible with reality
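One way to "check a simple case you can compute by hand" is to run the query against a tiny in-memory database whose answer is obvious. The sketch below uses Python's standard `sqlite3` module; the `customers` and `orders` tables and the query are hypothetical, made up for illustration. It also checks that the join neither drops nor duplicates rows, which is a common source of wrong aggregates.

```python
import sqlite3

# Tiny hand-checkable fixture (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);
    """
)

# The query under test.
REVENUE_BY_REGION = """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
revenue = conn.execute(REVENUE_BY_REGION).fetchall()

# Computed by hand: east = 3.0, west = 5.0 + 7.0 = 12.0.
expected = [("east", 3.0), ("west", 12.0)]

# Join check: every order should match exactly one customer.
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
n_joined = conn.execute(
    "SELECT COUNT(*) FROM orders AS o "
    "JOIN customers AS c ON c.customer_id = o.customer_id"
).fetchone()[0]

# Aggregate check: the regional totals should add up to the grand total.
grand_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The same pattern carries over to a production database: copy a handful of rows into a scratch schema, work out the answer on paper, and compare.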
  13. Unit test
      import unittest  # the slide imported the unittest2 backport under this name

      import assignment as problems  # module under test


      class TestAssignment(unittest.TestCase):
          def test_zero(self):
              # Compare the computed answer with the expected value from the slide
              result = problems.question_zero()
              self.assertEqual(result, 9198)

          ...


      if __name__ == "__main__":
          unittest.main()
  14. Unit test
      [Figure 1]
  15. Validation of model
      Check your model is a good (enough) representation of reality:
        • “All models are wrong but some are useful” – George Box
        • Run an experiment
        • Perform specification testing
        • Test assumptions hold
        • Beware of endogenous features
  16. Approaches to experimentation
      Many ways to test:
        • A/B test
        • Multi-armed bandit
        • Bayesian A/B test
        • Wald sequential analysis
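As an illustration of the simplest option on this list, a classical A/B test on conversion rates can be analyzed with a two-proportion z-test. The sketch below is one common textbook formulation, not necessarily the analysis the speaker had in mind; the function name and the conversion counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical experiment: 10% vs. 13% conversion with 2000 users per arm.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)

# Sanity check in the spirit of verification: identical arms give z = 0, p = 1.
z_null, p_null = two_proportion_z_test(100, 1000, 100, 1000)
```

The normal approximation assumes reasonably large samples in both arms; the sequential and bandit approaches on the slide exist partly because fixed-sample tests like this one should not be peeked at repeatedly.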
  17. Uncertainty quantification
      There are many types of uncertainty which affect the robustness of your model:
        • Parameter uncertainty
        • Structural uncertainty
        • Algorithmic uncertainty
        • Experimental uncertainty
        • Interpolation uncertainty
      Classified as aleatoric (statistical) and epistemic (systematic)
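Of the types listed, parameter uncertainty of the statistical kind is the easiest to quantify in code. A minimal sketch, assuming a percentile bootstrap is an acceptable tool for the job (the talk does not prescribe one): resample the data with replacement, recompute the estimate each time, and report the spread. The helper name `bootstrap_interval` and the synthetic sample are hypothetical.

```python
import random
import statistics


def bootstrap_interval(sample, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Re-estimate the statistic on n_boot resamples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(sample) for _ in range(n)]) for _ in range(n_boot)
    )
    lower = estimates[round((1.0 - level) / 2.0 * n_boot)]
    upper = estimates[round((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper


# Hypothetical data: 200 noisy measurements centered on 10.
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

estimate = statistics.fmean(sample)
lower, upper = bootstrap_interval(sample, statistics.fmean)
```

This only addresses sampling variability. Structural or other epistemic uncertainty, such as a wrong functional form, is not captured by resampling and needs the validation steps described earlier.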
  18. Good habits
  19. Act like a software engineer
      Use best practices from software engineering:
        • Good design of code
        • Follow a sensible coding convention
        • Version control
        • Use same file structure for every project
        • Unit test
        • Use PEP8 or equivalent
        • Perform code reviews
  20. Reproducible research
      ‘Document what you do and do what you document’:
        • Keep a journal!
        • Data provenance
        • How data was cleaned
        • Design document
        • Specification & requirements
      “Do you keep a journal? You should. Fermi taught me that.” – John A. Wheeler
  21. Follow a workflow
      Use a workflow like CRISP-DM:
        1. Define business question and metric
        2. Understand data
        3. Prepare data
        4. Build model
        5. Evaluate
        6. Deploy
      Ensures you don’t forget any key steps
  22. Automate your data pipeline
      One-touch build of your application or paper:
        • Automate entire workflow from raw data to final result
        • Ensures you perform all steps
        • Ensures all steps are known – no one-off manual adjustments
        • Avoids stupid human errors
        • Auto generate all tables and figures
        • Saves time when handling new data ... which always has subtle changes in formatting
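A one-touch build can be as simple as a single entry point that chains every step, so nothing depends on remembering a manual adjustment. The sketch below is a toy version under stated assumptions: the raw CSV text, the step functions, and `run_pipeline` are all hypothetical, and a real project might use a build tool instead.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical raw extract; the second data row has a missing sales value.
RAW_CSV = "region,sales\nwest,10\neast,\nwest,14\neast,6\n"


def load(raw_text):
    """Parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def clean(rows):
    """Drop rows with missing sales and convert sales to float."""
    return [
        {"region": row["region"], "sales": float(row["sales"])}
        for row in rows
        if row["sales"]
    ]


def summarize(rows):
    """Compute mean sales per region."""
    by_region = defaultdict(list)
    for row in rows:
        by_region[row["region"]].append(row["sales"])
    return {region: mean(values) for region, values in by_region.items()}


def render(summary):
    """Generate the final table as CSV text, sorted by region."""
    lines = ["region,mean_sales"]
    lines += [f"{region},{value:.1f}" for region, value in sorted(summary.items())]
    return "\n".join(lines) + "\n"


def run_pipeline(raw_text):
    """Single entry point: run every step in order, raw data to final table."""
    return render(summarize(clean(load(raw_text))))


report = run_pipeline(RAW_CSV)
```

Because the cleaning rule lives in `clean` and not in someone's memory, rerunning on a new extract applies exactly the same steps, and each step can be unit tested on its own.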
  23. Write flexible code to handle data
      Use constants/macros to access data fields:
        • Code will clearly show what data matters
        • Easier to understand code and data pipeline
        • Easier to debug data problems
        • Easier to handle changes in data formatting
  24. Python example
      import numpy as np

      # Set up column indices
      ix_gdp = 7
      ...
      # Load & clean data
      # genfromtxt returns a 2-D float array, so column slicing works
      # (assumes the file has a single header row)
      m_raw = np.genfromtxt("bea_gdp.csv", delimiter=",", skip_header=1)
      gdp = m_raw[:, ix_gdp]
      ...
  25. Politics...
      Often, there is political pressure to violate best practice:
        • Examples:
            • 80% confidence intervals
            • Absurd attribution window
            • Two year forecast horizon but only three months of data
        • Hard to do right thing vs. senior management
        • Recruit a high-level scientist to advocate
        • Particularly common with forecasting:
            • Often requested by management for CYA
            • Insist on a ‘panel of experts’ for impossible decisions
  26. Conclusion
      Need to raise the quality of data science:
        • VV & UQ provides rigorous framework:
            • Verification: solve the equations right
            • Validation: solve the right equations
            • Uncertainty quantification: how robust is model to unknowns?
        • Adopting good habits provides huge gains for minimal effort
