Tune up your data science process
Benjamin S. Skrainka
February 10, 2016
Benjamin S. Skrainka Tune up your data science process February 10, 2016 1 / 24
The correctness problem
A lot of (data) science is unscientific:
“My code runs, so the answer must be correct”
“It passed Explain Plan, so the answer is correct”
“This model is too complex to have a design document”
“It is impossible to unit test scientific code”
“The lift from the direct mail campaign is 10%”
Correctness matters
Bad (data) science:
Costs real money and can kill people
Will eventually damage your reputation and career
Could expose you to litigation
An issue of basic integrity and sleeping at night
Objectives
Today’s goals:
Introduce VV&UQ framework to evaluate correctness of scientific
models
Survey good habits to improve quality of your work
Verification, Validation, & Uncertainty Quantification
Introduction to VV&UQ
Verification, Validation, & Uncertainty Quantification provides an
epistemological framework to evaluate the correctness of scientific models:
Evidence of correctness should accompany any prediction
In absence of evidence, assume predictions are wrong
Popper: can only disprove or fail to disprove a model
VV&UQ is inductive whereas science is deductive
Reference: Verification and Validation in Scientific Computing by
Oberkampf & Roy
Definitions of VV&UQ
Definitions of terms (Oberkampf & Roy):
Verification:
“solving equations right”
I.e., code implements the model correctly
Validation:
“solving right equations”
I.e., model has high fidelity to reality
Definitions of VV&UQ vary depending on the source...
→ Most organizations do not even practice verification...
Definition of UQ
Definition of Uncertainty Quantification (Oberkampf & Roy):
Process of identifying, characterizing, and quantifying those
factors in an analysis which could affect accuracy of
computational results
Do your assumptions hold? When do they fail?
Does your model apply to the data/situation?
Where does your model break down? What are its limits?
Verification of code
Does your code implement the model correctly?
Unit test everything you can:
Scientific code can be unit tested
Test special cases
Test on cases with analytic solutions
Test on synthetic data
A unit test framework will set up and tear down fixtures
You should be able to recover known parameters from Monte Carlo data
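A minimal sketch of the parameter-recovery check above, assuming a linear model fit by least squares. The helpers `fit_ols` and `simulate` and the "true" coefficients are hypothetical and chosen only for illustration; the same pattern applies to any estimator you can simulate data for.

```python
import numpy as np


def fit_ols(X, y):
    """Estimate linear-regression coefficients by least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(beta, n_obs, noise_sd, seed):
    """Draw synthetic data from y = X @ beta + noise with known beta."""
    rng = np.random.default_rng(seed)
    regressors = rng.normal(size=(n_obs, len(beta) - 1))
    X = np.column_stack([np.ones(n_obs), regressors])
    y = X @ beta + rng.normal(scale=noise_sd, size=n_obs)
    return X, y


# Hypothetical "true" parameters chosen for the test
beta_true = np.array([1.0, -2.0, 0.5])

# Monte Carlo data: the estimate should land close to the truth
X, y = simulate(beta_true, n_obs=5000, noise_sd=0.1, seed=0)
beta_hat = fit_ols(X, y)

# Special case with an analytic solution: noiseless data is recovered exactly
X0, y0 = simulate(beta_true, n_obs=50, noise_sd=0.0, seed=1)
beta_exact = fit_ols(X0, y0)
```

In a real test suite these checks would live in test methods, with a tolerance set from the expected sampling error rather than picked by eye.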
Verification of SQL
Passing Explain Plan doesn’t mean your SQL is correct:
Garbage in, garbage out
Check a simple case you can compute by hand
Check join plan is correct
Check aggregate statistics
Check answer is compatible with reality
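One way to apply these checks is to run the query against a tiny data set whose answer can be computed by hand. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);
""")

# Query under test: revenue per region
query = """
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
revenue = dict(conn.execute(query).fetchall())

# Hand-computed answer for the tiny data set
expected = {"east": 12.0, "west": 3.0}

# Join check: the join must neither drop nor duplicate orders
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
n_joined = conn.execute(
    "SELECT COUNT(*) FROM orders AS o JOIN customers AS c ON c.id = o.customer_id"
).fetchone()[0]

# Aggregate check: regional revenue must add up to total order value
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Comparing `revenue` with `expected`, `n_joined` with `n_orders`, and the sum of `revenue` with `total` covers the hand-computed case, the join, and the aggregate statistics.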
Unit test
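import unittest  # unittest2 was the backport for older Python versions

import assignment as problems  # module under test


class TestAssignment(unittest.TestCase):
    def test_zero(self):
        # Compare the computed answer with the known correct value
        result = problems.question_zero()
        self.assertEqual(result, 9198)
    ...

if __name__ == '__main__':
    unittest.main()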
Validation of model
Check your model is a good (enough) representation of reality:
“All models are wrong but some are useful” – George Box
Run an experiment
Perform specification testing
Test assumptions hold
Beware of endogenous features
Approaches to experimentation
Many ways to test:
A/B test
Multi-armed bandit
Bayesian A/B test
Wald sequential analysis
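As an illustration of the simplest option, a classical A/B test on conversion rates can be sketched as a two-proportion z-test. The function name and the counts below are hypothetical, and the normal approximation assumes reasonably large samples in both arms.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both arms share one conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical counts: control converts 100 of 1000, treatment 130 of 1000
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

The sample size and stopping rule should be fixed before the experiment starts; checking the p-value repeatedly as data arrives is the problem that sequential analysis is designed to handle.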
Uncertainty quantification
There are many types of uncertainty which affect the robustness of your
model:
Parameter uncertainty
Structural uncertainty
Algorithmic uncertainty
Experimental uncertainty
Interpolation uncertainty
Classified as aleatoric (statistical) and epistemic (systematic)
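Parameter uncertainty due to sampling is the easiest of these to quantify. The sketch below uses a percentile bootstrap, which is one common technique and not one the slides prescribe; the function `bootstrap_ci` and the simulated sample are hypothetical. It says nothing about structural or algorithmic uncertainty.

```python
import numpy as np


def bootstrap_ci(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time
    estimates = np.array([
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)
    ])
    alpha = 1.0 - level
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Hypothetical sample: 500 draws with true mean 10 and standard deviation 2
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=500)
lower, upper = bootstrap_ci(sample, np.mean)
```

Passing a lower `level` (for example 0.80) returns a narrower interval from the same resamples, which is why the choice of confidence level should be reported alongside the estimate.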
Good habits
Act like a software engineer
Use best practices from software engineering:
Good design of code
Follow a sensible coding convention
Version control
Use same file structure for every project
Unit test
Use PEP8 or equivalent
Perform code reviews
Reproducible research
‘Document what you do and do what you document’:
Keep a journal!
Data provenance
How data was cleaned
Design document
Specification & requirements
Do you keep a journal? You should. Fermi taught me that. –
John A. Wheeler
Follow a workflow
Use a workflow like CRISP-DM:
1 Define business question and metric
2 Understand data
3 Prepare data
4 Build model
5 Evaluate
6 Deploy
Ensures you don’t forget any key steps
Automate your data pipeline
One-touch build of your application or paper:
Automate entire workflow from raw data to final result
Ensures you perform all steps
Ensures all steps are known – no one-off manual adjustments
Avoids stupid human errors
Auto generate all tables and figures
Saves time when handling new data, which always has subtle
changes in formatting
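A one-touch build can be as simple as a single entry point that chains every step. The sketch below is a minimal illustration using only the standard library; the stage names, the `value` column, and the demo file are all hypothetical, and larger projects often use a build tool such as Make for the same purpose.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def load(raw_path):
    """Read the raw CSV into a list of dict rows."""
    with open(raw_path, newline="") as f:
        return list(csv.DictReader(f))


def clean(rows):
    """Drop rows with a missing value and convert the rest to float."""
    return [float(r["value"]) for r in rows if r["value"].strip()]


def analyze(values):
    """Compute the summary statistics reported in the final table."""
    return {"n": len(values), "mean": statistics.mean(values)}


def write_table(result, out_path):
    """Regenerate the results table from scratch on every run."""
    Path(out_path).write_text(f"n,mean\n{result['n']},{result['mean']:.2f}\n")


def build(raw_path, out_path):
    """One-touch build: raw data in, final table out, no manual steps."""
    result = analyze(clean(load(raw_path)))
    write_table(result, out_path)
    return result


# Demo on a throwaway raw file (hypothetical data with one missing value)
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.csv"
raw_file.write_text("id,value\n1,2.0\n2,\n3,4.0\n")
result = build(raw_file, workdir / "table.csv")
```

Because every cleaning rule lives in code, rerunning `build` on a new data delivery repeats exactly the same steps, and any change in formatting shows up as a failure in one named stage.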
Write flexible code to handle data
Use constants/macros to access data fields:
Code will clearly show what data matters
Easier to understand code and data pipeline
Easier to debug data problems
Easier to handle changes in data formatting
Python example
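import numpy as np

# Setup indicators: named column indices instead of magic numbers
ix_gdp = 7
...
# Load & clean data
# genfromtxt returns a 2-D float array, so column slicing works
# (skip_header=1 assumes the file starts with one header row)
m_raw = np.genfromtxt('bea_gdp.csv', delimiter=',', skip_header=1)
gdp = m_raw[:, ix_gdp]
...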
Politics...
Often, there is political pressure to violate best practice:
Examples:
80% confidence intervals
Absurd attribution window
Two year forecast horizon but only three months of data
Hard to do the right thing against pressure from senior management
Recruit a high-level scientist to advocate
Particularly common with forecasting:
Often requested by management for CYA
Insist on a ‘panel of experts’ for impossible decisions
Conclusion
Need to raise the quality of data science:
VV&UQ provides a rigorous framework:
Verification: solve the equations right
Validation: solve the right equations
Uncertainty quantification: how robust is the model to unknowns?
Adopting good habits provides huge gains for minimal effort

Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
xclpvhuk
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
AlessioFois2
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
Social Samosa
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Aggregage
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 

Recently uploaded (20)

Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
06-12-2024-BudapestDataForum-BuildingReal-timePipelineswithFLaNK AIM
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
原版一比一利兹贝克特大学毕业证(LeedsBeckett毕业证书)如何办理
 
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
原版制作(unimelb毕业证书)墨尔本大学毕业证Offer一模一样
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
一比一原版(Unimelb毕业证书)墨尔本大学毕业证如何办理
 
A presentation that explain the Power BI Licensing
A presentation that explain the Power BI LicensingA presentation that explain the Power BI Licensing
A presentation that explain the Power BI Licensing
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
The Ipsos - AI - Monitor 2024 Report.pdf
The  Ipsos - AI - Monitor 2024 Report.pdfThe  Ipsos - AI - Monitor 2024 Report.pdf
The Ipsos - AI - Monitor 2024 Report.pdf
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 

Tune up your data science process

  • 1. Tune up your data science process Benjamin S. Skrainka February 10, 2016 Benjamin S. Skrainka Tune up your data science process February 10, 2016 1 / 24
  • 2. The correctness problem
    A lot of (data) science is unscientific:
      • “My code runs, so the answer must be correct”
      • “It passed Explain Plan, so the answer is correct”
      • “This model is too complex to have a design document”
      • “It is impossible to unit test scientific code”
      • “The lift from the direct mail campaign is 10%”
  • 3. Correctness matters
    Bad (data) science:
      • Costs real money and can kill people
      • Will eventually damage your reputation and career
      • Could expose you to litigation
      • An issue of basic integrity and sleeping at night
  • 4. Objectives
    Today’s goals:
      • Introduce VV&UQ framework to evaluate correctness of scientific models
      • Survey good habits to improve quality of your work
  • 5. Verification, Validation, & Uncertainty Quantification
  • 6. Introduction to VV&UQ
    Verification, Validation, & Uncertainty Quantification provides an epistemological framework to evaluate correctness of scientific models:
      • Evidence of correctness should accompany any prediction
      • In absence of evidence, assume predictions are wrong
      • Popper: can only disprove or fail to disprove a model
      • VV&UQ is inductive whereas science is deductive
    Reference: Verification and Validation in Scientific Computing by Oberkampf & Roy
  • 7. Definitions of VV&UQ
    Definitions of terms (Oberkampf & Roy):
      • Verification: “solving equations right”, i.e., code implements the model correctly
      • Validation: “solving right equations”, i.e., model has high fidelity to reality
    Definitions of VV&UQ will vary depending on source . . .
    → Most organizations do not even practice verification . . .
  • 8. Definition of UQ
    Definition of Uncertainty Quantification (Oberkampf & Roy): process of identifying, characterizing, and quantifying those factors in an analysis which could affect accuracy of computational results
      • Do your assumptions hold? When do they fail?
      • Does your model apply to the data/situation?
      • Where does your model break down? What are its limits?
  • 9. Verification of code
    Does your code implement the model correctly? Unit test everything you can:
      • Scientific code can be unit tested
      • Test special cases
      • Test on cases with analytic solutions
      • Test on synthetic data
      • A unit test framework will set up and tear down fixtures
      • Should be able to recover parameters from Monte Carlo data
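The last point on slide 9, recovering parameters from Monte Carlo data, can be sketched as a unit test. This is a minimal illustration and not code from the talk: the linear model, the helper names `simulate` and `fit_ols`, the seeds, and the tolerances are all assumptions chosen for the example.

```python
import random
import unittest


def simulate(n, intercept, slope, noise_sd, seed):
    """Draw synthetic (x, y) pairs from y = intercept + slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_ols(xs, ys):
    """Closed-form least squares for one regressor plus an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


class TestParameterRecovery(unittest.TestCase):
    def test_recovers_known_parameters(self):
        # Synthetic data with known truth: the estimator should land close to it.
        xs, ys = simulate(n=5000, intercept=2.0, slope=0.5, noise_sd=1.0, seed=42)
        intercept, slope = fit_ols(xs, ys)
        self.assertAlmostEqual(intercept, 2.0, delta=0.15)
        self.assertAlmostEqual(slope, 0.5, delta=0.03)

    def test_noise_free_case_is_exact(self):
        # Special case with an analytic answer: no noise, so the fit is exact.
        xs, ys = simulate(n=50, intercept=-1.0, slope=3.0, noise_sd=0.0, seed=0)
        intercept, slope = fit_ols(xs, ys)
        self.assertAlmostEqual(intercept, -1.0, places=9)
        self.assertAlmostEqual(slope, 3.0, places=9)
```

Saved as a module, the tests can be run with `python -m unittest`. The tolerances in the first test are several standard errors wide for this sample size, so a failure points to a bug in the estimator and not to sampling noise.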
  • 10. Verification of SQL
    Passing Explain Plan doesn’t mean your SQL is correct:
      • Garbage in, garbage out
      • Check a simple case you can compute by hand
      • Check join plan is correct
      • Check aggregate statistics
      • Check answer is compatible with reality
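The checks on slide 10 can be sketched with Python's built-in `sqlite3` module and a tiny in-memory database. The table names, columns, and values below are hypothetical and exist only so that the expected answer can be computed by hand.

```python
import sqlite3

# Tiny in-memory fixture (hypothetical schema) small enough to verify by hand.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'west');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5), (13, 3, 2.5);
""")

# The query under test: revenue and order count per region.
query = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
rows = conn.execute(query).fetchall()

# Simple case computed by hand: east = 20.0 + 5.0, west = 7.5 + 2.5.
expected = [('east', 2, 25.0), ('west', 2, 10.0)]
assert rows == expected

# Join check: the join must neither drop nor duplicate orders.
n_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
assert sum(row[1] for row in rows) == n_orders

# Aggregate check: grouped revenue must add up to the ungrouped total.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
assert abs(sum(row[2] for row in rows) - total) < 1e-9
```

The same pattern carries over to a production database: run the query against a small fixture whose answer is known, then compare row counts and totals before and after each join.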
  • 11. Unit test

        import unittest  # the original slide used the unittest2 backport

        import assignment as problems  # module under test


        class TestAssignment(unittest.TestCase):
            def test_zero(self):
                result = problems.question_zero()
                self.assertEqual(result, 9198)

            ...


        if __name__ == '__main__':
            unittest.main()
  • 12. Unit test
    [Figure 1; image not reproduced in this transcript]
  • 13. Validation of model
    Check your model is a good (enough) representation of reality:
      • “All models are wrong but some are useful” – George Box
      • Run an experiment
      • Perform specification testing
      • Test assumptions hold
      • Beware of endogenous features
  • 14. Approaches to experimentation
    Many ways to test:
      • A/B test
      • Multi-armed bandit
      • Bayesian A/B test
      • Wald sequential analysis
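As one concrete instance of the first item on slide 14, a classical A/B test on conversion rates can be sketched as a two-sided, two-proportion z-test using only the standard library. The function name and the conversion counts are illustrative assumptions, not figures from the talk.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: control converts 200 of 2000, treatment 260 of 2000.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The test assumes a sample size fixed in advance. Checking the p-value repeatedly while data arrives inflates the false-positive rate, which is the situation sequential methods such as Wald's are designed for.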
  • 15. Uncertainty quantification
    There are many types of uncertainty which affect the robustness of your model:
      • Parameter uncertainty
      • Structural uncertainty
      • Algorithmic uncertainty
      • Experimental uncertainty
      • Interpolation uncertainty
    Classified as aleatoric (statistical) and epistemic (systematic)
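Parameter uncertainty, the first item on slide 15, is often the easiest type to quantify. The sketch below uses a percentile bootstrap, which is one common technique and not necessarily the one the speaker had in mind. The function name, the synthetic sample, and the default settings are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)]) for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Synthetic sample (hypothetical): 200 draws from a normal with mean 10, sd 2.
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]

lo, hi = bootstrap_ci(sample, statistics.fmean)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

This captures only sampling (aleatoric) uncertainty in the estimate. Structural or epistemic uncertainty, such as a wrong functional form, is not reflected in the interval.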
  • 16. Good habits
  • 17. Act like a software engineer
    Use best practices from software engineering:
      • Good design of code
      • Follow a sensible coding convention
      • Version control
      • Use same file structure for every project
      • Unit test
      • Use PEP8 or equivalent
      • Perform code reviews
  • 18. Reproducible research
    ‘Document what you do and do what you document’:
      • Keep a journal!
      • Data provenance
      • How data was cleaned
      • Design document
      • Specification & requirements
    “Do you keep a journal? You should. Fermi taught me that.” – John A. Wheeler
  • 19. Follow a workflow
    Use a workflow like CRISP-DM:
      1. Define business question and metric
      2. Understand data
      3. Prepare data
      4. Build model
      5. Evaluate
      6. Deploy
    Ensures you don’t forget any key steps
  • 20. Automate your data pipeline
    One-touch build of your application or paper:
      • Automate entire workflow from raw data to final result
      • Ensures you perform all steps
      • Ensures all steps are known – no one-off manual adjustments
      • Avoids stupid human errors
      • Auto-generate all tables and figures
      • Saves time when handling new data . . . which always has subtle changes in formatting
  • 21. Write flexible code to handle data
    Use constants/macros to access data fields:
      • Code will clearly show what data matters
      • Easier to understand code and data pipeline
      • Easier to debug data problems
      • Easier to handle changes in data formatting
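  • 22. Python example

        import numpy as np

        # Setup indicators: named constants for column positions
        ix_gdp = 7
        ...

        # Load & clean data. The slide used np.recfromcsv, which returns a 1-D
        # record array and cannot be indexed with [:, ix_gdp]; genfromtxt is used
        # here on the assumption of a numeric CSV with one header row.
        m_raw = np.genfromtxt('bea_gdp.csv', delimiter=',', skip_header=1)
        gdp = m_raw[:, ix_gdp]
        ...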
  • 23. Politics . . .
    Often, there is political pressure to violate best practice:
      • Examples:
          • 80% confidence intervals
          • Absurd attribution window
          • Two year forecast horizon but only three months of data
      • Hard to do right thing vs. senior management
      • Recruit a high-level scientist to advocate
      • Particularly common with forecasting:
          • Often requested by management for CYA
          • Insist on a ‘panel of experts’ for impossible decisions
  • 24. Conclusion
    Need to raise the quality of data science:
      • VV&UQ provides rigorous framework:
          • Verification: solve the equations right
          • Validation: solve the right equations
          • Uncertainty quantification: how robust is model to unknowns?
      • Adopting good habits provides huge gains for minimal effort