Data versioning in machine learning projects. How is data science different from software engineering? Is there a methodology mismatch? How to use dvc.org to version data and experiments.
DVC: O'Reilly Artificial Intelligence Conference 2019 - New York - Dmitry Petrov
ML model and dataset versioning is an essential first step toward establishing a good process. The speaker explores open-source tools for versioning ML models and datasets, from traditional Git and Git-LFS to the ML-specific tool Data Version Control (DVC.org).
DVC - Git-like Data Version Control for Machine Learning projects - Francesco Casalegno
DVC is an open-source tool for versioning datasets, artifacts, and models in Machine Learning projects.
This powerful tool gives you an intuitive Git-like interface to seamlessly:
1. track dataset version updates
2. build reproducible and shareable machine learning pipelines (e.g. model training)
3. compare model performance scores
4. integrate your data and model versioning with Git
5. deploy the desired version of your trained models
Using a version control system (VCS) such as Git is an established software engineering practice, but it is challenging for machine learning (ML) projects. Artifacts produced by ML pipelines, such as datasets, pre-processed data, and trained models, are often large. Once generated, they have to be stored on disk, since reproducing them over and over is expensive. Unfortunately, traditional VCSs have restrictions on handling such large artifacts, and forgoing version control makes results hard to reproduce.
DVC (Data Version Control) not only version-controls large artifacts but also keeps track of the commands that are run to produce them. It detects changes made to the input data and knows which steps in the pipeline have to be rerun to keep the final result up to date. By adopting DVC, the machine learning community can take a big step toward reproducible research.
StarWest 2019 - End to end testing: Stupid or Legit? - mabl
Automating end-to-end tests is a tricky business. Over the years, many leading practitioners have advised against doing ANY end-to-end test automation. As a result, we often see a test automation triangle that recommends a 70/20/10% split between unit, integration, and end-to-end tests, to balance the cost of troubleshooting test failures against the length of feedback cycles at each level.
But it's 2019… many of those risks don't really exist anymore, and complete end-to-end tests are the only thing that brings together all components of your app while focusing on the true end-user functionality and experience. The test automation triangle is changing shape.
Learning outcomes:
- Key pain points of end-to-end test automation and the modern technologies that have eliminated those pains
- The unique benefits of end-to-end test automation
- Tips to bring home for getting started with end-to-end test automation
The quality of data-powered applications depends not only on code but also on the collected data and the models trained on that data. This renders traditional quality assurance inadequate. We will look into our toolbox for more holistic tactics that bridge the gap between code quality assurance and data quality assurance.
Start with version control and experiments management in machine learning - Mikhail Rozhkov
How do you manage the complexity and reproducibility of machine learning projects? What are the requirements and tools, and how do you apply them in your company and projects? Let's start with data and model version control! A review of Data Version Control (DVC), MLflow, and other tools.
Video and slides synchronized; mp3 and slide download available at http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms, and analyses, how they set up and keep schemas in sync between Hive, Presto, Redshift, and Spark, and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Thinking DevOps in the era of the Cloud - Demi Ben-Ari
The lines between Development and Operations people have gotten blurry, and many skills need to be held by both sides.
In this talk we'll cover the considerations that go into creating development and production environments, touching on Continuous Integration, Continuous Deployment, and the buzzword "DevOps", as well as some real implementations in the industry.
And of course, we can't leave out the real enabler of the whole deal, "The Cloud", which gives us a tool set that makes life much easier when implementing all of these practices.
Details:
• DevOps and Business Intelligence?
• CI/CD Pipelines: What are they?
• Database Deployments: State based vs Migration based
• Snowflake features for CI/CD
• Azure DevOps: Build and Release Pipelines
• Putting it all together: End to End solution
• Demo
Data Science in Production: Technologies That Drive Adoption of Data Science ... - Nir Yungster
Critical to a data science team’s ability to drive impact is its effectiveness in incorporating its solutions into new or existing products. When collaborating with other engineering teams, and especially when solutions must operate at scale, technological choices can be critical factors in determining what type of outcome you'll have. We walk through strategies and specific technologies - Airflow, Docker, Kubernetes - that can help promote successful collaboration between data science and engineering.
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl... - Athens Big Data
Title: MLOps Workshop: The Full ML Lifecycle - How to Use ML in Production
Speakers: Spyros Cavadias (https://www.linkedin.com/in/spyros-cavadias/), Konstantinos Pittas (https://www.linkedin.com/in/konstantinos-pittas-83310270/), Thanos Gkinakos (https://www.linkedin.com/in/thanos-gkinakos-03582a128/)
Date: Saturday, December 17, 2022
Event: https://www.meetup.com/athens-big-data/events/289927468/
Docs-as-Code: Evolving the API Documentation Experience - Pronovix
We are a software engineering team creating API docs. Docs are authored using Instructional Design principles to narrate use-cases and practical API implementations. This talk shares why & how we've applied software development practices to evolve our document tooling, creation, & delivery methods.
Our APIs describe asynchronous protocols used for embedded software (firmware) components in a digital 2-way radio communications system. The API is protocol data unit (PDU) based and its definition is described in a proprietary format; consequently, well-known API formats, such as Swagger/OpenAPI, or tools, such as doxygen, are not used.
Our product training and technical writing teams are very experienced in Instructional Design methods, but these teams have only written documentation for an end-user audience. Understanding software development processes is equally important as understanding two-way radio networks in order to successfully integrate with the APIs. This is the rationale for having a software engineering team develop the skillsets to write API documentation for a developer audience.
With a solid foundation of API documentation in place, regular examination of engineering efficiency and developer experience is appropriate. Repeated actions can be replaced by automation. Content can be modular and re-usable. Formats can be streamlined for easier consumption. Docs can be made portable and lightweight for faster delivery.
Delivery Pipelines as a First Class Citizen @deliverAgile2019 - ciberkleid
In this talk, we will cover important elements for successful CI and CD. We will discuss how these elements make CI and CD much simpler, and hence more attainable. We will cover some best practices / recommendations to include in your application pipelines. We will look at a sample implementation of a pipeline leveraging modern tools. Finally, we will discuss some forthcoming ideas for making it even easier to declaratively enable CI and CD for applications.
OSMC 2023 | What's new with Grafana Labs's Open Source Observability stack by... - NETWAYS
Open source is at the heart of what we do at Grafana Labs, and there is so much happening! The intent of this talk is to update everyone on the latest developments in Grafana, Pyroscope, Faro, Loki, Mimir, Tempo, and more. Everyone has at least heard of Grafana, but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Besides covering what is new, we will also quickly introduce each project during this talk.
Presentation given on the 15th July 2021 at the Airflow Summit 2021
Conference website: https://airflowsummit.org/sessions/2021/clearing-airflow-obstructions/
Recording: https://www.crowdcast.io/e/airflowsummit2021/40
DataOps requires a cultural shift that brings the principles of lean manufacturing and DevOps to data analytics. It breaks down silos between developers, data scientists, and operators, resulting in rapid cycle times and low error rates.
At Spotify in 2013, the concept of DataOps did not exist but the Swedish company needed a way to align the people, processes, and technologies of the data organization to accelerate the development of high-quality analytics. The result was a Swedish-style DataOps, influenced by Scandinavian culture and agile principles, that enabled the company to become a true data-driven leader.
Presented by Anisha Swain & Riya, Associate Software Engineer, Red Hat, as part of the PyCloud mini conference on 30th May.
This talk highlights the use of the Pbench tool to solve the hectic task of collecting benchmark data, and explains how to make the best use of resources while running applications at scale. It will benefit people who are looking for a benchmarking and performance analysis solution with better consistency.
Data science calls for rapid experimentation and building intuitions from the data. Yet data science also underpins crucial decisions and operational logic. Writing production-ready, robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turnaround and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, which enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Last Conference 2017: Big Data in a Production Environment: Lessons Learnt - Mark Grebler
Presentation at the 2017 LAST (Lean, Agile, Systems Thinking) Conference.
A presentation about the challenges involved in building a production Big Data system used directly by customers.
Scaling Ride-Hailing with Machine Learning on MLflow - Databricks
GOJEK, the Southeast Asian super-app, has seen explosive growth in both users and data over the past three years. Today the technology startup uses big-data-powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products: from selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production-grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long-running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped solve many of the problems we faced while scaling machine learning at GOJEK.
4. Overview
1. Data science workflow - an abstraction
2. What is important?
3. Some tools and best practices
a. Setting up a workspace
b. Managing environments
c. Structuring experimental work
d. Refining the pipeline
e. Annotating data
f. Approaches to deployment
g. Documentation best practices
5. Let’s imagine ...
You’re about to start a new project.
● What should you think about before starting?
● How do you organize your workflow?
● Where should your data and code go?
● How do you prioritize experimentation, documentation, clean code, and reproducibility, and still have reasonable timelines?
6. The workflow, abstracted
[Diagram: the basic pipeline (Data Access → Data Processing / Feature Creation → Modeling → Predictions & Reporting) and the process-refinement loop (Exploratory Analysis → Experiments → Production)]
7. What’s important?
Criteria for a good workflow
● Reproducible
● Easy
○ to add / remove features
○ to switch pipeline parts
○ to deploy
○ to come back to 6 months later
[Diagram: the basic pipeline and process-refinement loop, repeated from slide 6]
8. What trade-offs are there?
● Fast to develop / iterate
● Clean code
● Fast execution
● Well-documented
● Scalable
9. Setting up a workspace
● Standardized directory structure
● Promotes good development practice
○ Separate exploration, pipeline, and reporting
● Low overhead to get going
10. Setting up a workspace
cookiecutter-data-science
● http://drivendata.github.io/cookiecutter-data-science
● Based on the cookiecutter package
○ https://cookiecutter.readthedocs.io/en/latest/readme.html
● Standardized directory structure that works well out of the box
● README.md that documents structure
● Standard .gitignore file is set up
● Structure is set up to be pip install-able
● Set-up for Sphinx
● Set-up for tox (standardize testing)
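The resulting layout can also be sketched by hand; here is a minimal skeleton in the spirit of what cookiecutter-data-science generates (the project name and this subset of directories are illustrative, not the full template):

```shell
# Hand-made workspace skeleton (the real template generates a richer
# structure automatically, plus README, .gitignore, Sphinx, and tox config).
mkdir -p my_project/data/raw         # original, immutable data dumps
mkdir -p my_project/data/processed   # final datasets for modeling
mkdir -p my_project/notebooks        # exploratory notebooks
mkdir -p my_project/src              # pip-installable project code
mkdir -p my_project/reports          # generated analysis and figures
touch my_project/README.md           # documents the structure
```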
14. Managing Environments
Build from the environment up
● Package managers
○ virtualenv or conda -> requirements.txt
○ docker or vagrant -> dockerfile / vagrant file
● Keep secrets secret
○ .env and .conf files
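As a concrete sketch using virtualenv (file names follow the conventions above; the secret is obviously made up):

```shell
# Pin the environment so others can rebuild it exactly
python3 -m venv .venv                    # isolated interpreter per project
.venv/bin/pip freeze > requirements.txt  # record exact package versions

# Keep secrets out of version control
echo "DB_PASSWORD=changeme" > .env       # hypothetical secret
echo ".env" >> .gitignore                # never commit it
```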
15. Organizing experimental work
Notebooks are for exploration
● Number notebooks for ordering
● Add dates to the top of notebooks to help track when changes happen
● For collaboration, add author initials
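For example, a naming scheme along these lines (file names purely illustrative) keeps notebooks sorted in working order and attributable:

```shell
# <order>-<author initials>-<date>-<topic>.ipynb
touch 01-dk-2019-05-01-data-exploration.ipynb
touch 02-dk-2019-05-07-feature-engineering.ipynb
touch 03-mr-2019-05-12-baseline-model.ipynb
ls *.ipynb   # alphabetical order is also chronological order
```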
16. Exploration to Experiments to Production
● Natural refinement of the code
○ Notebooks
○ Functions
○ Classes
○ Packages
○ Python scripts
● Other considerations
○ Memory management
■ Data streaming
■ Training in batches
○ Moving between platforms
○ Managing metrics and reporting
[Diagram: the Exploratory Analysis → Experiments → Production refinement loop, repeated from slide 6]
19. Refining the pipeline
Version control and reproducibility
● What about git?
● How about git-LFS?
● What about DVC - Data Version Control?
20. DVC
Data science process is a DAG
● DVC keeps track of code, dependencies, and outputs, allowing any step to be reproduced if there are upstream changes
Separate code storage from data and models
● DVC integrates with Git
○ Git stores the code and the .dvc files (which store the graph)
○ DVC remotes store the data and the models
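Schematically, each node of the DAG is described by a small .dvc stage file that lives in Git, while the heavy inputs and outputs live in a DVC remote. A sketch of such a file (layout in the spirit of DVC 0.x stage files; all file names are hypothetical):

```yaml
# sample.dvc - one step of the pipeline DAG
cmd: python cmd.py input.data output.data metrics.json
deps:
- path: cmd.py        # rerun this step if the code changes...
- path: input.data    # ...or if the input data changes
outs:
- path: output.data   # cached by DVC, referenced from Git by hash
metrics:
- path: metrics.json  # small enough to compare across experiments
```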
22. Working with DVC
Initialize
dvc init
Configure
dvc remote
Add files to be tracked by DVC
dvc add
Store/retrieve data
dvc push / dvc pull
23. Working with DVC
Define steps in the DAG
dvc run -f sample.dvc -d cmd.py -d input.data \
    -M metrics.json -o output.data \
    python cmd.py input.data output.data metrics.json
-f: dvc file to store information in
-d: any dependencies
-M: metric files
-o: output files
27. Getting to production
So you have a model …
and the metrics look good …
Now what?
● Human review of results
● Figuring out how to use it in production
29. Approaches to deployment
One-time: run a model once; store results in a table
● Simple structure
● Predictions can become dated
● Requires manual updates
Batch: run new predictions regularly; store results in a table
● Loose integration with production
● Little engineering effort required
● Predictions are somewhat up-to-date
API: run predictions in real-time
● Realtime predictions
● More engineering and reliability testing required
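For the batch approach, the scheduling layer can be as simple as a cron entry; a sketch (script path, schedule, and output location are all hypothetical):

```shell
# Write a crontab fragment that refreshes predictions nightly at 02:00
echo '0 2 * * * /opt/project/predict.py --out /var/data/predictions.csv' > batch-predict.cron
# Activating it would be: crontab batch-predict.cron
```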
30. Documentation
Layers of documentation
● Code comments
● Daily notes / Working notes
● Checkpoints/Summaries
● Code books / how to run
● Project Summary
● Index
31. How to put this into practice?
● Checklists
● Routines
● Sticky-note method