PyData Berlin 2018: dvc.org

Data versioning in machine learning projects
Dmitry Petrov
dmitry@iterative.ai
PyData Berlin 2018

Agenda
1. What makes ML process special?
2. Data files
3. ML pipelines and reproducibility
4. Data science workflow
5. Beyond the Horizon

Hello
Dmitry Petrov
Twitter: @FullStackML
PhD in Computer Science
Co-Founder & CEO at Iterative.AI. San Francisco, USA
ex-Data Scientist at Microsoft (BingAds). Seattle, USA
ex-Head of Lab at St. Petersburg Electrotechnical University, Russia

Chapter I
What makes ML process special?

Data files hell problem
1. Data files are not in your repository.
2. Tons of data file versions:
- model.pkl
- model_L7_e120.pkl
- model_vgg16_L5tune_e120.pkl
- model_L7_e160_cleansed.pkl
- model_vgg16_L45tune_e120.pkl
- model_vgg16_L45tune_e160_noempty.pkl
- …
3. Data files are not connected to code files.
- $ git checkout finetune_head # creates even more mess

Data files hell in a team
1. How to create a reproducible ML project?
2. How to scale the ML process in a team:
a. feature extraction
b. tuning the current model
c. experimenting with new models
3. How to hand an ML model over to DevOps for deployment, or revert a model?

Methodology mismatch
“Data science as different from software as software was
different from hardware”
https://dominodatalab.wistia.com/medias/fq0l4152sh
Agile development methodology should cover data science.
              Hardware    Software      Data science/ML
Methodology   Waterfall   Agile/Scrum   Agile/?

What is special about Data Science?
New artifacts to manage:
● Experiment: Code + Data files.
● Metrics.
● ML pipelines and reproducibility.
Different process:
● R&D-like. A lot of trial and error; progress should be measured in a different way.
● Ephemerality. Hard to communicate and track the progress.
* The image from: https://www.customsigns.com/experiment-fail-learn-repeat-poster-sign-18-x-24

DVC project motivation
Open source tool Data Version Control to manage ML projects:
http://dvc.org
DVC is a data science platform on top of an open source stack.
GitHub repo: https://github.com/iterative/dvc
Download binaries (Mac, Linux, Windows) or $ pip install dvc
It extends Git with the commands: dvc add, dvc run, dvc repro, dvc remote

What does DVC do?
● Experiment as a commit/branch: Code + Data files.
● Large data files:
○ Local cache.
○ Optimized for 1 GB - 100 GB file sizes.
○ Data remotes: S3, GCP, SSH.
● Metrics per experiment.
● ML pipelines.
● Reproducibility.

Existing solutions
                                | Git-LFS                                       | What is required
A single file size              | < 2 GB                                        | 1 GB - 100 GB
Workspace size (all files)      | Slow if 5 GB+                                 | Unlimited
No garbage collector for data   | 20 experiments by 5 GB each ~= 100 GB         | Remove data files from some of the experiments
Data storage                    | Proprietary and paid: only GitHub and GitLab. | S3, GCP or custom server (rsync, SFTP)

DVC: checkout and optimization
Optimizations:
1. No data file copying: hardlinks are used instead of copies.
2. Checksum caching and timestamp tracking.
3. Supports reflinks (CoW - Copy on Write) in modern file systems: BTRFS, ReFS, XFS.
As a result: a 100 GB data file checkout works instantaneously.
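The hardlink idea can be illustrated with a minimal sketch. This is not DVC's actual implementation: the function names (md5_of, add_to_cache, checkout, demo) and the flat cache layout are assumptions made for illustration. It shows why switching versions is cheap: a file is stored once under its checksum, and the workspace file is only a second name for the same data.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path, cache_dir):
    """Store a file in the cache under its checksum and hardlink it back."""
    checksum = md5_of(path)
    cached = os.path.join(cache_dir, checksum)
    if not os.path.exists(cached):
        os.replace(path, cached)  # a rename on the same file system, no copy
    else:
        os.remove(path)  # identical content is already cached
    os.link(cached, path)  # workspace file becomes a second name for the data
    return checksum


def checkout(checksum, path, cache_dir):
    """Point `path` at the cached version with the given checksum."""
    if os.path.exists(path):
        os.remove(path)
    os.link(os.path.join(cache_dir, checksum), path)


def demo():
    """Create two versions of a hypothetical model.pkl and switch back to the first."""
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache")
        os.mkdir(cache)
        model = os.path.join(root, "model.pkl")

        with open(model, "wb") as f:
            f.write(b"weights v1")
        v1 = add_to_cache(model, cache)

        os.remove(model)  # drop the link first so the cached v1 is not overwritten
        with open(model, "wb") as f:
            f.write(b"weights v2")
        v2 = add_to_cache(model, cache)

        checkout(v1, model, cache)  # no bytes are copied here
        with open(model, "rb") as f:
            restored = f.read()
        linked = os.path.samefile(model, os.path.join(cache, v1))
        return v1, v2, restored, linked
```

Because a hardlinked workspace file shares its data with the cache, writing to it in place would also change the cached version; the sketch removes the link before writing a new version for that reason.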

Chapter III
ML pipelines and reproducibility

A simple pipeline
Pipeline: images.zip → images/ → model.p → plots.jpg
Specify: input (-d), output (-o) and command.
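As a rough illustration of the three stages, the sketch below only builds the command lines and prints them; it does not execute anything. The -d and -o flags and the dvc run command come from the talk, while the helper dvc_run, the scripts train.py and plot.py, and the exact shell commands are hypothetical.

```python
import shlex


def dvc_run(deps, outs, command):
    """Build the argv for one pipeline stage: dvc run -d <dep> -o <out> <command>."""
    argv = ["dvc", "run"]
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    return argv + shlex.split(command)


# images.zip -> images/ -> model.p -> plots.jpg
stages = [
    dvc_run(["images.zip"], ["images/"], "unzip -q images.zip"),
    dvc_run(["images/", "train.py"], ["model.p"], "python train.py images/ model.p"),
    dvc_run(["model.p", "plot.py"], ["plots.jpg"], "python plot.py model.p plots.jpg"),
]

for argv in stages:
    print(shlex.join(argv))
```

Listing the scripts themselves as dependencies means a code change, not only a data change, marks the stage as outdated.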

Reproducibility
DVC reproduces an ML pipeline in a single command (dvc repro).
Any DAG (directed acyclic graph) is supported.
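How a DAG of stages can be re-run selectively is sketched below. This is an illustration under assumptions, not DVC's code: the stage names, the scripts train.py and plot.py, and the functions stage_order and stages_to_rerun are hypothetical, and changed files are passed in explicitly instead of being detected by checksum. It requires Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter

# Each stage maps to (dependencies, outputs).
STAGES = {
    "unzip": (["images.zip"], ["images/"]),
    "train": (["images/", "train.py"], ["model.p"]),
    "plot": (["model.p", "plot.py"], ["plots.jpg"]),
}


def stage_order(stages):
    """Return stage names so that every stage comes after the stages it depends on."""
    producer = {out: name for name, (_, outs) in stages.items() for out in outs}
    graph = {
        name: {producer[dep] for dep in deps if dep in producer}
        for name, (deps, _) in stages.items()
    }
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by the changed files, in execution order."""
    dirty = set(changed_files)
    rerun = []
    for name in stage_order(stages):
        deps, outs = stages[name]
        if dirty.intersection(deps):
            rerun.append(name)
            dirty.update(outs)  # outputs of a re-run stage count as changed
    return rerun


print(stages_to_rerun(STAGES, ["train.py"]))  # ['train', 'plot']
```

Only the stages downstream of a change are repeated, so editing train.py leaves the unzip stage untouched.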

Chapter IV
Data science workflow

Workflow change is needed
Methodology → Workflows → Tools
Git is flexible: you can define your workflow.
[Diagram: a new_feature branch off master]

Git workflows: from software to ML
Gitflow: feature driven
[Diagram: a new_feature branch off master]
Data science flow: metrics driven
[Diagram: experiment branches increase_beta, tune_L4 and alpha_change off master, with commits labeled by metric values from .721 to .832]

Data science flow: why new workflow?
[Diagram: experiment branches increase_beta, tune_L4 and alpha_change off master, with commits labeled by metric values]
Different people can work on different ideas.
Collaborate without waiting.

Data science flow: hints
[Diagram: experiment branches increase_beta, tune_L4 and alpha_change off master, with commits labeled by metric values]
1. Do not forget to create branches:
$ git checkout master
$ git checkout -b alpha_change master
2. Keep failed experiments:
$ git checkout alpha_change
$ git push origin alpha_change
3. Clean up unimportant experiments.

Special DVC scenarios
1. Tracking data files - like Git-LFS but with an S3/GCP/SSH backend.
2. ML model deployment tool.
3. Experimentation on HDFS/Apache Spark.

When do you need DVC?
DVC is a data science platform on top of an open source stack.
It uses some ideas from existing data science platforms but uses an open source stack and Git as a foundation.
Data science platforms help teams (3+ members) create ML projects.

Thank you!
Questions
Twitter: @FullStackML
Email: dmitry@iterative.ai
Discuss: discuss.dvc.org
Actions
Visit dvc.org
Star github.com/iterative/dvc
