Data versioning in machine learning projects. How is data science different from software engineering? Is there a methodology mismatch? How to use dvc.org to version data and experiments.
2. Agenda
1. What makes ML process special?
2. Data files
3. ML pipelines and reproducibility
4. Data science workflow
5. Beyond the Horizon
3. Hello
Dmitry Petrov
Twitter: @FullStackML
PhD in Computer Science
Co-Founder & CEO at Iterative.AI. San Francisco, USA
ex-Data Scientist at Microsoft (BingAds). Seattle, USA
ex-Head of Lab at St. Petersburg Electrotechnical University, Russia
5. Data files hell problem
1. Data files are not in your repository.
2. Tons of data file versions:
- model.pkl
- model_L7_e120.pkl
- model_vgg16_L5tune_e120.pkl
- model_L7_e160_cleansed.pkl
- model_vgg16_L45tune_e120.pkl
- model_vgg16_L45tune_e160_noempty.pkl
- …
3. Data files are not connected to code files.
- $ git checkout finetune_head # creates even more mess
6. Data files hell in a team
1. How to create a reproducible ML project?
2. How to scale the ML process in a team:
a. feature extraction
b. tuning the current model
c. experimenting with new models
3. How to hand an ML model over to deployment (DevOps) or roll it back?
7. Methodology mismatch
“Data science as different from software as software was
different from hardware”
https://dominodatalab.wistia.com/medias/fq0l4152sh
Agile development methodology should cover data science.
              Hardware    Software       Data science/ML
Methodology   Waterfall   Agile/Scrum    Agile/?
8. What is special about Data Science?
New artifacts to manage:
● Experiment: Code + Data files.
● Metrics.
● ML pipelines and reproducibility.
Different process:
● R&D-like. A lot of trial and error; progress has to be measured differently.
● Ephemerality. Progress is hard to communicate and track.
* The image from: https://www.customsigns.com/experiment-fail-learn-repeat-poster-sign-18-x-24
9. DVC project motivation
Data Version Control (DVC) is an open source tool for managing ML projects:
http://dvc.org
DVC is a data science platform built on top of an open source stack.
GitHub repo: https://github.com/iterative/dvc
Download binaries (Mac, Linux, Windows) or $ pip install dvc
It extends Git with commands: dvc add, dvc run, dvc repro, dvc remote
10. What does DVC do?
● Experiment as a commit/branch: Code + Data files.
● Large data files:
○ Local cache.
○ Optimized for 1 GB - 100 GB file size.
○ Data remotes: S3, GCP, SSH.
● Metrics per experiment.
● ML pipelines.
● Reproducibility.
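The pipeline and reproducibility bullets can be illustrated with a minimal Python sketch. It is not DVC's implementation: the names `Stage`, `file_checksum` and `repro` are hypothetical, SHA-256 is an arbitrary choice of checksum, and the sketch assumes a stage only needs to run again when the checksum of one of its dependency files has changed since the last run.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def file_checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class Stage:
    """Hypothetical pipeline stage: dependency files plus a command to run."""
    deps: List[str]
    command: Callable[[], None]
    # Checksums of the dependencies recorded at the last run.
    recorded: Dict[str, str] = field(default_factory=dict)


def repro(stage: Stage) -> bool:
    """Rerun the stage only if a dependency changed. Returns True if it ran."""
    current = {dep: file_checksum(dep) for dep in stage.deps}
    if current == stage.recorded:
        return False  # nothing changed, the previous output is still valid
    stage.command()
    stage.recorded = current
    return True


def demo() -> List[bool]:
    """Run a toy stage three times: first run, unchanged rerun, changed data."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "data.csv")
        with open(data, "w") as f:
            f.write("x,y\n1,2\n")
        runs: List[str] = []
        stage = Stage(deps=[data], command=lambda: runs.append("train"))
        results = [repro(stage), repro(stage)]
        with open(data, "a") as f:
            f.write("3,4\n")  # modify the dependency
        results.append(repro(stage))
        return results
```

Calling `demo()` runs the stage on the first call, skips it while the data file is unchanged, and runs it again after the file is modified.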
12. Existing solutions
                              Git-LFS                          What is required
Single file size              < 2 GB                           1 GB - 100 GB
Workspace size (all files)    Slow at 5 GB+                    Unlimited
Garbage collection for data   None: 20 experiments of 5 GB     Remove data files from
                              each ~= 100 GB                   some experiments
Data storage                  Proprietary and paid: only       S3, GCP or custom server
                              GitHub and GitLab                (rsync, SFTP)
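The garbage collection row can be made concrete with a short Python sketch. It is only an illustration under stated assumptions, not how DVC or Git-LFS work internally: the cache layout (checksum mapped to size in GB), the experiment names and the function names are all hypothetical, and every experiment is assumed to own one distinct 5 GB file.

```python
from typing import Dict, Set


def unreferenced(cache: Dict[str, int], experiments: Dict[str, Set[str]],
                 keep: Set[str]) -> Set[str]:
    """Return cached checksums that no kept experiment references.

    cache maps checksum -> file size in GB (hypothetical layout);
    experiments maps experiment name -> checksums of its data files.
    """
    live: Set[str] = set()
    for name in keep:
        live |= experiments[name]
    return set(cache) - live


def freed_gb(cache: Dict[str, int], experiments: Dict[str, Set[str]],
             keep: Set[str]) -> int:
    """Total size in GB that collecting the unreferenced files would free."""
    return sum(cache[c] for c in unreferenced(cache, experiments, keep))


# 20 experiments, each with its own 5 GB data file: 100 GB in the cache.
experiments = {f"exp{i:02d}": {f"sum{i:02d}"} for i in range(20)}
cache = {f"sum{i:02d}": 5 for i in range(20)}
total_gb = sum(cache.values())

# Keeping only three experiments leaves 17 files to collect.
keep = {"exp00", "exp07", "exp19"}
reclaimable_gb = freed_gb(cache, experiments, keep)
```

With these assumed numbers the cache holds 20 x 5 GB = 100 GB, and dropping all but three experiments would free 17 x 5 GB = 85 GB.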
17. DVC: checkout and optimization
Optimizations:
1. No data file copying: hardlinks are created instead of copies.
2. Checksum caching and timestamp tracking.
3. Supports reflinks (CoW, copy-on-write) on modern file systems: BTRFS,
ReFS, XFS.
As a result, checking out a 100 GB data file is effectively instantaneous.
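The first two optimizations can be sketched in Python. This is a simplified illustration, not DVC's code: the cache directory layout, the in-memory checksum table and the function names `checksum`, `add` and `checkout` are assumptions, and SHA-256 stands in for whatever checksum the tool uses. The sketch also assumes the cache and the workspace are on the same file system, which hardlinks require.

```python
import hashlib
import os
import tempfile
from typing import Dict, Tuple

# path -> ((mtime_ns, size), checksum): hypothetical in-memory checksum cache.
_checksums: Dict[str, Tuple[Tuple[int, int], str]] = {}
hash_calls = 0  # counts how often a file is actually read and hashed


def checksum(path: str) -> str:
    """Hash a file, reusing the cached value while mtime and size are unchanged."""
    global hash_calls
    st = os.stat(path)
    stamp = (st.st_mtime_ns, st.st_size)
    cached = _checksums.get(path)
    if cached and cached[0] == stamp:
        return cached[1]
    hash_calls += 1
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    _checksums[path] = (stamp, digest.hexdigest())
    return digest.hexdigest()


def add(path: str, cache_dir: str) -> str:
    """Store a file in the cache under its checksum via a hardlink, no copy."""
    key = checksum(path)
    target = os.path.join(cache_dir, key)
    if not os.path.exists(target):
        os.link(path, target)
    return key


def checkout(key: str, cache_dir: str, path: str) -> None:
    """Place a cached file in the workspace by hardlinking it."""
    if os.path.exists(path):
        os.remove(path)
    os.link(os.path.join(cache_dir, key), path)


def demo() -> Tuple[bool, int]:
    """Add a file, check it out again, and report (same inode?, files hashed)."""
    before = hash_calls
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = os.path.join(tmp, "cache")
        os.mkdir(cache_dir)
        model = os.path.join(tmp, "model.pkl")
        with open(model, "wb") as f:
            f.write(b"weights" * 1000)
        key = add(model, cache_dir)
        checksum(model)  # second call is answered from the checksum cache
        os.remove(model)
        checkout(key, cache_dir, model)
        cached = os.path.join(cache_dir, key)
        same_inode = os.stat(model).st_ino == os.stat(cached).st_ino
        return same_inode, hash_calls - before
```

Because a hardlink is only a second name for the same data, checkout cost does not depend on file size. The trade-off is that editing the workspace file in place also changes the cached copy; reflinks avoid this where the file system supports them.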
22. Workflow change is needed
Methodology → Workflows → Tools
Git is flexible: you can define your workflow.
[Diagram: Git branches master and new_feature]
23. Git workflows: from software to ML
Gitflow: feature driven
[Diagram: branches master and new_feature]
Data science flow: metrics driven
[Diagram: branches master, increase_beta, tune_L4 and alpha_change; commits are labeled with metric values from .721 to .832]
24. Data science flow: why new workflow?
Different people can work on different ideas.
Collaborate without waiting.
25. Data science flow: hints
1. Do not forget to create branches:
$ git checkout master
$ git checkout -b alpha_change master
2. Keep failed experiments:
$ git checkout alpha_change
$ git push origin alpha_change  # assumes the remote is named origin
3. Clean up unimportant experiments.
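In a metrics-driven flow, deciding which experiment branches to merge and which to clean up comes down to comparing each branch's metric with the baseline. The Python sketch below shows one way to do that. The function name `rank_experiments` is hypothetical, and the metric values assigned to each branch are illustrative assumptions (only the branch names and the .736 on master come from the slides); a higher metric is assumed to be better.

```python
from typing import Dict, List, Tuple


def rank_experiments(metrics: Dict[str, float], baseline: str
                     ) -> Tuple[List[str], List[str]]:
    """Split experiment branches into those that beat the baseline and the rest.

    Both lists are sorted from best to worst metric (higher is better).
    """
    base = metrics[baseline]
    ordered = sorted((b for b in metrics if b != baseline),
                     key=lambda b: metrics[b], reverse=True)
    better = [b for b in ordered if metrics[b] > base]
    rest = [b for b in ordered if metrics[b] <= base]
    return better, rest


# Hypothetical assignment of metric values to the branch names from the slides.
metrics = {
    "master": 0.736,
    "alpha_change": 0.721,
    "tune_L4": 0.832,
    "increase_beta": 0.810,
}
better, rest = rank_experiments(metrics, baseline="master")
```

Branches in `better` are candidates for merging into master; branches in `rest` are failed experiments that can still be pushed for the record and cleaned up later.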
27. Special DVC scenarios
1. Tracking data files: like Git-LFS, but with an S3/GCP/SSH backend.
2. ML model deployment tool.
3. Experimentation on HDFS/Apache Spark.
28. When do you need DVC?
DVC is a data science platform built on top of an open source stack.
It borrows ideas from existing data science platforms but uses an
open source stack and Git as its foundation.
Data science platforms help teams (3+ members) create ML projects.