DVC
Version Control System for Machine Learning Projects
Francesco Casalegno
What is DVC?
● Simple command line Git-like experience.
○ Does not require installing and maintaining any databases.
○ Does not depend on any proprietary online services.
● Management and versioning of datasets and ML models.
○ Data is saved in S3, Google Cloud, Azure, an SSH server, HDFS, or even a local HDD RAID.
● Makes projects reproducible and shareable; answers questions on how a model was built.
● Helps manage experiments with Git tags/branches and metrics tracking.
“DVC aims to replace spreadsheet and document sharing tools (such as Excel or Google Docs)
which are being used frequently as both knowledge repositories and team ledgers.
DVC also replaces both ad-hoc scripts to track, move, and deploy different model versions; as
well as ad-hoc data file suffixes and prefixes.” [ref]
● dvc and git
○ git: versions code and small files
○ dvc: versions data, intermediate results, and models
○ dvc uses git, without storing file content in the repo

● versioning and storing large files
○ dvc saves info about the data in special .dvc files
○ .dvc files can then be versioned using git
○ actual storage happens on a remote storage
○ dvc supports many remote storage types

● dvc main features
○ data versioning
○ data access
○ data pipelines
Getting Started
Install
● Install dvc as a Python package.

$ pip install dvc

● Depending on the remote storage you will use, you may want to install specific dependencies.

$ pip install dvc[s3]  # support for Amazon S3
$ pip install dvc[ssh] # support for SSH
$ pip install dvc[all] # support for all remote types
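● To verify the installation, dvc version prints the DVC and Python versions plus platform info (the exact output varies by release):

$ dvc version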
Initialization

● We must work inside a Git repository. If one does not exist yet, we create and initialize it.

$ mkdir ml_project && cd ml_project
$ git init
$ dvc init
$ git status -s
A .dvc/.gitignore
A .dvc/config
A .dvc/plots/confusion.json
A .dvc/plots/default.json
A .dvc/plots/scatter.json
A .dvc/plots/smooth.json
A .dvcignore
$ git commit -m "Initialize dvc project"
● Initializing a DVC project creates, and automatically git adds, a few important files:
.dvc/.gitignore: tells Git not to track .dvc/cache and .dvc/tmp
.dvc/config: INI-style file with configurations for
- dvc remote storage (name, url)
- dvc cache (reflink/copy/hardlink, location, ...)
- ...
.dvcignore: tells dvc what not to track (empty for now)
.dvc/plots/*.json: plot templates (visualize & compare metrics)
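● As an aside, .dvcignore accepts .gitignore-style patterns; a minimal sketch with hypothetical patterns:

$ cat >> .dvcignore <<'EOF'
# hypothetical patterns: scratch files and notebooks dvc should ignore
*.tmp
notebooks/
EOF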
Data Versioning
Getting some data

● Let's download some data to train and validate a "cat vs. dog" CNN classifier.
We use dvc get, which is like wget for downloading data/models from a remote dvc repo.

$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip
$ unzip data.zip && rm -f data.zip
inflating: data/train/cats/cat.001.jpg
...

● The extracted folder contains 43 MB of JPG images, organized hierarchically:

data
├── train
│   ├── dogs # 500 pictures
│   └── cats # 500 pictures
└── validation
    ├── dogs # 400 pictures
    └── cats # 400 pictures
Start versioning data
● Tracking data with DVC is very similar to tracking code with git.
$ dvc add data/
100% Add|██████████|1/1 [00:30, 30.51s/file]
To track the changes with git, run:
git add .gitignore data.dvc
$ git add .gitignore data.dvc
$ git commit -m "Add first version of data/"
$ git tag -a "v1.0" -m "data v1.0, 1000 images"
.gitignore: updated to tell git not to track the data/ directory
data.dvc: DVC-generated, contains the hash used to track data/ (human readable, can be versioned with git!):
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
● Quite a few things happened when calling dvc add:
○ The hash of the content of data/ was computed and added to a new data.dvc file
○ DVC updated .gitignore to tell Git not to track the content of data/
○ The physical content of data/, i.e. the jpg images, was moved to a cache
(by default the cache is located in .dvc/cache/, but using a remote cache is possible!)
○ The files were linked back to the workspace, so that it looks like nothing happened
(the user can configure the link type to use: hard link, soft link, reflink, or copy; see the sketch below)
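● The link type is controlled by the cache.type config option; a minimal sketch, where the comma-separated values are tried in priority order:

$ dvc config cache.type reflink,hardlink,symlink,copy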
Make changes to tracked data (add)

● Let's download some more data for our "cat vs. dog" dataset.

● To track the changes in our data with dvc, we follow the same procedure as before.
Running dvc diff will confirm that dvc is aware the data has changed!
$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ unzip new-labels.zip && rm -f new-labels.zip
inflating: data/train/cats/cat.501.jpg
...
$ dvc diff
Modified:
data/
$ dvc add data/
$ git diff data.dvc
outs:
-- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
+- md5: 21060888834f7220846d1c6f6c04e649.dir
path: data
$ git commit -am "New version of data/ with more training images"
$ git tag -a "v2.0" -m "data v2.0, 2000 images"
Switch between versions

● To switch versions, first run git checkout.
This updates data.dvc, but not the workspace files in data/!

● To fix this mismatch we simply call dvc checkout.
This reads the cache and updates the data in the
workspace based on the current *.dvc files.
$ git checkout v1.0
$ dvc diff
Modified:
data/
$ dvc checkout
M data/
$ dvc status
Data and pipelines are up to date.
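● To roll back only the dataset, without moving the whole repo to an old commit, the same two-step pattern works on a single target; a sketch based on the commands above:

$ git checkout v1.0 -- data.dvc # restore just the .dvc file from the old tag
$ dvc checkout data.dvc         # sync only that dataset in the workspace
$ git commit -m "Revert data/ to v1.0"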
Working with Remote Storage
Configure remote storage

● A remote storage is for dvc what GitHub is for git:
○ push and pull files from your workspace to the remote
○ easy sharing between developers
○ safe backup, should you ever make a terrible mistake à la rm -rf *

● Many remote storages are supported (Google Drive, Amazon S3, Google Cloud, SSH, HDFS, HTTP, …)
But (as with Git) nothing prevents us from using a "local remote"!

$ mkdir -p ~/tmp/dvc_storage
$ dvc remote add --default loc_remote ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.
$ git add .dvc/config
$ git commit -m "Configure remote storage loc_remote"

● The DVC-generated .dvc/config now contains the remote storage config:

[core]
remote = loc_remote
['remote "loc_remote"']
url = /root/tmp/dvc_storage
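● For a cloud remote the procedure is identical; a minimal sketch assuming a hypothetical S3 bucket named mybucket:

$ pip install "dvc[s3]"                        # S3 dependencies
$ dvc remote add --default s3_remote s3://mybucket/dvcstore
$ dvc remote modify s3_remote region eu-west-1 # optional extra settings
$ git add .dvc/config && git commit -m "Configure S3 remote"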
Storing, sharing, retrieving from storage

● Running dvc push uploads the content of the cache to the remote storage.
This is pretty much like git push.

$ dvc push
1800 files pushed

● Now, even if all the data is deleted from our workspace and cache, we can download it back with dvc pull.
This is pretty much like git pull.

$ rm -rf .dvc/cache data
$ dvc pull # update .dvc/cache with contents from remote
1800 files fetched
$ dvc checkout # update workspace, linking data from .dvc/cache
A data/
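● Under the hood, dvc pull is essentially dvc fetch followed by dvc checkout; the two steps can also be run separately:

$ dvc fetch    # download from the remote into .dvc/cache only
$ dvc checkout # link files from the cache into the workspace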
Access data from storage

● First, we can explore the content of a DVC repo hosted on a Git server.

$ dvc list https://github.com/iterative/dataset-registry
README.md
get-started/
tutorial/
...

● When working outside of a DVC project (e.g. in automated ML model deployment), use dvc get.

$ dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip

● When working inside another DVC project, we want to keep the connection between the projects.
This way, others can tell where the data comes from and whether new versions are available.
dvc import is like dvc get + dvc add, but the resulting .dvc file also includes a ref to the source repo!

$ dvc import https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
$ git add new-labels.zip.dvc .gitignore
$ git commit -m "Import data from source"

● Note: for all these commands we can specify a git revision (SHA, branch, or tag) with --rev <commit>.
Data Registries

● We can build a DVC project dedicated solely to tracking and versioning datasets and models.
Such a repository holds all the metadata and the history of changes of the different datasets.

● This is a data registry: a middleware between ML projects and cloud storage.
It introduces quite a few advantages.
○ Reusability: reproduce and organize feature stores with a simple dvc get / import
○ Optimization: data shared by multiple projects is tracked centrally, in a single location
○ Data as code: leverage Git workflows such as commits, branching, pull requests, CI/CD, …
○ Persistence: a DVC registry-controlled remote storage improves data security
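● A consumer project can then pin a specific dataset version from the registry; a sketch assuming a hypothetical registry repo and dataset path:

$ dvc import --rev v2.0 https://github.com/example/data-registry datasets/cats-dogs
$ git add cats-dogs.dvc .gitignore && git commit -m "Pin cats-dogs dataset at v2.0"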
● But versioning large data files, great as it is for data science, is not all DVC can do:
DVC data pipelines capture how data is filtered, transformed, and used to train models!
Data Pipelines
Motivation

● With dvc add we can track large files, including trained models, embeddings, etc.
However, we also want to track how such files were generated, for reproducibility and better tracking!

● The following is an example of a typical ML pipeline. Its structure is a DAG (directed acyclic graph).
Each stage has a name, a script, parameters, inputs, and outputs:

○ Prepare: train-test split with prepare.py; params: seed, split; input: data.xml; outputs: train.tsv, test.tsv
○ Featurize: TF-IDF embedding with featurize.py; params: max_feats, n_grams; inputs: train.tsv, test.tsv; outputs: train.pkl, test.pkl
○ Train: RandomForest with train.py; params: seed, n_estimators; input: train.pkl; output: model.pkl
○ Evaluate: PR curve and AUC with evaluate.py; inputs: model.pkl, test.pkl; outputs: scores.json, prc.json
Tracking ML Pipelines

● Option A: run the pipeline stages, then track the output artifacts with dvc add.

$ python src/prepare.py data/data.xml
$ dvc add data/prepared/train.tsv data/prepared/test.tsv

● Option B: run each pipeline stage and track it, together with all its dependencies, with dvc run.

$ dvc run -n prepare \
    -p prepare.seed \
    -p prepare.split \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml

-n gives the stage name; -p lists the parameters, read from params.yaml; -d lists the dependencies (including the script itself!); -o lists the outputs to track. The corresponding params.yaml:

prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_feats: 500
  ngrams: 1
...

→ Advantages of Option B
1. outputs are automatically tracked (i.e. saved in .dvc/cache)
2. pipeline stages with parameter names are saved in dvc.yaml
3. deps, params, and outs are all hashed and tracked in dvc.lock
4. like with a Makefile, the pipeline can be reproduced with dvc repro prepare, re-running stages only if their deps changed!
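● Note: in more recent DVC releases, dvc run is superseded by dvc stage add (which only writes the stage to dvc.yaml) followed by dvc repro; a sketch of the equivalent invocation:

$ dvc stage add -n prepare \
    -p prepare.seed -p prepare.split \
    -d src/prepare.py -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
$ dvc repro prepare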
Example

● Let's create a DVC repo for an NLP project.

$ mkdir nlp_project && cd nlp_project
$ git init && dvc init && git commit -m "Init dvc repo"

● Then we download some data, plus some code to prepare the data and train/evaluate a model.

$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml \
    -o data/data.xml
$ dvc add data/data.xml
$ git add data/.gitignore data/data.xml.dvc && git commit -m "Add data, first version"
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip

● The resulting workspace contains the pipeline steps under src/, plus a YAML file (params.yaml) with the params for all the stages:

$ tree
.
├── data
│   ├── data.xml
│   └── data.xml.dvc
├── params.yaml
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py

params.yaml:
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
train:
  seed: 20170428
  n_estimators: 50
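● Before running any stage, the scripts' own dependencies need installing (a requirements file ships with the code, as visible in the tree above):

$ pip install -r src/requirements.txt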
Example

● Let's run the prepare stage.

$ dvc run -n prepare \
    -p prepare.seed \
    -p prepare.split \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
$ git add data/.gitignore dvc.yaml dvc.lock

● Note: dependencies and artifacts are automatically tracked, no need to dvc add them!

● dvc.yaml describes data pipelines, similar to how Makefiles work for building software:

stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared

● dvc.lock matches the dvc.yaml file, and is created and updated by DVC commands like dvc run.
It describes the latest pipeline state in order to:
1. track intermediate and final artifacts (like a .dvc file)
2. allow DVC to detect when stage definitions or dependencies changed, triggering a re-run

prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
Example

● Then we run the featurize and train stages in the same way.

$ dvc run -n featurize \
    -p featurize.max_features \
    -p featurize.ngrams \
    -d src/featurization.py \
    -d data/prepared \
    -o data/features \
    python src/featurization.py data/prepared data/features
$ git add data/.gitignore dvc.yaml dvc.lock

$ dvc run -n train \
    -p train.seed \
    -p train.n_estimators \
    -d src/train.py \
    -d data/features \
    -o model.pkl \
    python src/train.py data/features model.pkl
$ git add data/.gitignore dvc.yaml dvc.lock
Example

● And finally we run the evaluation stage.

$ dvc run -n evaluate \
    -d src/evaluate.py \
    -d model.pkl \
    -d data/features \
    --metrics-no-cache scores.json \
    --plots-no-cache prc.json \
    python src/evaluate.py model.pkl data/features scores.json prc.json
$ git add dvc.yaml dvc.lock

● --metrics-no-cache scores.json declares an output metrics file.
A special kind of output file (-o): it must be JSON, and can be used to compare experiments in tabular form.
Here it contains the data for the AUC score.

● --plots-no-cache prc.json declares an output plot file.
A special kind of output file (-o): it must be JSON, and can be used to compare experiments in plot form.
Here it contains the data for the precision-recall curve plot.

● The *-no-cache variants prevent DVC from storing these files in the cache, so that they can be versioned directly with Git.
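● The plot file can be rendered right away; a sketch assuming prc.json stores precision/recall pairs as in the tutorial:

$ dvc plots show prc.json -x recall -y precision # writes an HTML plot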
Plot dependency graphs

● dvc dag prints the pipeline's dependency graph in the terminal.

$ dvc dag
+-------------------+
| data/data.xml.dvc |
+-------------------+
*
*
*
+---------+
| prepare |
+---------+
*
*
*
+-----------+
| featurize |
+-----------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
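● For a nicer rendering, the graph can be exported in DOT format; a sketch assuming Graphviz is installed:

$ dvc dag --dot | dot -Tpng -o pipeline.png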
Reproducing Pipelines

● dvc repro regenerates data pipeline results by re-running the DAG of stages listed in dvc.yaml.
It compares file hashes with dvc.lock so that stages are re-run only if needed. This is like make in software builds.

● Case 1: nothing changed, so re-running the pipeline stages is skipped.

$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Data and pipelines are up to date.

● Case 2: a dependency changed, so the affected pipeline stages are re-run.
$ sed -i -e "s@max_features: 500@max_features: 1500@g" params.yaml
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize' with command:
python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train' with command:
python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
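● To re-execute stages regardless of whether anything changed, there is a force flag:

$ dvc repro --force train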
Comparing experiments

● dvc params diff rev_1 rev_2 shows how parameters differ between two git revisions/tags.
Without arguments, it shows how they differ between the workspace and the last commit.

$ dvc params diff
Path         Param                   Old  New
params.yaml  featurize.max_features  500  1500

● dvc metrics diff rev_1 rev_2 does the same for metrics.

$ dvc metrics diff
Path         Metric  Value    Change
scores.json  auc     0.61314  0.07139

● dvc plots diff rev_1 rev_2 does the same for plots.

$ dvc plots diff -x recall -y precision
file:///Users/dvc/example-get-started/plots.html
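● To simply inspect the current values rather than a diff, dvc metrics show prints the metrics of the workspace:

$ dvc metrics show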
Shared Development Server
● Disk space optimization: avoid having one cache per user!

● Use DVC as usual:
- each dvc add or dvc run moves data to the shared external cache
- each dvc checkout links the required data into the workspace

● See the DVC documentation for implementation details; the basic setup is not too difficult:

$ mkdir -p path_shared_cache/
$ mv .dvc/cache/* path_shared_cache/
$ dvc cache dir path_shared_cache/
$ dvc config cache.shared group
$ git add .dvc/config && git commit -m "config shared cache"
Conclusions
● DVC is a version control system for large ML data files and artifacts.
● DVC integrates with Git through *.dvc and dvc.lock files, to version files and pipelines, respectively.
● DVC repos can work as data registries, i.e. a middleware between cloud storage and ML projects.
● To track raw ML data files (e.g. an input dataset), use dvc add.
● To track the intermediate or final results of an ML pipeline (e.g. model weights, featurized datasets), use dvc run.
● Consider using a shared development server with a unified, shared external cache.
