SlideShare a Scribd company logo
1 of 29
Download to read offline
Data versioning in
machine learning
projects
Dmitry Petrov
dmitry@iterative.ai
PyData Berlin 2018
Agenda
1. What makes ML process special?
2. Data files
3. ML pipelines and reproducibility
4. Data science workflow
5. Beyond the Horizon
1
Dmitry Petrov
Twitter: @FullStackML
PhD in Computer Science
Hello
Co-Founder & CEO at Iterative.AI. San Francisco, USA
ex-Data Scientist at Microsoft (BingAds). Seattle, USA
ex-Head of Lab at St. Petersburg Electrotechnical University, Russia
Chapter I
What makes ML process special?
Data files hell problem
1. Data files are not in your repository.
2. Tons of data file versions:
- model.pkl
- model_L7_e120.pkl
- model_vgg16_L5tune_e120.pkl
- model_L7_e160_cleansed.pkl
- model_vgg16_L45tune_e120.pkl
- model_vgg16_L45tune_e160_noempty.pkl
- …
3. Data files are not connected to code files.
- $ git checkout finetune_head # creates even more mess
Data files hell in a team
1. How to create a reproducible ML project?
2. How to scale ML process in a team:
a. feature extraction
b. a current model tuning
c. experimenting with new models
3. How to pass ML model to deployment or revert a model (to
devops)
Methodology mismatch
“Data science as different from software as software was
different from hardware”
https://dominodatalab.wistia.com/medias/fq0l4152sh
Agile development methodology should cover data science.
Hardware Software Data science/ML
Methodology Waterfall Agile/Scrum Agile/?
What is special about Data Science?
1
New artifacts to manage:
● Experiment: Code + Data files.
● Metrics.
● ML pipelines and reproducibility.
Different process:
● R&D like. A lot of trials and errors, progress should be measured in a
different way.
● Ephemerality. Hard to communicate and track the progress.
* The image from: https://www.customsigns.com/experiment-fail-learn-repeat-poster-sign-18-x-24
DVC project motivation
Open source tool Data Version Control to manage ML projects:
http://dvc.org
DVC is a data science platform on top of open source stack.
GitHub repo: https://github.com/iterative/dvc
Download binaries (Mac, Linux, Windows) or $ pip install dvc
It extends Git by commands: dvc add, dvc run, dvc repro, dvc
remote
● Experiment as commitbranch: Code + Data files.
● Large data files:
○ Local cache.
○ Optimized for 1Gb - 100Gb file size.
○ Data remotes: S3, GCP, SSH.
● Metrics per experiment.
● ML pipelines.
● Reproducibility.
What DVC does?
Chapter II
Data files
Existing solutions
Git-LFS What is required
A single file size < 2Gb 1Gb - 100Gb
Workspace size (all
files)
Slow if 5Gb+ Unlimited
Not garbage collector
for data
20 experiments by 5Gb
each ~= 100Gb
Remove data files from
some of experiments
Data storage Proprietary and paid: only
GitHub and GitLab.
S3, GCP or custom
server (rsync, SFTP)
DVC: add data files
How data file works
Commit data file
Add data remote
DVC: checkout and optimization
Optimizations:
1. No data file copying - hardlinks copy instead.
2. Checksum caching and timesteps tracking.
3. Supports reflinks (CoW - Copy on Write) in modern file systems: BTRFS,
ReFS, XFS.
As a result: 100Gb data file checkout works instantaneously.
Chapter III
ML pipelines and reproducibility
A simple pipeline
Pipeline: images.zip → images/ → model.p → plots.jpg
Specify: input (-d), output (-o) and command.
Reproducibility
DVC reproduces ML pipeline in a single command:
Any DAG (Directed acyclic graph) is supported.
Chapter IV
Data science workflow
Workflow change is needed
Methodology → Workflows → Tools
Git is flexible: you can define your workflow.
master
new_feature
Git workflows: from software to ML
master
new_feature
Gitflow: feature driven
increase_beta
Data science flow: metrics driven
.721
tune_L4
alpha_change
.736 .832
.832.827
.745
master .736
.809
.810
Data science flow: why new workflow?
increase_beta
Data science flow: metrics driven
.721
tune_L4
alpha_change
.736 .832
.832.827
.745
master .736
.809
.810
Different people can
work on different ideas.
Collaborate without
waiting.
Data science flow: hints
increase_beta
Data science flow: metrics driven
.721
tune_L4
alpha_change
.736 .832
.832.827
.745
master .736
.809
.810
1. Do not forget to create
branches:
$ git checkout master
$ git checkout -b alpha_change master
2. Keep failed experiments:
$ git checkout alpha_change
$ git push master alpha_change
3. Clean up not important
experiments.
Chapter V
Beyond the Horizon
Special DVC scenarios
1. Tracking data files - like Git-LFS but S3GCPSSH backend.
2. ML model deployment tool.
3. Experimentation on HDFS/Apache Spark.
When you need DVC?
DVC is a data science platform on top of open source stack.
It uses some ideas from existing data science platforms but uses
open source stack and Git as a foundation.
Data science platforms helps on creating ML projects in teams
(3+ members).
Thank you!1
Questions
Twitter: @FullStackML
Email: dmitry@iterative.ai
Discuss: discuss.dvc.org
Actions
Visit dvc.org
Star github.com/iterative/dvc

More Related Content

What's hot

Learn to use dplyr (Feb 2015 Philly R User Meetup)
Learn to use dplyr (Feb 2015 Philly R User Meetup)Learn to use dplyr (Feb 2015 Philly R User Meetup)
Learn to use dplyr (Feb 2015 Philly R User Meetup)Fan Li
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Taro L. Saito
 
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...InfluxData
 
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores Finnoto
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores FinnotoPGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores Finnoto
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores FinnotoEqunix Business Solutions
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeFlink Forward
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaSri Ambati
 
Git and GitHub at the San Francisco JUG
 Git and GitHub at the San Francisco JUG Git and GitHub at the San Francisco JUG
Git and GitHub at the San Francisco JUGMatthew McCullough
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of PrestoTaro L. Saito
 
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoPGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoEqunix Business Solutions
 
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo Gargiulo
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo GargiuloOSMC 2009 | NConf - Enterprise Nagios configurator by Angelo Gargiulo
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo GargiuloNETWAYS
 
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...Equnix Business Solutions
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform Pradeep Bhadani
 
H2O 3 REST API Overview
H2O 3 REST API OverviewH2O 3 REST API Overview
H2O 3 REST API OverviewRaymond Peck
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Taro L. Saito
 

What's hot (20)

Migration to drupal 8.
Migration to drupal 8.Migration to drupal 8.
Migration to drupal 8.
 
Learn to use dplyr (Feb 2015 Philly R User Meetup)
Learn to use dplyr (Feb 2015 Philly R User Meetup)Learn to use dplyr (Feb 2015 Philly R User Meetup)
Learn to use dplyr (Feb 2015 Philly R User Meetup)
 
PyData Introduction
PyData IntroductionPyData Introduction
PyData Introduction
 
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021Unifying Frontend and Backend Development with Scala - ScalaCon 2021
Unifying Frontend and Backend Development with Scala - ScalaCon 2021
 
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
Scaling Prometheus Metrics in Kubernetes with Telegraf | Chris Goller | Influ...
 
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores Finnoto
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores FinnotoPGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores Finnoto
PGConf.ASIA 2019 Bali - Patroni on GitLab.com - Jose Cores Finnoto
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi Mehta
 
Git and GitHub at the San Francisco JUG
 Git and GitHub at the San Francisco JUG Git and GitHub at the San Francisco JUG
Git and GitHub at the San Francisco JUG
 
Reading The Source Code of Presto
Reading The Source Code of PrestoReading The Source Code of Presto
Reading The Source Code of Presto
 
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan PachenkoPGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
PGConf.ASIA 2019 Bali - Keynote Speech 2 - Ivan Pachenko
 
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo Gargiulo
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo GargiuloOSMC 2009 | NConf - Enterprise Nagios configurator by Angelo Gargiulo
OSMC 2009 | NConf - Enterprise Nagios configurator by Angelo Gargiulo
 
Product Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the WebProduct Designer Hub - Taking HPD to the Web
Product Designer Hub - Taking HPD to the Web
 
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...
PGConf.ASIA 2019 Bali - How did PostgreSQL Write Load Balancing of Queries Us...
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform GDG London Workshop: Build GCP infrastructure with Terraform
GDG London Workshop: Build GCP infrastructure with Terraform
 
H2O 3 REST API Overview
H2O 3 REST API OverviewH2O 3 REST API Overview
H2O 3 REST API Overview
 
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
Journey of Migrating 1 Million Presto Queries - Presto Webinar 2020
 

Similar to PyData Berlin 2018: dvc.org

22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...Athens Big Data
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science PlatformQAware GmbH
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Fwdays
 
2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and modelsGe Org
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchDirk Petersen
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep LearningGreg Gandenberger
 
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiMy past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiKim Kao
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFKeith Kraus
 
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...DataScienceConferenc1
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshIanFurlong4
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTJames Chittenden
 
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...DataScienceConferenc1
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Dayprogrammermag
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLDESMOND YUEN
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Databricks
 
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Josh Levy-Kramer
 

Similar to PyData Berlin 2018: dvc.org (20)

22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
22nd Athens Big Data Meetup - 1st Talk - MLOps Workshop: The Full ML Lifecycl...
 
The Quest for an Open Source Data Science Platform
 The Quest for an Open Source Data Science Platform The Quest for an Open Source Data Science Platform
The Quest for an Open Source Data Science Platform
 
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
Oleksii Moskalenko "Continuous Delivery of ML Pipelines to Production"
 
2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models2018 data engineering for ml asset management for features and models
2018 data engineering for ml asset management for features and models
 
Scientific Computing @ Fred Hutch
Scientific Computing @ Fred HutchScientific Computing @ Fred Hutch
Scientific Computing @ Fred Hutch
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep Learning
 
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiMy past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
 
Handout3o
Handout3oHandout3o
Handout3o
 
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDFGPU-Accelerating UDFs in PySpark with Numba and PyGDF
GPU-Accelerating UDFs in PySpark with Numba and PyGDF
 
The DBpedia databus
The DBpedia databusThe DBpedia databus
The DBpedia databus
 
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...
[DSC Adria 23] Mikhail Rozhkov DVC in Machine Learning Engineering and MLOps ...
 
Scaling MLOps on NVIDIA DGX Systems
Scaling MLOps on NVIDIA DGX SystemsScaling MLOps on NVIDIA DGX Systems
Scaling MLOps on NVIDIA DGX Systems
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoT
 
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
[DSC Europe 23] Matteo Molteni - Implementing a Robust CI Workflow with dbt f...
 
Google Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 DayGoogle Cloud Computing on Google Developer 2008 Day
Google Cloud Computing on Google Developer 2008 Day
 
Very large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDLVery large scale distributed deep learning on BigDL
Very large scale distributed deep learning on BigDL
 
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
Accelerating Deep Learning Training with BigDL and Drizzle on Apache Spark wi...
 
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
Reproducible data science: review of Pachyderm, Data Version Control and GIT ...
 

Recently uploaded

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationBoston Institute of Analytics
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 

Recently uploaded (20)

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 

PyData Berlin 2018: dvc.org

  • 1. Data versioning in machine learning projects Dmitry Petrov dmitry@iterative.ai PyData Berlin 2018
  • 2. Agenda 1. What makes ML process special? 2. Data files 3. ML pipelines and reproducibility 4. Data science workflow 5. Beyond the Horizon
  • 3. 1 Dmitry Petrov Twitter: @FullStackML PhD in Computer Science Hello Co-Founder & CEO at Iterative.AI. San Francisco, USA ex-Data Scientist at Microsoft (BingAds). Seattle, USA ex-Head of Lab at St. Petersburg Electrotechnical University, Russia
  • 4. Chapter I What makes ML process special?
  • 5. Data files hell problem 1. Data files are not in your repository. 2. Tons of data file versions: - model.pkl - model_L7_e120.pkl - model_vgg16_L5tune_e120.pkl - model_L7_e160_cleansed.pkl - model_vgg16_L45tune_e120.pkl - model_vgg16_L45tune_e160_noempty.pkl - … 3. Data files are not connected to code files. - $ git checkout finetune_head # creates even more mess
  • 6. Data files hell in a team 1. How to create a reproducible ML project? 2. How to scale ML process in a team: a. feature extraction b. a current model tuning c. experimenting with new models 3. How to pass ML model to deployment or revert a model (to devops)
  • 7. Methodology mismatch “Data science as different from software as software was different from hardware” https://dominodatalab.wistia.com/medias/fq0l4152sh Agile development methodology should cover data science. Hardware Software Data science/ML Methodology Waterfall Agile/Scrum Agile/?
  • 8. What is special about Data Science? 1 New artifacts to manage: ● Experiment: Code + Data files. ● Metrics. ● ML pipelines and reproducibility. Different process: ● R&D like. A lot of trials and errors, progress should be measured in a different way. ● Ephemerality. Hard to communicate and track the progress. * The image from: https://www.customsigns.com/experiment-fail-learn-repeat-poster-sign-18-x-24
  • 9. DVC project motivation Open source tool Data Version Control to manage ML projects: http://dvc.org DVC is a data science platform on top of open source stack. GitHub repo: https://github.com/iterative/dvc Download binaries (Mac, Linux, Windows) or $ pip install dvc It extends Git by commands: dvc add, dvc run, dvc repro, dvc remote
  • 10. ● Experiment as commitbranch: Code + Data files. ● Large data files: ○ Local cache. ○ Optimized for 1Gb - 100Gb file size. ○ Data remotes: S3, GCP, SSH. ● Metrics per experiment. ● ML pipelines. ● Reproducibility. What DVC does?
  • 12. Existing solutions Git-LFS What is required A single file size < 2Gb 1Gb - 100Gb Workspace size (all files) Slow if 5Gb+ Unlimited Not garbage collector for data 20 experiments by 5Gb each ~= 100Gb Remove data files from some of experiments Data storage Proprietary and paid: only GitHub and GitLab. S3, GCP or custom server (rsync, SFTP)
  • 13. DVC: add data files
  • 14. How data file works
  • 17. DVC: checkout and optimization Optimizations: 1. No data file copying - hardlinks copy instead. 2. Checksum caching and timesteps tracking. 3. Supports reflinks (CoW - Copy on Write) in modern file systems: BTRFS, ReFS, XFS. As a result: 100Gb data file checkout works instantaneously.
  • 18. Chapter III ML pipelines and reproducibility
  • 19. A simple pipeline Pipeline: images.zip → images/ → model.p → plots.jpg Specify: input (-d), output (-o) and command.
  • 20. Reproducibility DVC reproduces ML pipeline in a single command: Any DAG (Directed acyclic graph) is supported.
  • 22. Workflow change is needed Methodology → Workflows → Tools Git is flexible: you can define your workflow. master new_feature
  • 23. Git workflows: from software to ML master new_feature Gitflow: feature driven increase_beta Data science flow: metrics driven .721 tune_L4 alpha_change .736 .832 .832.827 .745 master .736 .809 .810
  • 24. Data science flow: why new workflow? increase_beta Data science flow: metrics driven .721 tune_L4 alpha_change .736 .832 .832.827 .745 master .736 .809 .810 Different people can work on different ideas. Collaborate without waiting.
  • 25. Data science flow: hints increase_beta Data science flow: metrics driven .721 tune_L4 alpha_change .736 .832 .832.827 .745 master .736 .809 .810 1. Do not forget to create branches: $ git checkout master $ git checkout -b alpha_change master 2. Keep failed experiments: $ git checkout alpha_change $ git push master alpha_change 3. Clean up not important experiments.
  • 27. Special DVC scenarios 1. Tracking data files - like Git-LFS but S3GCPSSH backend. 2. ML model deployment tool. 3. Experimentation on HDFS/Apache Spark.
  • 28. When you need DVC? DVC is a data science platform on top of open source stack. It uses some ideas from existing data science platforms but uses open source stack and Git as a foundation. Data science platforms helps on creating ML projects in teams (3+ members).
  • 29. Thank you!1 Questions Twitter: @FullStackML Email: dmitry@iterative.ai Discuss: discuss.dvc.org Actions Visit dvc.org Star github.com/iterative/dvc