@anandsampat
Version Control for Machine
Learning + AI
Workshop
Stanford
Before we begin:
Datmo installation:*
datmo.com/get-started
*Having trouble? Install VirtualBox and follow along instead:
https://docs.datmo.com/guides/using-datmo-on-virtualbox.html
Anand Sampat
Co-founder, Datmo
Workshop Outline
1. Conventional version control
2. The curious case of QoD’s
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
What is Version Control?
“The management of changes to
documents, computer programs, large
web sites, and other collections of
information.”
*AKA `Source Control`
Version Control Timeline
(Figure: timeline of source control systems, with Mercurial labeled.)
https://www.ctl.io/developers/assets/images/blog/scmhistory.png
You’ve probably heard of Git.
Git is a version control system for tracking
changes in computer files and
coordinating work on those files among
multiple people. It is primarily used
for source code management in software
development, but it can be used to keep
track of changes in any set of files.
So, GitHub, right?
(Yes, and no.)
Git(Hub) Revolutionized
Software Development
GitHub = SCM + Hosting + Much More
For developers:
• Self-managed SCM servers became a thing of the past
• Developers could leverage industry best practices for their own personal work
• Community of knowledge built around a known standard
• Collaboration on Open Source Software
For enterprises:
• Advent of continuous integration / deployment
• Removed need for an external code issue tracking tool
• Consolidation of code storage and versioning tools
• Pull Requests, code review, documentation through README
Workshop Outline
1. Conventional version control
2. The curious case of QoD’s
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
QoD’s == Quantitative Oriented Developers
Artificial Intelligence, Data Science, Machine Learning
https://blog.datmo.io/demystifying-the-ml-ai-and-data-science-development-ecosystem-part-1-build-76c6d4911d07
+ Deployment!
+ Post-Deployment!
(DevOps!)
It’s time to talk about MLOps
MLOps: The Elephant in the Room
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf
“ML systems have a special capacity for incurring
technical debt, because they have all of the
maintenance problems of traditional code plus an
additional set of ML-specific issues. This debt may be
difficult to detect because it exists at the system level.”
- Google (Sculley et al., 2015)
“Typical methods for paying down code level
technical debt are not sufficient to address
ML-specific technical debt at the system level.”
- Google (Sculley et al., 2015)
Here’s where traditional tools fall short
http://eng.uber.com/wp-content/uploads/2017/09/image8.png
https://eng.uber.com/michelangelo/
https://code.facebook.com/posts/1072626246134461/introducing-fblearner-flow-facebook-s-ai-backbone/
As for everyone else?
Workshop Outline
1. Conventional version control
2. The curious case of QoD’s
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
What is Datmo?
Datmo is a workflow tool for ML, AI,
and Data Science developers. It helps
with model version control,
environment handling, and
reproducing results through the
power of snapshots.
What are Datmo Snapshots?
Code
Environment
Configuration
Files*
Metrics
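One way to picture these five components is as a single immutable record. The sketch below is purely illustrative and is not Datmo's data model or API: the `Snapshot` class, its field names, the hashing scheme, and every value passed in are assumptions made up for the example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of the five components on the slide (not Datmo's API)."""
    code: str            # e.g. a source-control commit id
    environment: str     # e.g. a container image tag
    configuration: dict  # hyperparameters and other settings
    files: dict          # file path -> content digest
    metrics: dict        # evaluation results

    def snapshot_id(self) -> str:
        # Hash a canonical JSON form so identical contents give identical ids.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Placeholder values only; these are not real results from the workshop model.
a = Snapshot("abc123", "python:3.6", {"C": 1.0}, {"model.pkl": "d41d8c"}, {"accuracy": 0.0})
b = Snapshot("abc123", "python:3.6", {"C": 0.1}, {"model.pkl": "d41d8c"}, {"accuracy": 0.0})

# Same code and files, different configuration: the two records get different ids.
print(a.snapshot_id() != b.snapshot_id())
```

The point of the sketch is that a code commit alone would treat `a` and `b` as the same version, while a record that also covers configuration, environment, and metrics does not.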
Why are they important?
Git Commits: Code, Files*
Datmo Snapshots: Code, Files*, Environment, Configuration, Metrics
How will it help?
Datmo leverages containers to quickly
spin up reproducible
developer environments. It tracks this
environment, along with model
metadata, inside of snapshots.
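To see why the environment needs tracking at all, it helps to list what can differ between two machines. The following is a minimal sketch using only the Python standard library; it does not use containers or Datmo, and `environment_fingerprint` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def environment_fingerprint() -> dict:
    """Collect details that can change results between machines (hypothetical helper)."""
    packages = sorted(
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata.get("Name")  # skip entries with missing metadata
    )
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": packages,
    }


fingerprint = environment_fingerprint()
print(fingerprint["python"], fingerprint["os"], len(fingerprint["packages"]), "packages")
```

Running this on two laptops will usually print different values, which is the gap a pinned container image is meant to close.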
From a broad perspective:
Make MLOps and workflows
manageable and simple, not
completely abstracted away.
Reduce the amount of glue code
so that people can have more
robust pipelines.
GitHub = SCM + Hosting + More
Datmo = Model Versioning +
Environments + Deployment + More
Workshop Outline
1. Conventional version control
2. The curious case of QoD’s
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
Datmo in today’s example
We’re going to use Datmo to show how we can
quickly iterate on our model and streamline our
workflow.
We’ll go through using snapshots for A/B testing,
saving our tasks, and enabling you all to reproduce
my results/make your own changes to the model.
Problem:
Multiclass Classification of Flower Species
Dataset: Fisher’s Iris Flower
http://archive.ics.uci.edu/ml/datasets/Iris
At a glance:
- 4 Features
- 3 Classes
- 150 Rows (50 per class)
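These figures can be checked directly against the raw data file. The sketch below assumes the UCI `iris.data` layout (four comma-separated measurements followed by the species label) and, to stay self-contained, runs on a three-row excerpt rather than the full 150 rows; `summarize` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Three-row excerpt (the first row of each class); the full file has 150 rows.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def summarize(text: str) -> dict:
    """Count features, classes, and rows in iris.data-style CSV text."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # skip blank lines
    features = [[float(value) for value in row[:-1]] for row in rows]
    labels = Counter(row[-1] for row in rows)
    return {
        "n_features": len(features[0]) if features else 0,
        "n_classes": len(labels),
        "n_rows": len(rows),
        "rows_per_class": dict(labels),
    }


print(summarize(SAMPLE))
```

Passing the contents of the full data file to `summarize` instead of `SAMPLE` is how the 4 features, 3 classes, and 50 rows per class could be confirmed.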
Model Experimentation
Live Demo
Workshop Outline
1. Conventional version control
2. The curious case of QoD’s
3. How Datmo bridges the gap
4. Iris dataset model example
5. Reproduce + use the model
Reproducing the Model
Ensure you are signed up on Datmo:
https://datmo.com/signup
One-time initial setup:
$ [sudo] datmo setup
Connect GitHub:
https://datmo.com/settings/integration
Fork the model
Fork from Web Platform GUI (top right corner):
https://datmo.com/anands/workshop-iris-classification
Fetch your model from Datmo
Clone the Datmo Model:
$ datmo clone [YOUR-USERNAME]/workshop-iris-classification
Jump into this directory:
$ cd workshop-iris-classification
Check out an existing snapshot
View all model snapshots
$ datmo snapshot ls
Check out a particular snapshot
$ datmo snapshot checkout --id ______
Create your own snapshot
Track Snapshots
https://datmo.com/anands/workshop-iris-classification/snapshots?grid=1
Run the Task
$ datmo task run "python3 classifier.py"
We want our Python file to be run
inside of the container. Why?
Create a Snapshot from Task output
$ datmo snapshot task --id _________
What just happened?
Command: Result
• datmo clone: Datmo cloned the model from the platform,
bringing all of the necessary resources to local.
• datmo snapshot checkout: Datmo set your current code to the
state of the desired snapshot.
• datmo task run: Datmo built the environment inside of a container,
executed the task inside of the container, and logged the results.
• datmo snapshot task: Datmo combined the task output files,
environment, code, configs, and metrics into a snapshot.
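The "run a task and log the results" step can be approximated in a few lines. This is a rough sketch of the idea only: it runs the command on the host rather than in a container, `run_task` and the record fields are hypothetical, and nothing here reflects how Datmo is implemented.

```python
import subprocess
import sys
import time


def run_task(command: list) -> dict:
    """Run a command and return a log record of what happened (hypothetical helper)."""
    started = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": command,
        "exit_code": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        "duration_s": round(time.time() - started, 3),
    }


# Stand-in for "python3 classifier.py": a one-line script run with the current interpreter.
record = run_task([sys.executable, "-c", "print('accuracy placeholder')"])
print(record["exit_code"], record["stdout"].strip())
```

A record like this is the raw material for a snapshot: the captured output and exit code sit alongside the code, environment, and configuration that produced them.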
Key Takeaways
1. Traditional Source Control isn’t enough for QoD’s
(Data Science, ML, and AI)
2. Think about MLOps before you’re “in too deep”
3. Just as GitHub revolutionized Software
Engineering, Datmo does the same for QoD’s
Code Available at:
https://datmo.com/anands/workshop-iris-classification
Full Slides Available at:
https://bit.ly/stanford-version-control
Going Forward
Next Steps
1. Check out example workflows in our docs to
create your own Datmo project here
2. Learn more about ML and browse more content
at our blog: https://blog.datmo.com
3. Interested in updates? You’ll be signed up for our
weekly newsletter if you signed up today.
4. Stay tuned for our open source library this
month. It’ll be at https://github.com/datmo/datmo
Thank You!
References
Nuts and Bolts of Source Control:
http://ericsink.com/scm/source_control.html
2015 NIPS Paper from Google:
https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf

Version Control in Machine Learning + AI (Stanford)