Fast and Reproducible Deep Learning
Greg Gandenberger (he/him/his)
ShopRunner
Today’s Agenda
1 / Data
2 / Code
3 / Dependencies
4 / Experiments
Example Project:
https://github.com/uptake/autofocus/tree/v3
1 Data
5/
Problem
Many deep learning projects use large collections of files. Processing
those files and moving them around can take a very long time.
6/
Example
Download images from S3.
7/
Example
Trim metadata.
8/
Example
Resize.
9/
Example
Record whether grayscale or color.
10/
Problem
- Just loading and saving 100k+ images can take hours.
- You have to repeat the process every time your steps change.
- Some of your data is corrupted.
- It’s hard to keep track of which files have been processed in what
ways.
11/
Solution
1. Process files in parallel.
12/
Solution
2. Do all operations (e.g. download, trim, resize) in one pass.
13/
Solution
3. Handle errors from corrupted data.
14/
Solution
4. Skip existing files when appropriate.
15/
Solution
5. Keep organized records, e.g. “Grayscale, skipped”, “Error”, “Color, written to ~/images/f0ahu021k.jpg”.
17/
Don’t worry, we gotchu
ShopRunner’s open-source Creevey library handles all of this junk.
https://github.com/ShopRunner/creevey
19/
Solution
1. You specify the number of threads; Creevey manages them.
20/
Solution
2. You specify a sequence of functions (e.g. download, trim, resize); Creevey pipes each file through it.
21/
Solution
3. Creevey catches the error types you specify.
22/
Solution
4. Creevey optionally skips existing files.
23/
Solution
5. Creevey returns a run report, with entries like “Grayscale, skipped”, “Error”, and “Color, written to ~/images/f0ahu021k.jpg”.
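To make the pattern concrete, here is a rough from-scratch sketch of this kind of pipeline in plain Python. It is not Creevey’s API (see the Creevey README for the real interface), and the function and parameter names are hypothetical:

# Illustrative sketch only -- not Creevey's actual API; helper names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd


def run_pipeline(inpaths, funcs, write_func, outpath_func, n_threads=8,
                 skip_existing=True, exceptions_to_catch=(OSError,)):
    """Pipe each file through `funcs` in one pass, in parallel, and return a run report."""

    def _process(inpath):
        outpath = Path(outpath_func(inpath))
        record = {"inpath": inpath, "outpath": str(outpath), "skipped": False, "error": None}
        if skip_existing and outpath.exists():
            record["skipped"] = True          # 4. skip existing files
            return record
        try:
            obj = inpath
            for func in funcs:                # 2. one pass: e.g. download, trim, resize
                obj = func(obj)
            write_func(obj, outpath)
        except exceptions_to_catch as exc:    # 3. corrupted files don't kill the run
            record["error"] = repr(exc)
        return record

    with ThreadPoolExecutor(max_workers=n_threads) as pool:   # 1. process in parallel
        records = list(pool.map(_process, inpaths))
    return pd.DataFrame(records)              # 5. organized record of every file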
25/
Don’t worry, we gotchu
Creevey also provides predefined pipelines and functions for image
processing:
- Load an image from a URL or file path.
- Write an image to disk.
- Calculate image statistics such as brightness or dhash.
- Transform images e.g. by padding, cropping, resizing, normalizing,
converting to grayscale.
27/
Actually, Not Yet
We would love to get contributions!
- Additional image operations
- Operations for other kinds of data (text, audio, video, etc.)
- More framework capabilities?
Code example: tiny.cc/58dwlz
29/
Run Report Example
30/
Problem
Different people like to store data in different places, e.g.:
- Mounted volumes
- Centralized “data” directory
- Repo-specific “data” directory
31/
Unsatisfactory Solution
Be opinionated.
32/
Unsatisfactory Solution
Put path parameters everywhere.
33/
Better Solution: Use an environment variable!
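For example, resolve every path relative to a single environment variable. (The variable name and directory layout below are just an illustration, not the project’s actual convention.)

import os
from pathlib import Path

# Each person points this at wherever they keep data: a mounted volume,
# a central data directory, or a directory inside the repo.
DATA_DIR = Path(os.environ.get("AUTOFOCUS_DATA_DIR", Path.home() / "data" / "autofocus"))
RAW_DIR = DATA_DIR / "images_raw"
PROCESSED_DIR = DATA_DIR / "images_processed"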
34/
Additional Tips
Use Parquet or Feather rather than CSV to preserve column types, so you don’t have to write code like this:
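For instance, with a CSV you end up re-coercing types on every load, while Parquet round-trips them. (Column names here are illustrative; Parquet support requires pyarrow or fastparquet.)

import pandas as pd

# CSV: column types are lost, so every load needs repair code like this.
df = pd.read_csv("labels.csv", parse_dates=["date"], dtype={"location_id": str})
df["has_animal"] = df["has_animal"].astype(bool)

# Parquet: types survive the round trip, so none of that is needed.
df.to_parquet("labels.parquet")
df = pd.read_parquet("labels.parquet")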
35/
Additional Tips
Install s3fs to enable reading from and writing to S3 directly:
pd.read_parquet('s3://autofocus/lpz_data/labels_2012_2016_2017.parquet')
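Writing works the same way once s3fs (plus a Parquet engine such as pyarrow) is installed; the bucket and key below are just an example:

df.to_parquet('s3://my-bucket/processed/labels.parquet')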
2 Code
37/
Problem
Version control systems are designed for code-centric projects with a
central branch that evolves linearly.
38/
Problem
Machine learning projects aren’t like that.
- The central elements are data and experiments, not code.
- Code written for one experiment generally doesn’t replace code
written for previous experiments.
39/
Solution
Step 1: Separate libraries and apps from the model development workspace.
(Diagram: the Autofocus project, with Creevey as a library, Autofocus-Predict as an app, and Autofocus-Train as a workspace.)
40/
Libraries
- Packages that you install and import.
- Fit well into the GitHub paradigm.
- Useful across projects.
- Ideally coherent, but you might need a grab-bag or two.
41/
Apps
- Code that you run directly.
- Example: a Flask app that serves model predictions.
- Fit well into the GitHub paradigm.
Example prediction output:
{
"bird": 6.235780460883689e-07,
"cat": 9.127776934292342e-07,
"coyote": 2.1184381694183685e-05,
"deer": 3.6601684314518934e-06,
"dog": 1.4745426142326323e-06,
"empty": 0.0026697132270783186,
"human": 1.064212392520858e-05,
"mouse": 4.847318102463305e-09,
"opossum": 9.763967682374641e-05,
"raccoon": 0.9986177682876587,
"rat": 4.3888848111350853e-10,
"squirrel": 1.2888597211713204e-06,
"unknown": 0.0004612557531800121,
}
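A minimal sketch of such an app, assuming Flask; the route, field name, and placeholder prediction function are illustrative rather than the actual Autofocus-Predict code:

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_proba(image_bytes):
    # Placeholder: a real app would preprocess the image and run the trained model.
    return {"raccoon": 0.999, "empty": 0.001}


@app.route("/predict", methods=["POST"])
def predict():
    image_bytes = request.files["image"].read()
    return jsonify(predict_proba(image_bytes))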
42/
Options
- Put each library, app, and workspace in a separate repo.
- Put closely related code into a single repo, separating these
components at the directory level.
43/
Solution
Step 2a: If possible, use a platform such as Kaggle that follows a more domain-appropriate paradigm.
44/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Put simple instructions for retrieving data front and center in the README. Provide at least three versions of the data:
- Raw
- Lightly processed (e.g. trimmed and resized)
- Sample lightly processed (<1 GB)
Example: https://github.com/uptake/autofocus/tree/v3#getting-the-data
45/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Put the code you used for lightly processing the data in a single directory.
Example: https://github.com/uptake/autofocus/blob/master/autofocus/build_dataset/lpz_2016_2017/process_raw.py
46/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Create a directory for model training. Give each set of experiments its own self-contained subdirectory with its own README, requirements.txt, Dockerfile, etc.
Example: https://github.com/uptake/autofocus/tree/v3/autofocus/train_model
3 Dependencies
48/
Problem
Pinning versions for all requirements is necessary for ensuring that runs are reproducible, but it makes updating hard and causes you to miss out on upgrades (including security patches!).
49/
Solution
Use `pip-tools` to get the best of both worlds.
You write setup.py (for libraries) or requirements.in (for apps), specifying only your direct dependencies with as little version pinning as possible:
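For example, an app’s requirements.in might contain nothing more than this (package names are placeholders for whatever the project actually imports):

# requirements.in: direct dependencies only, pinned as loosely as possible
pandas
requests>=2.20
torch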
50/
Solution
`pip-compile` generates a requirements.txt file that pins a complete
and consistent set of exact versions for all direct and indirect
dependencies and notes the sources of indirect dependencies.
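Running `pip-compile requirements.in` then yields something along these lines (the exact header and version numbers will differ; these are illustrative):

#
# This file is autogenerated by pip-compile
#
certifi==2019.6.16        # via requests
chardet==3.0.4            # via requests
idna==2.8                 # via requests
numpy==1.17.0             # via pandas, torch
pandas==0.25.0
python-dateutil==2.8.0    # via pandas
pytz==2019.1              # via pandas
requests==2.22.0
six==1.12.0               # via python-dateutil
torch==1.2.0
urllib3==1.25.3           # via requests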
51/
Solution
`pip-sync` ensures that you have installed exactly the libraries in a
requirements.txt file, e.g. uninstalling other packages that are already
in the environment.
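Usage is a single command, run inside a dedicated virtual environment (since it will uninstall anything not listed):

pip-sync requirements.txt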
52/
`pip-tools` Tips
Configure a pre-push hook to confirm that the output of `pip-compile` matches requirements.txt if requirements.in has changed (sketch below).
- Pre-push rather than pre-commit because it takes several seconds.
- Only when `requirements.in` has changed because every update risks introducing breaking changes.
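One way to wire that up is a small script installed as `.git/hooks/pre-push`. This is a rough sketch (it assumes requirements.in and requirements.txt sit at the repo root and that you compare against origin/master), not the exact hook we use:

#!/usr/bin/env python
"""Rough pre-push hook sketch: block the push if requirements.in changed
but requirements.txt was not regenerated."""
import subprocess
import sys
import tempfile


def changed_vs_remote(path):
    # Did this file change relative to the upstream branch? (Assumes origin/master.)
    diff = subprocess.run(
        ["git", "diff", "--name-only", "origin/master...HEAD", "--", path],
        capture_output=True, text=True, check=True,
    )
    return bool(diff.stdout.strip())


def pinned_lines(text):
    # Compare only the pinned requirement lines; skip autogenerated header comments.
    return [line for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")]


def main():
    if not changed_vs_remote("requirements.in"):
        return 0  # nothing to check; pushing is fine
    with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
        subprocess.run(
            ["pip-compile", "requirements.in", "--output-file", tmp.name],
            check=True, capture_output=True,
        )
        fresh = open(tmp.name).read()
    current = open("requirements.txt").read()
    if pinned_lines(fresh) != pinned_lines(current):
        print("requirements.txt is stale; run pip-compile and commit the result.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())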
53/
`pip-tools` Tips
If you hit unresolvable dependency conflicts, fall back to specifying abstract dependencies and writing the output of `pip freeze` to a file.
54/
Other Dependency Management Tips
Use `nvidia-docker` with a GPU.
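With nvidia-docker 2, that typically means launching containers with `docker run --runtime=nvidia ...`; Docker 19.03 and later can instead expose GPUs natively with `docker run --gpus all ...`. Check which combination your Docker and driver versions support.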
55/
Other Dependency Management Tips
- Pin exact versions of PyTorch or TensorFlow to ensure that you will be able to load an exported model.
- Install big dependencies early in a Docker build to take advantage of layer caching: for example, copy and install requirements.txt before copying the rest of your code, so that code-only changes do not invalidate the cached dependency layer.
4 Experiments
57/
Problem
It’s hard to keep track of what you have tried and what the results
were.
58/
Solution
Use a tool like MLflow Tracking to log hyperparameters and metrics
automatically.
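A minimal sketch with MLflow’s Python API (the parameter and metric names, and the placeholder loss, are just examples):

import mlflow

with mlflow.start_run():
    # Log hyperparameters once, then metrics as training proceeds.
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 64)
    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # placeholder for your real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)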
59/
Solution
Record additional observations in a “lab notebook.”
Summary
61/
Summary
- Use `creevey` to process large collections of files.
- Organize machine learning projects appropriately.
⊿ Separate libraries and apps from model development materials.
⊿ Organize model development materials around data and experiments rather than code.
⊿ Create separate workspaces for different versions of the model
development code.
- Use `pip-tools` to manage dependencies.
- Use lab notebooks and something like MLflow to track
experiments.
63/
Tonks
ShopRunner will soon be releasing an open-source library for training
models with multiple inputs and/or outputs.
Example (product description and attributes):
Sail to Sable sweater featuring tiered, ruffle sleeves. Crew neckline. Long sleeves.
Pattern: Solid
Sleeve length: Long
64/
Credits
Thanks to the ShopRunner Data Science Team for feedback and to our resident memologist Nathan Cooper Jones for his invaluable contributions.
65/
Contact
- Name: Greg Gandenberger
- Company: ShopRunner
- Email: greg@gandenberger.org
- Website: gandenberger.org
- Twitter: @ggandenberger
- GitHub: gsganden
- Autofocus: https://github.com/uptake/autofocus/
- Creevey:
⊿ https://github.com/ShopRunner/creevey
⊿ https://pypi.org/project/creevey/
