Fast and Reproducible Deep Learning
Greg Gandenberger (he/him/his)
ShopRunner
Today’s Agenda
1 / Data
2 / Code
3 / Dependencies
4 / Experiments
Example Project:
https://github.com/uptake/autofocus/tree/v3
1 Data
5/
Problem
Many deep learning projects use large collections of files. Processing
those files and moving them around can take a very long time.
6/
Example
Download images from S3.
7/
Example
Trim metadata.
8/
Example
Resize.
9/
Example
Record whether grayscale or color.
10/
Problem
- Just loading and saving 100k+ images can take hours.
- You have to repeat the process every time your steps change.
- Some of your data is corrupted.
- It’s hard to keep track of which files have been processed in what
ways.
11/
Solution
1. Process files in parallel.
12/
Solution
2. Do all operations (e.g. download, trim, resize) in one pass.
13/
Solution
3. Handle errors from corrupted data.
14/
Solution
4. Skip existing files when appropriate.
15/
Solution
5. Keep organized records, e.g. “Grayscale, skipped”, “Error”, “Color, written to ~/images/f0ahu021k.jpg”.
17/
Don’t worry, we gotchu
ShopRunner’s open-source Creevey library handles all of this junk.
https://github.com/ShopRunner/creevey
19/
Solution
1. You specify the number of threads; Creevey manages them.
20/
Solution
2. You specify a sequence of functions (e.g. download, trim, resize); Creevey pipes each file through it.
21/
Solution
3. Creevey catches the error types you specify.
22/
Solution
4. Creevey optionally skips existing files.
23/
Solution
5. Creevey returns a run report, with entries like “Grayscale, skipped”, “Error”, and “Color, written to ~/images/f0ahu021k.jpg”.
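To make the pattern concrete, here is a rough from-scratch sketch of this kind of pipeline in plain Python. It is not Creevey’s API (see the Creevey README for the real interface), and the function and parameter names are hypothetical:

# Illustrative sketch only -- not Creevey's actual API; helper names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import pandas as pd


def run_pipeline(inpaths, funcs, write_func, outpath_func, n_threads=8,
                 skip_existing=True, exceptions_to_catch=(OSError,)):
    """Pipe each file through `funcs` in one pass, in parallel, and return a run report."""

    def _process(inpath):
        outpath = Path(outpath_func(inpath))
        record = {"inpath": inpath, "outpath": str(outpath), "skipped": False, "error": None}
        if skip_existing and outpath.exists():
            record["skipped"] = True          # 4. skip existing files
            return record
        try:
            obj = inpath
            for func in funcs:                # 2. one pass: e.g. download, trim, resize
                obj = func(obj)
            write_func(obj, outpath)
        except exceptions_to_catch as exc:    # 3. corrupted files don't kill the run
            record["error"] = repr(exc)
        return record

    with ThreadPoolExecutor(max_workers=n_threads) as pool:   # 1. process in parallel
        records = list(pool.map(_process, inpaths))
    return pd.DataFrame(records)              # 5. organized record of every file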
25/
Don’t worry, we gotchu
Creevey also provides predefined pipelines and functions for image
processing:
- Load an image from a URL or file path.
- Write an image to disk.
- Calculate image statistics such as brightness or dhash.
- Transform images e.g. by padding, cropping, resizing, normalizing,
converting to grayscale.
27/
Actually, Not Yet
We would love to get contributions!
- Additional image operations
- Operations for other kinds of data (text, audio, video, etc.)
- More framework capabilities?
Code example: tiny.cc/58dwlz
29/
Run Report Example
30/
Problem
Different people like to store data in different places, e.g.:
- Mounted volumes
- Centralized “data” directory
- Repo-specific “data” directory
31/
Unsatisfactory Solution
Be opinionated.
32/
Unsatisfactory Solution
Put path parameters everywhere.
33/
Better Solution: Use an environment variable!
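For example, resolve every path relative to a single environment variable. (The variable name and directory layout below are just an illustration, not the project’s actual convention.)

import os
from pathlib import Path

# Each person points this at wherever they keep data: a mounted volume,
# a central data directory, or a directory inside the repo.
DATA_DIR = Path(os.environ.get("AUTOFOCUS_DATA_DIR", Path.home() / "data" / "autofocus"))
RAW_DIR = DATA_DIR / "images_raw"
PROCESSED_DIR = DATA_DIR / "images_processed"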
34/
Additional Tips
Use Parquet or Feather rather than CSV to preserve column types, so you don’t have to write code like this:
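For instance, with a CSV you end up re-coercing types on every load, while Parquet round-trips them. (Column names here are illustrative; Parquet support requires pyarrow or fastparquet.)

import pandas as pd

# CSV: column types are lost, so every load needs repair code like this.
df = pd.read_csv("labels.csv", parse_dates=["date"], dtype={"location_id": str})
df["has_animal"] = df["has_animal"].astype(bool)

# Parquet: types survive the round trip, so none of that is needed.
df.to_parquet("labels.parquet")
df = pd.read_parquet("labels.parquet")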
35/
Additional Tips
Install s3fs to enable reading from and writing to S3 directly:
pd.read_parquet('s3://autofocus/lpz_data/labels_2012_2016_2017.parquet')
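Writing works the same way once s3fs (plus a Parquet engine such as pyarrow) is installed; the bucket and key below are just an example:

df.to_parquet('s3://my-bucket/processed/labels.parquet')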
2 Code
37/
Problem
Version control systems are designed for code-centric projects with a
central branch that evolves linearly.
38/
Problem
Machine learning projects aren’t like that.
- The central elements are data and experiments, not code.
- Code written for one experiment generally doesn’t replace code
written for previous experiments.
39/
Solution
Step 1: Separate libraries and apps from the model development workspace.
(Diagram: the Autofocus project, with Creevey as a library, Autofocus-Predict as an app, and Autofocus-Train as a workspace.)
40/
Libraries
- Packages that you install and import.
- Fit well into the GitHub paradigm.
- Useful across projects.
- Ideally coherent, but you might need a grab-bag or two.
41/
Apps
- Code that you run directly.
- Example: a Flask app that serves model predictions.
- Fit well into the GitHub paradigm.
Example prediction output:
{
"bird": 6.235780460883689e-07,
"cat": 9.127776934292342e-07,
"coyote": 2.1184381694183685e-05,
"deer": 3.6601684314518934e-06,
"dog": 1.4745426142326323e-06,
"empty": 0.0026697132270783186,
"human": 1.064212392520858e-05,
"mouse": 4.847318102463305e-09,
"opossum": 9.763967682374641e-05,
"raccoon": 0.9986177682876587,
"rat": 4.3888848111350853e-10,
"squirrel": 1.2888597211713204e-06,
"unknown": 0.0004612557531800121,
}
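A minimal sketch of such an app, assuming Flask; the route, field name, and placeholder prediction function are illustrative rather than the actual Autofocus-Predict code:

from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_proba(image_bytes):
    # Placeholder: a real app would preprocess the image and run the trained model.
    return {"raccoon": 0.999, "empty": 0.001}


@app.route("/predict", methods=["POST"])
def predict():
    image_bytes = request.files["image"].read()
    return jsonify(predict_proba(image_bytes))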
42/
Options
- Put each library, app, and workspace in a separate repo.
- Put closely related code into a single repo, separating these
components at the directory level.
43/
Solution
Step 2a: If possible, use a platform such as Kaggle that follows a more domain-appropriate paradigm.
44/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Put simple instructions for retrieving data front and center in the README. Provide at least three versions of the data:
- Raw
- Lightly processed (e.g. trimmed and resized)
- Sample lightly processed (<1 GB)
Example: https://github.com/uptake/autofocus/tree/v3#getting-the-data
45/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Put the code you used for lightly processing the data in a single directory.
Example: https://github.com/uptake/autofocus/blob/master/autofocus/build_dataset/lpz_2016_2017/process_raw.py
46/
Solution
Step 2b: If necessary, adapt version control systems to the domain.
Create a directory for model training. Give each set of experiments its own self-contained subdirectory with its own README, requirements.txt, Dockerfile, etc.
Example: https://github.com/uptake/autofocus/tree/v3/autofocus/train_model
3 Dependencies
48/
Problem
Pinning versions for all requirements is necessary for ensuring that runs are reproducible, but it makes updating hard and causes you to miss out on upgrades (including security patches!).
49/
Solution
Use `pip-tools` to get the best of both worlds.
You write setup.py (for libraries) or requirements.in (for apps), specifying only your direct dependencies with as little version pinning as possible:
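For example, an app’s requirements.in might contain nothing more than this (package names are placeholders for whatever the project actually imports):

# requirements.in: direct dependencies only, pinned as loosely as possible
pandas
requests>=2.20
torch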
50/
Solution
`pip-compile` generates a requirements.txt file that pins a complete
and consistent set of exact versions for all direct and indirect
dependencies and notes the sources of indirect dependencies.
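Running `pip-compile requirements.in` then yields something along these lines (the exact header and version numbers will differ; these are illustrative):

#
# This file is autogenerated by pip-compile
#
certifi==2019.6.16        # via requests
chardet==3.0.4            # via requests
idna==2.8                 # via requests
numpy==1.17.0             # via pandas, torch
pandas==0.25.0
python-dateutil==2.8.0    # via pandas
pytz==2019.1              # via pandas
requests==2.22.0
six==1.12.0               # via python-dateutil
torch==1.2.0
urllib3==1.25.3           # via requests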
51/
Solution
`pip-sync` ensures that you have installed exactly the libraries in a
requirements.txt file, e.g. uninstalling other packages that are already
in the environment.
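Usage is a single command, run inside a dedicated virtual environment (since it will uninstall anything not listed):

pip-sync requirements.txt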
52/
`pip-tools` Tips
Configure a pre-push hook to confirm that the output of `pip-compile` matches requirements.txt if requirements.in has changed (sketch below).
- Pre-push rather than pre-commit because it takes several seconds.
- Only when `requirements.in` has changed because every update risks introducing breaking changes.
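One way to wire that up is a small script installed as `.git/hooks/pre-push`. This is a rough sketch (it assumes requirements.in and requirements.txt sit at the repo root and that you compare against origin/master), not the exact hook we use:

#!/usr/bin/env python
"""Rough pre-push hook sketch: block the push if requirements.in changed
but requirements.txt was not regenerated."""
import subprocess
import sys
import tempfile


def changed_vs_remote(path):
    # Did this file change relative to the upstream branch? (Assumes origin/master.)
    diff = subprocess.run(
        ["git", "diff", "--name-only", "origin/master...HEAD", "--", path],
        capture_output=True, text=True, check=True,
    )
    return bool(diff.stdout.strip())


def pinned_lines(text):
    # Compare only the pinned requirement lines; skip autogenerated header comments.
    return [line for line in text.splitlines()
            if line.strip() and not line.strip().startswith("#")]


def main():
    if not changed_vs_remote("requirements.in"):
        return 0  # nothing to check; pushing is fine
    with tempfile.NamedTemporaryFile(suffix=".txt") as tmp:
        subprocess.run(
            ["pip-compile", "requirements.in", "--output-file", tmp.name],
            check=True, capture_output=True,
        )
        fresh = open(tmp.name).read()
    current = open("requirements.txt").read()
    if pinned_lines(fresh) != pinned_lines(current):
        print("requirements.txt is stale; run pip-compile and commit the result.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())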
53/
`pip-tools` Tips
If you hit unresolvable dependency conflicts, fall back to specifying abstract dependencies and writing the output of `pip freeze` to a file.
54/
Other Dependency Management Tips
Use `nvidia-docker` with a GPU.
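With nvidia-docker 2, that typically means launching containers with `docker run --runtime=nvidia ...`; Docker 19.03 and later can instead expose GPUs natively with `docker run --gpus all ...`. Check which combination your Docker and driver versions support.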
55/
Other Dependency Management Tips
- Pin exact versions of PyTorch or TensorFlow to ensure that you will be able to load an exported model.
- Install big dependencies early in a Docker build to take advantage of layer caching: for example, copy and install requirements.txt before copying the rest of your code, so that code-only changes do not invalidate the cached dependency layer.
4 Experiments
57/
Problem
It’s hard to keep track of what you have tried and what the results
were.
58/
Solution
Use a tool like MLflow Tracking to log hyperparameters and metrics
automatically.
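A minimal sketch with MLflow’s Python API (the parameter and metric names, and the placeholder loss, are just examples):

import mlflow

with mlflow.start_run():
    # Log hyperparameters once, then metrics as training proceeds.
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_param("batch_size", 64)
    for epoch in range(10):
        train_loss = 1.0 / (epoch + 1)  # placeholder for your real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)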
59/
Solution
Record additional observations in a “lab notebook.”
Summary
61/
Summary
- Use `creevey` to process large collections of files.
- Organize machine learning projects appropriately.
⊿ Separate libraries and apps from model development materials.
⊿ Organize model development materials around data and experiments rather than code.
⊿ Create separate workspaces for different versions of the model
development code.
- Use `pip-tools` to manage dependencies.
- Use lab notebooks and something like MLflow to track
experiments.
63/
Tonks
ShopRunner will soon be releasing an open-source library for training
models with multiple inputs and/or outputs.
Example (product description and attributes):
Sail to Sable sweater featuring tiered, ruffle sleeves. Crew neckline. Long sleeves.
Pattern: Solid
Sleeve length: Long
64/
Credits
Thanks to the ShopRunner Data Science Team for feedback and to our resident memologist Nathan Cooper Jones for his invaluable contributions.
65/
Contact
- Name: Greg Gandenberger
- Company: ShopRunner
- Email: greg@gandenberger.org
- Website: gandenberger.org
- Twitter: @ggandenberger
- GitHub: gsganden
- Autofocus: https://github.com/uptake/autofocus/
- Creevey:
⊿ https://github.com/ShopRunner/creevey
⊿ https://pypi.org/project/creevey/
