SlideShare a Scribd company logo
1 of 106
Download to read offline
Full Stack Deep Learning
Josh Tobin, Sergey Karayev, Pieter Abbeel
Infrastructure & Tooling
Full Stack Deep Learning
Monitor predictions and close data
flywheel loop
Test and deploy model
Run experiments, storing results
Provision compute resources
Write and debug model code
2
Provide data
Get optimal prediction system

as scalable API or mobile app
Dream Reality
Aggregate, clean, store, label,

and version data
Infrastructure & Tooling - Overview
Full Stack Deep Learning
• Provide some labeled input-output pairs

• Press a button

• A prediction system that gets 100% accuracy on your data distribution is
live (as an infinitely-scalable API, or running embedded or mobile)
3
SE4ML: Software Engineering for Machine Learning (NIPS 2014
Infrastructure & Tooling - Overview
Full Stack Deep Learning 4Infrastructure & Tooling - Overview
Full Stack Deep Learning 5
Goal: add data, see model improve
Andrej Karpathy at PyTorch Devcon 2019 - https://www.youtube.com/watch?v=oBklltKXtDE
Infrastructure & Tooling - Overview
Full Stack Deep Learning
“All-in-one”
6
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Infrastructure & Tooling - Overview
Web
Versioning
Processing
CI / Testing
?
Resource Management
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Overview
Full Stack Deep Learning
Questions?
8
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Programming Language
• Python, because of the libraries

• Clear winner in scientific and data computing
10
Full Stack Deep Learning
Editors
11
VS CodeVim
PyCharmEmacs
Jupyter
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Visual Studio Code
• VS Code makes for a very nice Python
experience
12
Built-in git staging and diffing
Open whole projects remotely
Peek documentation
Lint code as your write
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Linters and Type Hints
• Whatever code style rules can be codified,
should be

• Static analysis can catch some bugs

• Static type checking both documents code
and catches bugs

• Will see in Lab 7
13Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Jupyter Notebooks
• Notebooks have become fundamental to
data science

• Great as the "first draft" of a project

• Jeremy Howard from fast.ai good to 

learn from (course.fast.ai videos)

• Difficult to make scalable, reproducible,
well-tested

• Counter-point: Netflix based all ML
workflows on them
14Infrastructure & Tooling - Software Engineering
https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
Full Stack Deep Learning
Problems with notebooks
• Hard to version

• Notebook "IDE" is primitive

• Very hard to test

• Out-of-order execution artifacts

• Hard to run long or distributed tasks
15
https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Streamlit
• New, but great at fulfilling a common ML need: interactive applets

• Decorate normal Python code

• Smart data caching, quick re-rendering

• In the works: sharing as easy as pushing a web app to Heroku
16
https://streamlit.io
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning
Questions?
17
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Compute
Full Stack Deep Learning
19
Development Training/Evaluation
• Function
• Writing code

• Debugging models

• Looking at results

• Desiderata
• Quickly compile models and run training

• Nice-to-have: use GUI

• Solutions
• Desktop with 1-4 GPUs

• Cloud instance with 1-4 GPUs
• Function
• Model architecture / hyperparam search

• Training large models

• Desiderata
• Easy to launch experiments and review
results

• Solutions
• Desktop with 4 GPUs

• Private cluster of GPU machines

• Cloud cluster of GPU instances
or
or
Infrastructure & Tooling - Compute
Compute needs
Full Stack Deep Learning
Why compute matters
20
https://openai.com/blog/ai-and-compute/
Infrastructure & Tooling - Compute
Full Stack Deep Learning
But creativity matters, too!
21
https://openai.com/blog/ai-and-compute/
Infrastructure & Tooling - Compute
https://www.fast.ai/2018/04/30/dawnbench-fastai/
Full Stack Deep Learning
So, or ?
• GPU Basics

• Cloud Options

• On-prem Options

• Analysis and Recommendations
22Infrastructure & Tooling - Compute
Full Stack Deep Learning
GPU Basics
• NVIDIA has been the only game in town

• Google TPUs are the fastest current option (on GCP)

• Intel Nervana NNPs and AMD could start making sense in the future
23Infrastructure & Tooling - Compute
Full Stack Deep Learning
24
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
GPU Comparison Table
Infrastructure & Tooling - Compute
https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
Full Stack Deep Learning
25
GPU Comparison Table
Infrastructure & Tooling - Compute
• New NVIDIA architecture every year

• Kepler —> Maxwell —> Pascal —> Volta -> Turing

• Server version first, then “enthusiast”, then consumer.

• For business, only supposed to use server cards.
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
26
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model

• Most important for recurrent models
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
27
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model

• Most important for recurrent models

• 32bit vs Tensor TFlops
• Tensor Cores are specifically for deep learning operations (mixed precision)

• Good for convolutional/transformer models
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
28
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model

• Most important for recurrent models

• 32bit vs Tensor TFlops
• Tensor Cores are specifically for deep learning operations (mixed precision)

• Good for convolutional/transformer models

• Straight 16bit is a bit less good but still better than 32bit

• roughly 2x flops and 1.5x memory
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
29
Kepler/Maxwell
• 2-4x slower than Pascal/Volta

• Hardware: don’t buy, too old

• Cloud: K80's are cheap (providers are stuck with what they bought)
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
30
Pascal/Volta
• Hardware: 1080 Ti still good if buying used, especially for recurrent

• Cloud: P100 is a mid-range option
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
31
Turing
• Preferred choice right now due to 16bit mixed precision support

• Hardware: 

• 2080 Ti is ~1.3x as fast as 1080 Ti in 32bit, but ~2x faster in 16bit

• Titan RTX is 10-20% faster yet. Titan V is just as good (but less RAM), if find used.

• Cloud: V100 is the ultimate for speed
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No used

used
AWS, GCP, MS
Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used

usedP100 2016H1 Pascal Server 16 10 N/A Yes used

used
GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used

usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS
Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used

used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500
RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
Full Stack Deep Learning
32
https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/
Infrastructure & Tooling - Compute
Full Stack Deep Learning
Cloud Providers
• Amazon Web Services, Google Cloud Platform, Microsoft Azure are the
heavyweights.

• Heavyweights are largely similar in function and price.

• AWS most expensive.

• GCP lets you connect GPUs to any instance, and has TPUs.

• Azure reportedly has bad user experience

• Startups are Paperspace, Lambda Labs
33Infrastructure & Tooling - Compute
Full Stack Deep Learning
34
Amazon Web Services
Name GPU GPUs GPU RAM vCPU RAM On-demand Spot
p3.2xlarge V100 1 16 8 61 $3.06 $1.05
p3.8xlarge V100 4 64 32 244 $12.24 $4.20
p3.16xlarge V100 8 128 64 488 $24.48 $9.64
p2.xlarge K80 1 die 12 4 61 $0.90 $0.43
p2.8xlarge K80 8 dies 96 32 488 $7.20 $3.40
p2.16xlarge K80 16 dies 192 64 768 $14.40 $6.80
Infrastructure & Tooling - Compute
Full Stack Deep Learning
35
Amazon Web Services
Name GPU GPUs GPU RAM vCPU RAM On-demand Spot
p3.2xlarge V100 1 16 8 61 $3.06 $0.95
p3.8xlarge V100 4 64 32 244 $12.24 $4.00
p3.16xlarge V100 8 128 64 488 $24.48 $7.30
p2.xlarge K80 1 die 12 4 61 $0.90 $0.27
p2.8xlarge K80 8 dies 96 32 488 $7.20 $2.16
p2.16xlarge K80 16 dies 192 64 768 $14.40 $4.32
• For a big (50-80%) discount, can get instances that can terminate at any time.

• Makes sense for running hyperparam search experiments, but need
infrastructure to handle failures.
Infrastructure & Tooling - Compute
Full Stack Deep Learning
36
Google Cloud Platform
GPU GPUs GPU RAM On-demand Spot
V100 1-8 16 $2.48 $0.74
P100 1-4 64 $1.46 $0.43
K80 1-8 12 $0.45 $0.14
• GPUs can be attached to any instance.

• In general, a little cheaper than AWS.

• Also has “Tensor Processing Units” (TPUs), the fastest option today.
Infrastructure & Tooling - Compute
Full Stack Deep Learning
37
Microsoft Azure
Name GPU GPUs GPU RAM vCPU RAM On-demand
`nc6 v3` V100 1 16 6 112 $3.06
`nc12 v3` V100 2 32 12 224 $6.12
`nc24 v3` V100 4 64 24 448 $12.24
`nc6 v2` P100 1 16 6 112 $2.07
`nc12 v2` P100 2 32 12 224 $4.14
`nc24 v2` P100 4 64 24 446 $8.28
`nc6` K80 1 24 4 56 $0.90
`nc12` K80 2 48 32 112 $7.20
`nc24` K80 4 96 64 224 $14.40
https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/
• Pretty much equivalent to AWS

• Spot instances are not as well developed as in AWS or GCP.
Infrastructure & Tooling - Compute
Full Stack Deep Learning
On-prem Options
• Build your own

• Up to 4 GPUs is easy and quiet

• Buy pre-built

• Lambda Labs, NVIDIA, and builders like Supermicro, Cirrascale, etc.
38Infrastructure & Tooling - Compute
Full Stack Deep Learning
Building your own
• Quiet PC with 64GB RAM and 4x
RTX 2080 Ti’s: $7000

• One day to build and set up

• Going beyond 4 GPUs is painful

• All you need to know: http://
timdettmers.com/2018/12/16/
deep-learning-hardware-guide/
39Infrastructure & Tooling - Compute
Full Stack Deep Learning
Pre-built: NVIDIA
• NVIDIA DGX Station

• 4x V100, 20-core CPU, 256GB
RAM

• $50,000
40Infrastructure & Tooling - Compute
Full Stack Deep Learning
Pre-built: Lambda Labs
41
RTX 6000 ==Titan RTX
with better cooling
Infrastructure & Tooling - Compute
~20% more expensive

than building yourself
Full Stack Deep Learning
Pre-built: Lambda Labs
42Infrastructure & Tooling - Compute
Full Stack Deep Learning
Cost Analysis
• Let’s first compare on-prem and equivalent cloud machines

• Then let’s also consider spot instances for experiment scaling
43Infrastructure & Tooling - Compute
Full Stack Deep Learning
Quad PC vs Quad Cloud
Verdict: not worth it. PC pays for itself in 5-10 weeks.
44
GPU Arch RAM Build Price Cloud Price
Hours =
Build
24/7 Weeks =
Build
16/5 Weeks =
Build
4x RTX
2080 Ti
Volta 12 $10000.00 0
4x V100 Volta 16 $12.00 833 5 10
Full-time workload
Work week load
Infrastructure & Tooling - Compute
Full Stack Deep Learning
Quad PC vs Quad Cloud
45
https://l7.curtisnorthcutt.com/build-pro-deep-learning-workstation
Full Stack Deep Learning
Quad PC vs. Spot Instances
How to think about it: cloud enables quicker experiments.
46
hours
hours
Length of trial in experiment (hours) 6
Number of trials in experiment 16
Total GPU hours for experiment 96
Cost of 4x RTX 2080 Ti machine $10,000.00
Time to run experiment on 4x machine 24
Time to run experiment on V100 spot instances 6
Cost of provisioning enough pre-emptible V100s $96.00
Number of experiments that equal cost of 4x
machine
104
Infrastructure & Tooling - Compute
Full Stack Deep Learning
Quad PC vs. Spot Instances
But at a pretty steep price.
47
hours
hours
Length of trial in experiment (hours) 6
Number of trials in experiment 16
Total GPU hours for experiment 96
Cost of 4x RTX 2080 Ti machine $10,000.00
Time to run experiment on 4x machine 24
Time to run experiment on V100 spot instances 6
Cost of provisioning enough pre-emptible V100s $96.00
Number of experiments that equal cost of 4x
machine
104
Infrastructure & Tooling - Compute
Full Stack Deep Learning
In Practice
• Even though cloud is expensive, it's hard to make on-prem scale past a
certain point

• Dev-ops (declarative infra, repeatable processe) definitely easier in the
cloud
48Infrastructure & Tooling - Compute
Full Stack Deep Learning
Recommendation for solo/startup
• Development

• Build or Buy a 4x Turing-architecture PC

• Training/Evaluation

• Use the same 4x GPU PC until architecture is dialed in
• When running many experiments, either buy shared server machines or
use cloud instances.
49Infrastructure & Tooling - Compute
Full Stack Deep Learning
Recommendation for larger company
• Development

• Buy a 4x Turing-architecture PC per ML scientist

• Or, let them use V100 instances

• Training/Evaluation
• Use cloud instances with proper provisioning and handling of failures
50Infrastructure & Tooling - Compute
Full Stack Deep Learning
Questions?
51
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning
53
Resource Management
• Function
• Multiple people…

• using multiple GPUs/machines…

• running different environments

• Goal
• Easy to launch a batch of experiments, with proper dependencies and resource allocations

• Solutions
• Spreadsheet

• Python scripts

• SLURM

• Docker + Kubernetes

• Software specialized for ML use cases
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning
Spreadsheets
• People “reserve” what
resources they need to use

• Awful, but still surprisingly
common
54Infrastructure & Tooling - Resource Management
Full Stack Deep Learning 55
• Problem we’re solving: allocate free
resources to programs

• Can be scripted pretty easily (see lab)

• Even better, use old-school cluster job
scheduler

• Job defines necessary resources, gets
queued
Scripts or
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning 56
• Docker is a way to package up an entire
dependency stack in a lighter-than-a-VM
package

• We will talk more about Docker in the
Deployment lecture tomorrow, and use it
in lab
Docker + Kubernetes
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning 57
• Kubernetes is a way to run many docker
containers on top of a cluster

• (Is actually what you are doing the labs
in: https://github.com/jupyterhub/zero-
to-jupyterhub-k8s)
Docker + Kubernetes
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning
• Open source project from Google

• Spawn and manage Jupyter
notebooks

• Manage multi-step ML workflows

• Plug-ins for hyperparameter tuning,
model deployment
58
https://github.com/Langhalsdino/Kubernetes-GPU-Guide
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning 59
Open Source project

with paid features
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning
Questions?
60
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Frameworks
Full Stack Deep Learning
62
Deep Learning Frameworks
• Unless you have a good reason not to, use
Tensorflow/Keras or PyTorch

• Both are converging to the same ideal point:

• easy development via define-by-run

• multi-platform optimized execution graph

• Today, most new projects use PyTorch

• For most problems, fast.ai library worth starting
with for immediate best practices
Goodforproduction
Good for development
Tensorflow 2.0
+ eager execution
+ TorchScript
PyTorch 1.0
Infrastructure & Tooling - Frameworks
2013
2015
2015
2017
2018
Full Stack Deep Learning
PyTorch dominates new development
63
https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/
Infrastructure & Tooling - Frameworks
Full Stack Deep Learning
64
Tensorflow/Keras still lead job posts
Google searches (2018) Job posts (2018)
https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a
Infrastructure & Tooling - Frameworks
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Distributed Training
• Using multiple GPUs and/or machines to train a single model.

• More complex than simply running different experiments on different
GPUs

• Makes sense to reduce iteration time, especially on big datasets and large
models
66Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Data Parallelism
• If iteration time is too long, try training in data
parallel regime

• "For convolution, expect 1.9x/3.5x speedup
for 2/4 GPUs.

http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
67
http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf
https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/
This is average over

different convnets
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Model Parallelism
68
http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf
• Model parallelism is necessary when
model does not fit on a single GPU

• Introduces a lot of complexity and is
usually not worth it

• Better to buy the largest GPU you can
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Data-parallel Tensorflow
• Can be quite easy:
69Infrastructure & Tooling - Distributed Training
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, 3, activation='relu',
input_shape=(28, 28, 1)),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
optimizer=tf.keras.optimizers.Adam(),
Full Stack Deep Learning
Data-parallel PyTorch
• Can be quite easy:
70Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Data-parallel PyTorch
• If want to spread over multiple machines, or make it work with model-parallel,
gets more complicated
71Infrastructure & Tooling - Distributed Training
In training code In consoles of the workers
Full Stack Deep Learning
Ray
• Ray is open-source project
for effortless, stateful
distributed computing in
Python
72
https://ray.readthedocs.io/en/latest/walkthrough.html
Full Stack Deep Learning
Horovod
• Distributed training framework for Tensorflow, Keras, and PyTorch

• Uses MPI (standard multi-process communication framework) instead of
Tensorflow parameter servers

• Could be an easier experience for multi-node training
73Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning
Questions?
74
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning
Experiment Management
• Even running one experiment at a time, can lose track of which code,
parameters, and dataset generated which trained model.

• When running multiple experiments, problem is much worse.
76Infrastructure & Tooling - Experiment Management
https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
Full Stack Deep Learning
77
Tensorboard
• A fine solution for single
experiments

• Gets unwieldy to manage many
experiments, and to properly
store past work
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning
78
Losswise
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning
79
Comet.ml
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning
80
Weights & Biases
• What we use in Lab 3
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning
81
MLFlow tracking
Infrastructure & Tooling - Experiment Management
https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
• Self-hosted solution
from DataBricks
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning
Hyperparameter Optimization
• Useful to have software that helps you search over hyper parameter
settings.

• Could be as simple as being able to provide `—lr=(0.0001, 0.1) --
num_layers=[128, 256, 512]` to training script
83Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning 84Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning 85Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning
Ray - Tune
• "Choose among scalable SOTA
algorithms such as Population
Based Training (PBT), Vizier’s
Median Stopping Rule,
HyperBand/ASHA."

• These redirect compute resources
toward promising areas of search
space
86Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning 87
Can also visually inspect hyperparam results
Infrastructure & Tooling - Hyperparameter Tuning
• We will also see a solution from
W&B in Lab 4
Full Stack Deep Learning
Questions?
88
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning
All-in-one Solutions
• Single system for everything

• development (hosted notebook)

• scaling experiments to many machines (sometimes even provisioning)

• tracking experiments and versioning models

• deploying models

• monitoring performance
90Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 91Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 92Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 93Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 94Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 95Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 96
https://www.slideshare.net/AmazonWebServices/build-train-and-deploy-machine-learning-models-at-scale
Infrastructure & Tooling - All-in-one
40% markup over corresponding
EC2 instances
Full Stack Deep Learning 97Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 98Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 99Infrastructure & Tooling - All-in-one
Full Stack Deep Learning 100Infrastructure & Tooling - All-in-one
• HyperBand based
hyperparameter tuning

• In-house distributed training
module
Full Stack Deep Learning
Domino Data Lab
101
Provision compute
Track experiments
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning
102
Domino Data Lab
Deploy REST API
Monitor predictions
Publish applets
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning
103
Domino Data Lab
Monitor spend
All projects in one place
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning
104
Neptune
Paperspace
Gradient
W&B Comet.ml Floyd Algorithmia Determined.ai
Amazon
SageMaker
GC ML
Engine
Domino Data
Lab
Hardware
GCP or
agnostic
Paperspace N/A N/A GCP N/A Agnostic AWS GCP Agnostic
Resource Management Yes Yes No No Yes No
Yes*

(but not
provisioning)
Yes Yes Yes
Hyperparam
Optimization
Yes No Yes Yes No No Yes Yes Yes Yes
Storing Models Yes Yes Yes Yes Yes No Yes Yes Yes Yes
Reviewing Experiments Yes Yes Yes Yes Yes No Yes No Yes Yes
Deploying Models as
REST API
No No No No Yes Yes Yes Yes Yes Yes
Monitoring No No No No No Yes No Yes Yes Yes
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning
Questions?
105
Full Stack Deep Learning
Thank you!
106

More Related Content

What's hot

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...Bill Liu
 
Accelerate Your AI Today
Accelerate Your AI TodayAccelerate Your AI Today
Accelerate Your AI TodayDESMOND YUEN
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsGianmario Spacagna
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in productionStepan Pushkarev
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Greg Makowski
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Databricks
 
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Databricks
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdDatabricks
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...MLconf
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning InfrastructureSigOpt
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDatabricks
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Databricks
 

What's hot (20)

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
AISF19 - Building Scalable, Kubernetes-Native ML/AI Pipelines with TFX, KubeF...
 
Ai use cases
Ai use casesAi use cases
Ai use cases
 
Accelerate Your AI Today
Accelerate Your AI TodayAccelerate Your AI Today
Accelerate Your AI Today
 
Managers guide to effective building of machine learning products
Managers guide to effective building of machine learning productsManagers guide to effective building of machine learning products
Managers guide to effective building of machine learning products
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in production
 
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...Predictive Model and Record Description with Segmented Sensitivity Analysis (...
Predictive Model and Record Description with Segmented Sensitivity Analysis (...
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
Deploying and Monitoring Heterogeneous Machine Learning Applications with Cli...
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
COBOL to Apache Spark
COBOL to Apache SparkCOBOL to Apache Spark
COBOL to Apache Spark
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Tensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with HummingbirdTensors Are All You Need: Faster Inference with Hummingbird
Tensors Are All You Need: Faster Inference with Hummingbird
 
Agile Deep Learning
Agile Deep LearningAgile Deep Learning
Agile Deep Learning
 
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
Jean-François Puget, Distinguished Engineer, Machine Learning and Optimizatio...
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Drifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in ProductionDrifting Away: Testing ML Models in Production
Drifting Away: Testing ML Models in Production
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
 

Similar to Infrastructure and Tooling - Full Stack Deep Learning

Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...E-Commerce Brasil
 
S5429_LanceBrown
S5429_LanceBrownS5429_LanceBrown
S5429_LanceBrownLance Brown
 
GPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkGPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkRed Hat Developers
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraJim Dowling
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Lablup Inc.
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusDatabricks
 
Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesKTN
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil GovindanNewton Alex
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...Equnix Business Solutions
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platformst_ivanov
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementGanesan Narayanasamy
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLMohammad Sabouri
 

Similar to Infrastructure and Tooling - Full Stack Deep Learning (20)

Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a...
 
S5429_LanceBrown
S5429_LanceBrownS5429_LanceBrown
S5429_LanceBrown
 
GPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech TalkGPU enablement for data science on OpenShift | DevNation Tech Talk
GPU enablement for data science on OpenShift | DevNation Tech Talk
 
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa ClaraScaling TensorFlow with Hops, Global AI Conference Santa Clara
Scaling TensorFlow with Hops, Global AI Conference Santa Clara
 
06 u 2
06 u 206 u 2
06 u 2
 
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)Backend.AI Technical Introduction (19.09 / 2019 Autumn)
Backend.AI Technical Introduction (19.09 / 2019 Autumn)
 
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using PrometheusMonitoring of GPU Usage with Tensorflow Models Using Prometheus
Monitoring of GPU Usage with Tensorflow Models Using Prometheus
 
Implementing AI: High Performace Architectures
Implementing AI: High Performace ArchitecturesImplementing AI: High Performace Architectures
Implementing AI: High Performace Architectures
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
[Hadoop Meetup] Tensorflow on Apache Hadoop YARN - Sunil Govindan
 
The GPGPU Continuum
The GPGPU ContinuumThe GPGPU Continuum
The GPGPU Continuum
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
PGConf.ASIA 2019 Bali - Full-throttle Running on Terabytes Log-data - Kohei K...
 
Lessons Learned on Benchmarking Big Data Platforms
Lessons Learned on Benchmarking  Big Data PlatformsLessons Learned on Benchmarking  Big Data Platforms
Lessons Learned on Benchmarking Big Data Platforms
 
CAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablementCAPI and OpenCAPI Hardware acceleration enablement
CAPI and OpenCAPI Hardware acceleration enablement
 
Labview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRLLabview1_ Computer Applications in Control_ACRRL
Labview1_ Computer Applications in Control_ACRRL
 
PyData Boston 2013
PyData Boston 2013PyData Boston 2013
PyData Boston 2013
 

More from Sergey Karayev

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019Sergey Karayev
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Sergey Karayev
 

More from Sergey Karayev (13)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Recently uploaded

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 

Recently uploaded (20)

Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 

Infrastructure and Tooling - Full Stack Deep Learning

  • 1. Full Stack Deep Learning Josh Tobin, Sergey Karayev, Pieter Abbeel Infrastructure & Tooling
  • 2. Full Stack Deep Learning Monitor predictions and close data flywheel loop Test and deploy model Run experiments, storing results Provision compute resources Write and debug model code 2 Provide data Get optimal prediction system
 as scalable API or mobile app Dream Reality Aggregate, clean, store, label,
 and version data Infrastructure & Tooling - Overview
  • 3. Full Stack Deep Learning • Provide some labeled input-output pairs • Press a button • A prediction system that gets 100% accuracy on your data distribution is live (as an infinitely-scalable API, or running embedded or mobile) 3 SE4ML: Software Engineering for Machine Learning (NIPS 2014 Infrastructure & Tooling - Overview
  • 4. Full Stack Deep Learning 4Infrastructure & Tooling - Overview
  • 5. Full Stack Deep Learning 5 Goal: add data, see model improve Andrej Karpathy at PyTorch Devcon 2019 - https://www.youtube.com/watch?v=oBklltKXtDE Infrastructure & Tooling - Overview
  • 6. Full Stack Deep Learning “All-in-one” 6 Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Infrastructure & Tooling - Overview Web Versioning Processing CI / Testing ? Resource Management
  • 7. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Overview
  • 8. Full Stack Deep Learning Questions? 8
  • 9. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Software Engineering
  • 10. Full Stack Deep Learning Programming Language • Python, because of the libraries • Clear winner in scientific and data computing 10
  • 11. Full Stack Deep Learning Editors 11 VS CodeVim PyCharmEmacs Jupyter Infrastructure & Tooling - Software Engineering
  • 12. Full Stack Deep Learning Visual Studio Code • VS Code makes for a very nice Python experience 12 Built-in git staging and diffing Open whole projects remotely Peek documentation Lint code as your write Infrastructure & Tooling - Software Engineering
  • 13. Full Stack Deep Learning Linters and Type Hints • Whatever code style rules can be codified, should be • Static analysis can catch some bugs • Static type checking both documents code and catches bugs • Will see in Lab 7 13Infrastructure & Tooling - Software Engineering
  • 14. Full Stack Deep Learning Jupyter Notebooks • Notebooks have become fundamental to data science • Great as the "first draft" of a project • Jeremy Howard from fast.ai good to 
 learn from (course.fast.ai videos) • Difficult to make scalable, reproducible, well-tested • Counter-point: Netflix based all ML workflows on them 14Infrastructure & Tooling - Software Engineering https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
  • 15. Full Stack Deep Learning Problems with notebooks • Hard to version • Notebook "IDE" is primitive • Very hard to test • Out-of-order execution artifacts • Hard to run long or distributed tasks 15 https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086 Infrastructure & Tooling - Software Engineering
  • 16. Full Stack Deep Learning Streamlit • New, but great at fulfilling a common ML need: interactive applets • Decorate normal Python code • Smart data caching, quick re-rendering • In the works: sharing as easy as pushing a web app to Heroku 16 https://streamlit.io Infrastructure & Tooling - Software Engineering
  • 17. Full Stack Deep Learning Questions? 17
  • 18. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Compute
  • 19. Full Stack Deep Learning 19 Development Training/Evaluation • Function • Writing code • Debugging models • Looking at results • Desiderata • Quickly compile models and run training • Nice-to-have: use GUI • Solutions • Desktop with 1-4 GPUs • Cloud instance with 1-4 GPUs • Function • Model architecture / hyperparam search • Training large models • Desiderata • Easy to launch experiments and review results • Solutions • Desktop with 4 GPUs • Private cluster of GPU machines • Cloud cluster of GPU instances or or Infrastructure & Tooling - Compute Compute needs
  • 20. Full Stack Deep Learning Why compute matters 20 https://openai.com/blog/ai-and-compute/ Infrastructure & Tooling - Compute
  • 21. Full Stack Deep Learning But creativity matters, too! 21 https://openai.com/blog/ai-and-compute/ Infrastructure & Tooling - Compute https://www.fast.ai/2018/04/30/dawnbench-fastai/
  • 22. Full Stack Deep Learning So, or ? • GPU Basics • Cloud Options • On-prem Options • Analysis and Recommendations 22Infrastructure & Tooling - Compute
  • 23. Full Stack Deep Learning GPU Basics • NVIDIA has been the only game in town • Google TPUs are the fastest current option (on GCP) • Intel Nervana NNPs and AMD could start making sense in the future 23Infrastructure & Tooling - Compute
  • 24. Full Stack Deep Learning 24 Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500 GPU Comparison Table Infrastructure & Tooling - Compute https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
  • 25. Full Stack Deep Learning 25 GPU Comparison Table Infrastructure & Tooling - Compute • New NVIDIA architecture every year • Kepler —> Maxwell —> Pascal —> Volta -> Turing • Server version first, then “enthusiast”, then consumer. • For business, only supposed to use server cards. Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 26. Full Stack Deep Learning 26 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model • Most important for recurrent models http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 27. Full Stack Deep Learning 27 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model • Most important for recurrent models • 32bit vs Tensor TFlops • Tensor Cores are specifically for deep learning operations (mixed precision) • Good for convolutional/transformer models http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 28. Full Stack Deep Learning 28 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model • Most important for recurrent models • 32bit vs Tensor TFlops • Tensor Cores are specifically for deep learning operations (mixed precision) • Good for convolutional/transformer models • Straight 16bit is a bit less good but still better than 32bit • roughly 2x flops and 1.5x memory http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 29. Full Stack Deep Learning 29 Kepler/Maxwell • 2-4x slower than Pascal/Volta • Hardware: don’t buy, too old • Cloud: K80's are cheap (providers are stuck with what they bought) Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 30. Full Stack Deep Learning 30 Pascal/Volta • Hardware: 1080 Ti still good if buying used, especially for recurrent • Cloud: P100 is a mid-range option Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 31. Full Stack Deep Learning 31 Turing • Preferred choice right now due to 16bit mixed precision support • Hardware: • 2080 Ti is ~1.3x as fast as 1080 Ti in 32bit, but ~2x faster in 16bit • Titan RTX is 10-20% faster yet. Titan V is just as good (but less RAM), if find used. • Cloud: V100 is the ultimate for speed http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/ Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No used used AWS, GCP, MS Titan X 2015H1 Maxwell Enthusiast 12 6 N/A No used usedP100 2016H1 Pascal Server 16 10 N/A Yes used used GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No used usedV100 2017H1 Volta Server 16 14 120 Yes $10000 AWS, GCP, MS Titan V 2017H2 Volta Enthusiast 12 14 110 Yes used used2080 Ti 2018H2 Turing Consumer 11 13 60 Yes $1000 Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $2500 RTX 8000 2018H2 Turing Enthusiast 48 16 160 Yes $5500
  • 32. Full Stack Deep Learning 32 https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/ Infrastructure & Tooling - Compute
  • 33. Full Stack Deep Learning Cloud Providers • Amazon Web Services, Google Cloud Platform, Microsoft Azure are the heavyweights. • Heavyweights are largely similar in function and price. • AWS most expensive. • GCP lets you connect GPUs to any instance, and has TPUs. • Azure reportedly has bad user experience • Startups are Paperspace, Lambda Labs 33Infrastructure & Tooling - Compute
  • 34. Full Stack Deep Learning 34 Amazon Web Services Name GPU GPUs GPU RAM vCPU RAM On-demand Spot p3.2xlarge V100 1 16 8 61 $3.06 $1.05 p3.8xlarge V100 4 64 32 244 $12.24 $4.20 p3.16xlarge V100 8 128 64 488 $24.48 $9.64 p2.xlarge K80 1 die 12 4 61 $0.90 $0.43 p2.8xlarge K80 8 dies 96 32 488 $7.20 $3.40 p2.16xlarge K80 16 dies 192 64 768 $14.40 $6.80 Infrastructure & Tooling - Compute
  • 35. Full Stack Deep Learning 35 Amazon Web Services Name GPU GPUs GPU RAM vCPU RAM On-demand Spot p3.2xlarge V100 1 16 8 61 $3.06 $0.95 p3.8xlarge V100 4 64 32 244 $12.24 $4.00 p3.16xlarge V100 8 128 64 488 $24.48 $7.30 p2.xlarge K80 1 die 12 4 61 $0.90 $0.27 p2.8xlarge K80 8 dies 96 32 488 $7.20 $2.16 p2.16xlarge K80 16 dies 192 64 768 $14.40 $4.32 • For a big (50-80%) discount, can get instances that can terminate at any time. • Makes sense for running hyperparam search experiments, but need infrastructure to handle failures. Infrastructure & Tooling - Compute
  • 36. Full Stack Deep Learning 36 Google Cloud Platform GPU GPUs GPU RAM On-demand Spot V100 1-8 16 $2.48 $0.74 P100 1-4 64 $1.46 $0.43 K80 1-8 12 $0.45 $0.14 • GPUs can be attached to any instance. • In general, a little cheaper than AWS. • Also has “Tensor Processing Units” (TPUs), the fastest option today. Infrastructure & Tooling - Compute
  • 37. Full Stack Deep Learning 37 Microsoft Azure Name GPU GPUs GPU RAM vCPU RAM On-demand `nc6 v3` V100 1 16 6 112 $3.06 `nc12 v3` V100 2 32 12 224 $6.12 `nc24 v3` V100 4 64 24 448 $12.24 `nc6 v2` P100 1 16 6 112 $2.07 `nc12 v2` P100 2 32 12 224 $4.14 `nc24 v2` P100 4 64 24 446 $8.28 `nc6` K80 1 24 4 56 $0.90 `nc12` K80 2 48 32 112 $7.20 `nc24` K80 4 96 64 224 $14.40 https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/ • Pretty much equivalent to AWS • Spot instances are not as well developed as in AWS or GCP. Infrastructure & Tooling - Compute
  • 38. Full Stack Deep Learning On-prem Options • Build your own • Up to 4 GPUs is easy and quiet • Buy pre-built • Lambda Labs, NVIDIA, and builders like Supermicro, Cirrascale, etc. 38Infrastructure & Tooling - Compute
  • 39. Full Stack Deep Learning Building your own • Quiet PC with 64GB RAM and 4x RTX 2080 Ti’s: $7000 • One day to build and set up • Going beyond 4 GPUs is painful • All you need to know: http:// timdettmers.com/2018/12/16/ deep-learning-hardware-guide/ 39Infrastructure & Tooling - Compute
  • 40. Full Stack Deep Learning Pre-built: NVIDIA • NVIDIA DGX Station • 4x V100, 20-core CPU, 256GB RAM • $50,000 40Infrastructure & Tooling - Compute
  • 41. Full Stack Deep Learning Pre-built: Lambda Labs 41 RTX 6000 ==Titan RTX with better cooling Infrastructure & Tooling - Compute ~20% more expensive
 than building yourself
  • 42. Full Stack Deep Learning Pre-built: Lambda Labs 42Infrastructure & Tooling - Compute
  • 43. Full Stack Deep Learning Cost Analysis • Let’s first compare on-prem and equivalent cloud machines • Then let’s also consider spot instances for experiment scaling 43Infrastructure & Tooling - Compute
  • 44. Full Stack Deep Learning Quad PC vs Quad Cloud Verdict: not worth it. PC pays for itself in 5-10 weeks. 44 GPU Arch RAM Build Price Cloud Price Hours = Build 24/7 Weeks = Build 16/5 Weeks = Build 4x RTX 2080 Ti Volta 12 $10000.00 0 4x V100 Volta 16 $12.00 833 5 10 Full-time workload Work week load Infrastructure & Tooling - Compute
  • 45. Full Stack Deep Learning Quad PC vs Quad Cloud 45 https://l7.curtisnorthcutt.com/build-pro-deep-learning-workstation
  • 46. Full Stack Deep Learning Quad PC vs. Spot Instances How to think about it: cloud enables quicker experiments. 46 hours hours Length of trial in experiment (hours) 6 Number of trials in experiment 16 Total GPU hours for experiment 96 Cost of 4x RTX 2080 Ti machine $10,000.00 Time to run experiment on 4x machine 24 Time to run experiment on V100 spot instances 6 Cost of provisioning enough pre-emptible V100s $96.00 Number of experiments that equal cost of 4x machine 104 Infrastructure & Tooling - Compute
  • 47. Full Stack Deep Learning Quad PC vs. Spot Instances But at a pretty steep price. 47 hours hours Length of trial in experiment (hours) 6 Number of trials in experiment 16 Total GPU hours for experiment 96 Cost of 4x RTX 2080 Ti machine $10,000.00 Time to run experiment on 4x machine 24 Time to run experiment on V100 spot instances 6 Cost of provisioning enough pre-emptible V100s $96.00 Number of experiments that equal cost of 4x machine 104 Infrastructure & Tooling - Compute
  • 48. Full Stack Deep Learning In Practice • Even though cloud is expensive, it's hard to make on-prem scale past a certain point • Dev-ops (declarative infra, repeatable processe) definitely easier in the cloud 48Infrastructure & Tooling - Compute
  • 49. Full Stack Deep Learning Recommendation for solo/startup • Development • Build or Buy a 4x Turing-architecture PC • Training/Evaluation • Use the same 4x GPU PC until architecture is dialed in • When running many experiments, either buy shared server machines or use cloud instances. 49Infrastructure & Tooling - Compute
  • 50. Full Stack Deep Learning Recommendation for larger company • Development • Buy a 4x Turing-architecture PC per ML scientist • Or, let them use V100 instances • Training/Evaluation • Use cloud instances with proper provisioning and handling of failures 50Infrastructure & Tooling - Compute
  • 51. Full Stack Deep Learning Questions? 51
  • 52. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Resource Management
  • 53. Full Stack Deep Learning 53 Resource Management • Function • Multiple people… • using multiple GPUs/machines… • running different environments • Goal • Easy to launch a batch of experiments, with proper dependencies and resource allocations • Solutions • Spreadsheet • Python scripts • SLURM • Docker + Kubernetes • Software specialized for ML use cases Infrastructure & Tooling - Resource Management
  • 54. Full Stack Deep Learning Spreadsheets • People “reserve” what resources they need to use • Awful, but still surprisingly common 54Infrastructure & Tooling - Resource Management
  • 55. Full Stack Deep Learning 55 • Problem we’re solving: allocate free resources to programs • Can be scripted pretty easily (see lab) • Even better, use old-school cluster job scheduler • Job defines necessary resources, gets queued Scripts or Infrastructure & Tooling - Resource Management
  • 56. Full Stack Deep Learning 56 • Docker is a way to package up an entire dependency stack in a lighter-than-a-VM package • We will talk more about Docker in the Deployment lecture tomorrow, and use it in lab Docker + Kubernetes Infrastructure & Tooling - Resource Management
  • 57. Full Stack Deep Learning 57 • Kubernetes is a way to run many docker containers on top of a cluster • (Is actually what you are doing the labs in: https://github.com/jupyterhub/zero- to-jupyterhub-k8s) Docker + Kubernetes Infrastructure & Tooling - Resource Management
  • 58. Full Stack Deep Learning • Open source project from Google • Spawn and manage Jupyter notebooks • Manage multi-step ML workflows • Plug-ins for hyperparameter tuning, model deployment 58 https://github.com/Langhalsdino/Kubernetes-GPU-Guide Infrastructure & Tooling - Resource Management
  • 59. Full Stack Deep Learning 59 Open Source project
 with paid features Infrastructure & Tooling - Resource Management
  • 60. Full Stack Deep Learning Questions? 60
  • 61. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Frameworks
  • 62. Full Stack Deep Learning 62 Deep Learning Frameworks • Unless you have a good reason not to, use Tensorflow/Keras or PyTorch • Both are converging to the same ideal point: • easy development via define-by-run • multi-platform optimized execution graph • Today, most new projects use PyTorch • For most problems, fast.ai library worth starting with for immediate best practices Goodforproduction Good for development Tensorflow 2.0 + eager execution + TorchScript PyTorch 1.0 Infrastructure & Tooling - Frameworks 2013 2015 2015 2017 2018
  • 63. Full Stack Deep Learning PyTorch dominates new development 63 https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/ Infrastructure & Tooling - Frameworks
  • 64. Full Stack Deep Learning 64 Tensorflow/Keras still lead job posts Google searches (2018) Job posts (2018) https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a Infrastructure & Tooling - Frameworks
  • 65. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Distributed Training
  • 66. Full Stack Deep Learning Distributed Training • Using multiple GPUs and/or machines to train a single model. • More complex than simply running different experiments on different GPUs • Makes sense to reduce iteration time, especially on big datasets and large models 66Infrastructure & Tooling - Distributed Training
  • 67. Full Stack Deep Learning Data Parallelism • If iteration time is too long, try training in data parallel regime • "For convolution, expect 1.9x/3.5x speedup for 2/4 GPUs.
 http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ 67 http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/ This is average over
 different convnets Infrastructure & Tooling - Distributed Training
  • 68. Full Stack Deep Learning Model Parallelism 68 http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf • Model parallelism is necessary when model does not fit on a single GPU • Introduces a lot of complexity and is usually not worth it • Better to buy the largest GPU you can Infrastructure & Tooling - Distributed Training
  • 69. Full Stack Deep Learning Data-parallel Tensorflow • Can be quite easy: 69Infrastructure & Tooling - Distributed Training strategy = tf.distribute.MirroredStrategy() with strategy.scope(): model = tf.keras.Sequential([ tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)), tf.keras.layers.MaxPooling2D(), tf.keras.layers.Flatten(), tf.keras.layers.Dense(64, activation='relu'), tf.keras.layers.Dense(10, activation='softmax') ]) model.compile(loss='sparse_categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(),
  • 70. Full Stack Deep Learning Data-parallel PyTorch • Can be quite easy: 70Infrastructure & Tooling - Distributed Training
  • 71. Full Stack Deep Learning Data-parallel PyTorch • If want to spread over multiple machines, or make it work with model-parallel, gets more complicated 71Infrastructure & Tooling - Distributed Training In training code In consoles of the workers
  • 72. Full Stack Deep Learning Ray • Ray is open-source project for effortless, stateful distributed computing in Python 72 https://ray.readthedocs.io/en/latest/walkthrough.html
  • 73. Full Stack Deep Learning Horovod • Distributed training framework for Tensorflow, Keras, and PyTorch • Uses MPI (standard multi-process communication framework) instead of Tensorflow parameter servers • Could be an easier experience for multi-node training 73Infrastructure & Tooling - Distributed Training
  • 74. Full Stack Deep Learning Questions? 74
  • 75. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Experiment Management
  • 76. Full Stack Deep Learning Experiment Management • Even running one experiment at a time, can lose track of which code, parameters, and dataset generated which trained model. • When running multiple experiments, problem is much worse. 76Infrastructure & Tooling - Experiment Management https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
  • 77. Full Stack Deep Learning 77 Tensorboard • A fine solution for single experiments • Gets unwieldy to manage many experiments, and to properly store past work Infrastructure & Tooling - Experiment Management
  • 78. Full Stack Deep Learning 78 Losswise Infrastructure & Tooling - Experiment Management
  • 79. Full Stack Deep Learning 79 Comet.ml Infrastructure & Tooling - Experiment Management
  • 80. Full Stack Deep Learning 80 Weights & Biases • What we use in Lab 3 Infrastructure & Tooling - Experiment Management
  • 81. Full Stack Deep Learning 81 MLFlow tracking Infrastructure & Tooling - Experiment Management https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb • Self-hosted solution from DataBricks
  • 82. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - Hyperparameter Tuning
  • 83. Full Stack Deep Learning Hyperparameter Optimization • Useful to have software that helps you search over hyper parameter settings. • Could be as simple as being able to provide `—lr=(0.0001, 0.1) -- num_layers=[128, 256, 512]` to training script 83Infrastructure & Tooling - Hyperparameter Tuning
  • 84. Full Stack Deep Learning 84Infrastructure & Tooling - Hyperparameter Tuning
  • 85. Full Stack Deep Learning 85Infrastructure & Tooling - Hyperparameter Tuning
  • 86. Full Stack Deep Learning Ray - Tune • "Choose among scalable SOTA algorithms such as Population Based Training (PBT), Vizier’s Median Stopping Rule, HyperBand/ASHA." • These redirect compute resources toward promising areas of search space 86Infrastructure & Tooling - Hyperparameter Tuning
  • 87. Full Stack Deep Learning 87 Can also visually inspect hyperparam results Infrastructure & Tooling - Hyperparameter Tuning • We will also see a solution from W&B in Lab 4
  • 88. Full Stack Deep Learning Questions? 88
  • 89. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Infrastructure & Tooling - All-in-one
  • 90. Full Stack Deep Learning All-in-one Solutions • Single system for everything • development (hosted notebook) • scaling experiments to many machines (sometimes even provisioning) • tracking experiments and versioning models • deploying models • monitoring performance 90Infrastructure & Tooling - All-in-one
  • 91. Full Stack Deep Learning 91Infrastructure & Tooling - All-in-one
  • 92. Full Stack Deep Learning 92Infrastructure & Tooling - All-in-one
  • 93. Full Stack Deep Learning 93Infrastructure & Tooling - All-in-one
  • 94. Full Stack Deep Learning 94Infrastructure & Tooling - All-in-one
  • 95. Full Stack Deep Learning 95Infrastructure & Tooling - All-in-one
  • 96. Full Stack Deep Learning 96 https://www.slideshare.net/AmazonWebServices/build-train-and-deploy-machine-learning-models-at-scale Infrastructure & Tooling - All-in-one 40% markup over corresponding EC2 instances
  • 97. Full Stack Deep Learning 97Infrastructure & Tooling - All-in-one
  • 98. Full Stack Deep Learning 98Infrastructure & Tooling - All-in-one
  • 99. Full Stack Deep Learning 99Infrastructure & Tooling - All-in-one
  • 100. Full Stack Deep Learning 100Infrastructure & Tooling - All-in-one • HyperBand based hyperparameter tuning • In-house distributed training module
  • 101. Full Stack Deep Learning Domino Data Lab 101 Provision compute Track experiments Infrastructure & Tooling - All-in-one
  • 102. Full Stack Deep Learning 102 Domino Data Lab Deploy REST API Monitor predictions Publish applets Infrastructure & Tooling - All-in-one
  • 103. Full Stack Deep Learning 103 Domino Data Lab Monitor spend All projects in one place Infrastructure & Tooling - All-in-one
  • 104. Full Stack Deep Learning 104 Neptune Paperspace Gradient W&B Comet.ml Floyd Algorithmia Determined.ai Amazon SageMaker GC ML Engine Domino Data Lab Hardware GCP or agnostic Paperspace N/A N/A GCP N/A Agnostic AWS GCP Agnostic Resource Management Yes Yes No No Yes No Yes*
 (but not provisioning) Yes Yes Yes Hyperparam Optimization Yes No Yes Yes No No Yes Yes Yes Yes Storing Models Yes Yes Yes Yes Yes No Yes Yes Yes Yes Reviewing Experiments Yes Yes Yes Yes Yes No Yes No Yes Yes Deploying Models as REST API No No No No Yes Yes Yes Yes Yes Yes Monitoring No No No No No Yes No Yes Yes Yes Infrastructure & Tooling - All-in-one
  • 105. Full Stack Deep Learning Questions? 105
  • 106. Full Stack Deep Learning Thank you! 106