SlideShare a Scribd company logo
1 of 114
Download to read offline
Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Infrastructure & Tooling
Full Stack Deep Learning - UC Berkeley Spring 2021
Monitor predictions and close data
flywheel loop
Deploy model
Run experiments, review results
Provision compute
Write and debug model code
2
Provide data
Get optimal prediction system

as scalable API or mobile app
Dream Reality
Aggregate, process, clean, label,

and version data
Infrastructure & Tooling - Overview
Full Stack Deep Learning - UC Berkeley Spring 2021 3
Goal: add data, see model improve
Andrej Karpathy at PyTorch Devcon 2019 - https://www.youtube.com/watch?v=oBklltKXtDE
Infrastructure & Tooling - Overview
Full Stack Deep Learning - UC Berkeley Spring 2021
• Provide some labeled input-output pairs

• Press a button

• A prediction system that gets 100% accuracy on your data distribution is
live (as an infinitely-scalable API, or running embedded or mobile)
4
SE4ML: Software Engineering for Machine Learning (NIPS 2014
Infrastructure & Tooling - Overview
Full Stack Deep Learning - UC Berkeley Spring 2021 5
Infrastructure & Tooling - Overview
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation
6
Deployment
or or
Data
Infrastructure & Tooling - Overview
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Frameworks &
Distributed Training
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Overview
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
8
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Programming Language
• Python, because of the libraries

• Clear winner in scientific and data computing
10
Full Stack Deep Learning - UC Berkeley Spring 2021
Editors
11
VS Code
Vim
PyCharm
Emacs
Jupyter
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Visual Studio Code
12
Infrastructure & Tooling - Software Engineering
• VS Code makes for a very nice Python
experience
Cvjmu.jo!hju!tubhjoh!boe!ejggjoh
Pqfo!xipmf!qspkfdut!sfnpufmz!
Qffl!epdvnfoubujpo
Mjou!dpef!bt!zpvs!xsjuf
Full Stack Deep Learning - UC Berkeley Spring 2021
Visual Studio Code
13
Cvjmu.jo!hju!tubhjoh!boe!ejggjoh
Pqfo!qspkfdut!sfnpufmz!
Vtf!uif!ufsnjobm
Infrastructure & Tooling - Software Engineering
Opufcppl!qpsu!gpsxbsejoh
Full Stack Deep Learning - UC Berkeley Spring 2021
Linters and Type Hints
• Whatever code style rules can be codified,
should be

• Static analysis can catch some bugs

• Static type checking both documents code
and catches bugs

• Will see in Lab 7
14
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Jupyter Notebooks
• Notebooks have become fundamental to
data science

• Great as the "first draft" of a project

• Jeremy Howard from fast.ai good to 

learn from (course.fast.ai videos)

• Difficult to make scalable, reproducible,
well-tested
15
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Problems with notebooks
• Hard to version

• Notebook "IDE" is primitive

• Very hard to test

• Out-of-order execution artifacts

• Hard to run long or distributed tasks
16
https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Jupyter Notebooks
• Counter-points:

• Netflix bases all ML workflows
on them
17
Infrastructure & Tooling - Software Engineering
https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
Full Stack Deep Learning - UC Berkeley Spring 2021
NBDev
• Counter-points:

• Jeremy Howard from fast.ai
uses them for everything, with
nbdev
18
Infrastructure & Tooling - Software Engineering
https://github.com/fastai/nbdev
Full Stack Deep Learning - UC Berkeley Spring 2021
Streamlit
• New, but great at fulfilling a common ML need: interactive applets

• Decorate normal Python code

• Smart data caching, quick re-rendering

• In the works: sharing as easy as pushing a web app to Heroku
19
https://streamlit.io
Infrastructure & Tooling - Software Engineering
Full Stack Deep Learning - UC Berkeley Spring 2021
Setting up environment
20
https://github.com/full-stack-deep-learning/conda-piptools
(How we do it in lab)
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
21
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
23
Development Training/Evaluation
• Function
• Writing code

• Debugging models

• Looking at results

• Desiderata
• Quickly compile models and run training

• Nice-to-have: use GUI

• Solutions
• Desktop with 1-4 GPUs

• Cloud instance with 1-4 GPUs
• Function
• Model architecture / hyperparam search

• Training large models

• Desiderata
• Easy to launch experiments and review
results

• Solutions
• Desktop with 4 GPUs

• Private cluster of GPU machines

• Cloud cluster of GPU instances
or
or
Infrastructure & Tooling - Compute
Compute needs
Full Stack Deep Learning - UC Berkeley Spring 2021
Why compute matters
24
https://openai.com/blog/ai-and-compute/
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
25
Full Stack Deep Learning - UC Berkeley Spring 2021
So, or ?
• GPU Basics

• Cloud Options

• On-prem Options

• Analysis and Recommendations
26
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
GPU Basics
• NVIDIA has been the only game in town

• Google TPUs are the fastest (on GCP only)
27
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
28
GPU Comparison Table
Infrastructure & Tooling - Compute
https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
29
GPU Comparison Table
Infrastructure & Tooling - Compute
• New NVIDIA architecture every year

• Kepler —> Pascal —> Volta -> Turing -> Ampere

• Server version first, then “enthusiast”, then consumer.

• For business, only supposed to use server cards.
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
30
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
31
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model

• 32bit vs Tensor TFlops
• Tensor Cores are specifically for deep learning operations (mixed precision)

• Good for convolutional/transformer models
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
32
GPU Comparison Table
Infrastructure & Tooling - Compute
• RAM: should fit meaningful batches of your model

• 32bit vs Tensor TFlops
• Tensor Cores are specifically for deep learning operations (mixed precision)

• Good for convolutional/transformer models

• Great speedups and bigger batches from 16bit mixed-precision
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
33
Kepler/Maxwell
• 2-4x slower than Pascal/Volta

• Hardware: don’t buy, too old

• Cloud: K80's are cheap (providers are stuck with what they bought)
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
34
Pascal
• Hardware: 1080 Ti still good if buying used, especially for recurrent

• Cloud: P100 is a mid-range option
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
35
Volta/Turing
• Preferred choice right now due to 16bit mixed precision support

• Hardware: 

• 2080 Ti is ~1.3x as fast as 1080 Ti in 32bit, but ~2x faster in 16bit

• Titan RTX is 10-20% faster yet. Titan V is just as good (but less RAM), if find used.

• Cloud: V100 is the ultimate for speed
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
36
Ampere
• Latest hardware, with the most Tensor cores.

• At least 30% speedup over Turing
http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/
Infrastructure & Tooling - Compute
Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud
K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS
P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS
1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $

used
V100 2017H1 Volta Server 16 14 120 Yes $$$$
 AWS, GCP, MS
2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$
Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$
3090 2021H1
 Ampere
 Enthusiast 24 ?
 285 Yes
 $$
A100

00A10000

2020H1
 Ampere Server 40 19.5 312 Yes
 $$$$ AWS, GCP
Full Stack Deep Learning - UC Berkeley Spring 2021
37
https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
A100 vs V100
38
https://lambdalabs.com/deep-learning/servers/hyperplane-a100
https://lambdalabs.com/blog/nvidia-a100-vs-v100-benchmarks/
Full Stack Deep Learning - UC Berkeley Spring 2021
Great resource
39
https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/#GPU_Recommendations
Full Stack Deep Learning - UC Berkeley Spring 2021
Cloud Providers
• Amazon Web Services, Google Cloud Platform, Microsoft Azure are the
heavyweights.

• Heavyweights are largely similar in function and price.

• AWS most expensive

• GCP just about as expensive, and has TPUs

• Azure reportedly has bad user experience

• Startups are Coreweave, Lambda Labs, and more
40
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
41
Amazon Web Services
Name GPU GPUs GPU RAM vCPU RAM On-demand
p2.16xlarge K80 8 12 64 732 $14.40
p3.16xlarge V100 8 16 64 488 $24.48
p3dn.24xlarge V100 8 32 96 768 $31.22
p4dn.16xlarge A100 8 40 96 1152 $32.78
• Three generations of GPUs

• Prices haven't moved in years
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
42
Google Cloud Platform
• Same lineup: K80, V100 (also P100), and A100

• In general, a little cheaper than AWS.

• Also has “Tensor Processing Units” (TPUs), the fastest option today.
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
43
Microsoft Azure
https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/
• Same lineup: K80, P100s, V100s (no A100s yet)

• Similar pricing
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Lambda Labs Cloud
44
https://lambdalabs.com/service/gpu-cloud
Full Stack Deep Learning - UC Berkeley Spring 2021
Coreweave
45
https://www.coreweave.com/pricing
Full Stack Deep Learning - UC Berkeley Spring 2021
On-prem Options
• Build your own

• Up to 4 Turing or 2 Ampere GPUs is easy

• Buy pre-built

• Lambda Labs, NVIDIA, and builders like Supermicro, Cirrascale, etc.
46
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Building your own
• Quiet PC with 128GB RAM and 2x
RTX 3090’s: $8000

• (If you can find them)

• One day to build and set up

• Going beyond 4 2000-series or 2
3000-series is painful

• All you need to know: http://
timdettmers.com/2018/12/16/
deep-learning-hardware-guide/
47
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Pre-built: Lambda Labs
48
Infrastructure & Tooling - Compute
 31&!npsf!fyqfotjwf!
uibo!cvjmejoh!zpvstfmg
Full Stack Deep Learning - UC Berkeley Spring 2021
Pre-built: Lambda Labs
49
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Cost Analysis
• Let’s first compare on-prem and equivalent cloud machines

• Then let’s also consider spot instances for experiment scaling
50
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Quad PC vs Quad Cloud
Verdict: not worth it. PC pays for itself in 5-10 weeks.
51
GPU Arch RAM Build Price Cloud Price
Hours =
Build
24/7 Weeks =
Build
16/5 Weeks =
Build
4x RTX
2080 Ti
Volta 12 $10000.00 0
4x V100 Volta 16 $12.00 833 5 10
Gvmm.ujnf!xpslmpbe
Xpsl!xffl!mpbe
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Quad PC vs Quad Cloud
52
https://l7.curtisnorthcutt.com/build-pro-deep-learning-workstation
Full Stack Deep Learning - UC Berkeley Spring 2021
Quad PC vs. Spot Instances
How to think about it: cloud enables quicker experiments.
53
ipvst
ipvst
Length of trial in experiment (hours) 6
Number of trials in experiment 16
Total GPU hours for experiment 96
Cost of 4x RTX 2080 Ti machine $10,000.00
Time to run experiment on 4x machine 24
Time to run experiment on V100 spot instances 6
Cost of provisioning enough pre-emptible V100s $96.00
Number of experiments that equal cost of 4x
machine
104
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Quad PC vs. Spot Instances
But at a pretty steep price.
54
ipvst
ipvst
Length of trial in experiment (hours) 6
Number of trials in experiment 16
Total GPU hours for experiment 96
Cost of 4x RTX 2080 Ti machine $10,000.00
Time to run experiment on 4x machine 24
Time to run experiment on V100 spot instances 6
Cost of provisioning enough pre-emptible V100s $96.00
Number of experiments that equal cost of 4x
machine
104
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
In Practice
• Even though cloud is expensive, it's hard to make on-prem scale past a
certain point

• Dev-ops (declarative infra, repeatable processes) definitely easier in the
cloud

• Maintenance is also a big factor
55
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Recommendation for hobbyist
• Development

• Build a 4x Turing or 2x Ampere PC

• Training/Evaluation

• Use the same PC, just always keep it running

• To scale out, use Lambda or Coreweave cloud instances.
56
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Recommendation for startup
• Development

• Buy a 4x Turing or 2x Ampere PC per ML Scientist

• Training/Evaluation
• To scale out, buy shared server machines or use cloud instances.
57
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Recommendation for larger company
• Development

• Buy 8x Turing or Ampere server rack per ML scientist

• Or, go straight to the cloud with V100 instances

• Training/Evaluation
• To scale out, use cloud instances with proper provisioning and handling
of failures
58
Infrastructure & Tooling - Compute
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
59
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021
61
Resource Management
• Function
• Multiple people…

• using multiple GPUs/machines…

• running different environments

• Goal
• Easy to launch a batch of experiments, with proper dependencies and resource allocations

• Solutions
• Python scripts

• SLURM

• Docker + Kubernetes

• Software specialized for ML use cases
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021 62
• Problem we’re solving: allocate free
resources to programs

• Can be scripted pretty easily

• Even better, use old-school cluster job
scheduler

• Job defines necessary resources, gets
queued
Scripts or
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021 63
• Docker is a way to package up an entire
dependency stack in a lighter-than-a-VM
package

• We will talk more about Docker in the
Deployment lecture, and use it in lab
Docker + Kubernetes
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021 64
• Kubernetes is a way to run many docker
containers on top of a cluster
Docker + Kubernetes
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021
• Open source project from Google

• Spawn and manage Jupyter
notebooks

• Manage multi-step ML workflows

• Plug-ins for hyperparameter tuning,
model deployment
65
https://github.com/Langhalsdino/Kubernetes-GPU-Guide
Infrastructure & Tooling - Resource Management
Full Stack Deep Learning - UC Berkeley Spring 2021
This is an active area
• All-in-one solutions like SageMaker and Paperspace Gradient do this

• And some recently-announced startups have it as their goal...
66
Full Stack Deep Learning - UC Berkeley Spring 2021
Anyscale (makers of Ray, from Berkeley)
67
Full Stack Deep Learning - UC Berkeley Spring 2021
Grid.ai
68
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
69
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Frameworks
Full Stack Deep Learning - UC Berkeley Spring 2021
71
Deep Learning Frameworks
• Unless you have a good reason not to, use
Tensorflow/Keras or PyTorch

• Both have converged to the same point:

• easy development via define-by-run

• multi-platform optimized execution graph

• Today, most new projects use PyTorch, because
of its more Python dev-friendly experience

• fast.ai library builds on PyTorch with best
practices

• PyTorch-Lightning adds a powerful training loop
Good
for
production
Good for development
Tensorflow 2.0
+ eager execution
+ TorchScript
PyTorch 1.0
Infrastructure & Tooling - Frameworks
3126
3126
3128
3129
312:
Full Stack Deep Learning - UC Berkeley Spring 2021
PyTorch dominates new development
72
https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/
Infrastructure & Tooling - Frameworks
Full Stack Deep Learning - UC Berkeley Spring 2021
PyTorch and Tensorflow
73
Full Stack Deep Learning - UC Berkeley Spring 2021
Why do we need frameworks?
• Deep Learning is not lot of code with a
matrix math library (Numpy)

• BUT: auto-differentiation and CUDA are a
lot of work

• ...Also all the layer types, optimizers, data
interfaces, etc.
74
Full Stack Deep Learning - UC Berkeley Spring 2021
• Recent project from Google that's gaining steam

• Numpy + auto-differentiation and compilation to GPU/TPU code

• Not just for deep learning!
75
https://github.com/google/jax#quickstart-colab-in-the-cloud
Full Stack Deep Learning - UC Berkeley Spring 2021
Hugging face
• Tons of NLP-focused model architectures (and pre-trained weights) for
both PyTorch and Tensorflow
76
Full Stack Deep Learning - UC Berkeley Spring 2021
Distributed Training
• Using multiple GPUs and/or machines to train a single model.

• More complex than simply running different experiments on different
GPUs

• A must-do on big datasets and large models
77
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Data Parallelism
• If iteration time is too long, try training in data
parallel regime

• "For convolution, expect 1.9x/3.5x speedup
for 2/4 GPUs.

http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/
78
http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf
https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/
Uijt!jt!bwfsbhf!pwfs!
ejggfsfou!dpowofut
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Model Parallelism
79
http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf
• Model parallelism is necessary when
model does not fit on a single GPU

• Introduces a lot of complexity and is
usually not worth it. (But this is
changing.)

• Better to buy the largest GPU you can,
and/or use gradient checkpointing
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Data-parallel PyTorch
• Can be quite easy:
80
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Data-parallel PyTorch-Lightning
• Single node:

python training/
run_experiment.py --gpus=8 --
accelerator=ddp

• And using SLURM to run on
multiple nodes:

python training/
run_experiment.py --gpus=8 --
accelerator=ddp --nodes=4
81
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Horovod
• Distributed training framework for Tensorflow, Keras, and PyTorch

• Uses MPI (standard multi-process communication framework) instead of
Tensorflow parameter servers

• Could be an easier experience for multi-node training
82
Infrastructure & Tooling - Distributed Training
Full Stack Deep Learning - UC Berkeley Spring 2021
Ray (from Anyscale)
• Ray is open-source project
for effortless, stateful
distributed computing in
Python

• From Berkeley!
83
https://ray.readthedocs.io/en/latest/walkthrough.html
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
84
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning - UC Berkeley Spring 2021
Experiment Management
• Even running one experiment at a time, can lose track of which code,
parameters, and dataset generated which trained model.

• When running multiple experiments, problem is much worse.
86
Infrastructure & Tooling - Experiment Management
https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
Full Stack Deep Learning - UC Berkeley Spring 2021
87
Tensorboard
• A fine solution for single
experiments

• Gets unwieldy to manage many
experiments, and to properly
store past work
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning - UC Berkeley Spring 2021
88
MLFlow tracking
Infrastructure & Tooling - Experiment Management
https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
• Self-hosted solution
from DataBricks
Full Stack Deep Learning - UC Berkeley Spring 2021
89
Comet.ml
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning - UC Berkeley Spring 2021
90
• What we use in lab
Infrastructure & Tooling - Experiment Management
Full Stack Deep Learning - UC Berkeley Spring 2021
91
• Publish reports with
embedded charts,
figures, etc
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning - UC Berkeley Spring 2021
Hyperparameter Optimization
• Useful to have software that helps you search over hyper parameter
settings.

• Could be as simple as being able to provide `—lr=(0.0001, 0.1) --
num_layers=[128, 256, 512]` to training script

• Would be even better if settings were selected intelligently, and
underperforming runs stopped early.
93
Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning - UC Berkeley Spring 2021 94
Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning - UC Berkeley Spring 2021
Ray Tune
• "Choose among scalable SOTA
algorithms such as Population
Based Training (PBT), Vizier’s
Median Stopping Rule,
HyperBand/ASHA."

• These redirect compute resources
toward promising areas of search
space
95
Infrastructure & Tooling - Hyperparameter Tuning
Full Stack Deep Learning - UC Berkeley Spring 2021 96
Infrastructure & Tooling - Hyperparameter Tuning
• What we use in Lab
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
97
Full Stack Deep Learning - UC Berkeley Spring 2021
“All-in-one”
Training/Evaluation Deployment
or or
Data
Web
Versioning
Monitoring
CI / Testing
Exploration
Sources
Hyperparameter Tuning
Processing
Compute
Labeling
Software Engineering
Edge
Resource Management
Experiment Management
Data Lake / Warehouse
Feature
Store
Frameworks &
Distributed Training
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021
All-in-one Solutions
• Single system for everything

• development (hosted notebook)

• scaling experiments to many machines (sometimes even provisioning)

• tracking experiments and versioning models

• deploying models

• monitoring performance
99
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 100
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 101
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 102
https://www.slideshare.net/AmazonWebServices/build-train-and-deploy-machine-learning-models-at-scale
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 103
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 104
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 105
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021 106
Infrastructure & Tooling - All-in-one
• Open source!

• In-house distributed training module

• HyperBand based hyperparameter
tuning

• Smart GPU scheduling, including spot
instances

• Experiment tracking
Full Stack Deep Learning - UC Berkeley Spring 2021
Domino Data Lab
107
Provision compute
Track experiments
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021
108
Domino Data Lab
Deploy REST API
Monitor predictions
Publish applets
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021
109
Domino Data Lab
Monitor spend
All projects in one place
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021
Natural place to go for most MLOps companies
110
Full Stack Deep Learning - UC Berkeley Spring 2021
111
W&B
Paperspace
Gradient
Floyd Determined.ai Domino Data Lab
Amazon
SageMaker
GC ML
Engine
Hardware N/A Paperspace GCP Agnostic Agnostic AWS GCP
Resource Management No Yes Yes Yes Yes Yes Yes
Hyperparam Optimization Yes No No Yes Yes Yes Yes
Storing Models Yes Yes Yes Yes Yes Yes Yes
Reviewing Experiments Yes Yes Yes Yes Yes Yes Yes
Deploying Models as REST API No No Yes No Yes Yes Yes
Monitoring No No No No Yes Yes Yes
Infrastructure & Tooling - All-in-one
Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
112
Full Stack Deep Learning - UC Berkeley Spring 2021
Tooling Tuesdays
113
https://twitter.com/full_stack_dl
Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
114

More Related Content

What's hot

Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Adrien Blind
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...Neo4j
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Fivetran pitch deck
Fivetran pitch deckFivetran pitch deck
Fivetran pitch deckTech in Asia
 
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...Neo4j
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleDatabricks
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Understanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityUnderstanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityDevOps.com
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesDATAVERSITY
 

What's hot (20)

Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...
Optimizing the Supply Chain with Knowledge Graphs, IoT and Digital Twins_Moor...
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Fivetran pitch deck
Fivetran pitch deckFivetran pitch deck
Fivetran pitch deck
 
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...
The perfect couple: Uniting Large Language Models and Knowledge Graphs for En...
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
MLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at ScaleMLOps Virtual Event: Automating ML at Scale
MLOps Virtual Event: Automating ML at Scale
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Understanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application QualityUnderstanding DataOps and Its Impact on Application Quality
Understanding DataOps and Its Impact on Application Quality
 
Data Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & ApproachesData Lake Architecture – Modern Strategies & Approaches
Data Lake Architecture – Modern Strategies & Approaches
 

Similar to Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)

Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaIntel® Software
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018NVIDIA
 
OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020OpenACC
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Iulian Pintoiu
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC
 
OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021OpenACC
 
AI Bridging Cloud Infrastructure (ABCI) and its communication performance
AI Bridging Cloud Infrastructure (ABCI) and its communication performanceAI Bridging Cloud Infrastructure (ABCI) and its communication performance
AI Bridging Cloud Infrastructure (ABCI) and its communication performanceinside-BigData.com
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Holden Karau
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningDataWorks Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsIgor José F. Freitas
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceData Works MD
 
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC
 

Similar to Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021) (20)

Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020OpenACC Monthly Highlights September 2020
OpenACC Monthly Highlights September 2020
 
Python Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and AnacondaPython Data Science and Machine Learning at Scale with Intel and Anaconda
Python Data Science and Machine Learning at Scale with Intel and Anaconda
 
PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018PGI Compilers & Tools Update- March 2018
PGI Compilers & Tools Update- March 2018
 
OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020OpenACC Monthly Highlights: November 2020
OpenACC Monthly Highlights: November 2020
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
 
OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020OpenACC Monthly Highlights: June 2020
OpenACC Monthly Highlights: June 2020
 
OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021OpenACC Monthly Highlights: February 2021
OpenACC Monthly Highlights: February 2021
 
AI Bridging Cloud Infrastructure (ABCI) and its communication performance
AI Bridging Cloud Infrastructure (ABCI) and its communication performanceAI Bridging Cloud Infrastructure (ABCI) and its communication performance
AI Bridging Cloud Infrastructure (ABCI) and its communication performance
 
Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018Intro - End to end ML with Kubeflow @ SignalConf 2018
Intro - End to end ML with Kubeflow @ SignalConf 2018
 
OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021OpenACC Monthly Highlights: January 2021
OpenACC Monthly Highlights: January 2021
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020OpenACC Monthly Highlights: August 2020
OpenACC Monthly Highlights: August 2020
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020OpenACC Monthly Highlights: May 2020
OpenACC Monthly Highlights: May 2020
 
Trends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systemsTrends towards the merge of HPC + Big Data systems
Trends towards the merge of HPC + Big Data systems
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
 

More from Sergey Karayev

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Sergey Karayev
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningSergey Karayev
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningSergey Karayev
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningSergey Karayev
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningSergey Karayev
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019Sergey Karayev
 

More from Sergey Karayev (20)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep Learning
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 

Recently uploaded

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentationphoebematthew05
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 

Recently uploaded (20)

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
costume and set research powerpoint presentation
costume and set research powerpoint presentationcostume and set research powerpoint presentation
costume and set research powerpoint presentation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 

Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)

  • 1. Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel Infrastructure & Tooling
  • 2. Full Stack Deep Learning - UC Berkeley Spring 2021 Monitor predictions and close data flywheel loop Deploy model Run experiments, review results Provision compute Write and debug model code 2 Provide data Get optimal prediction system
 as scalable API or mobile app Dream Reality Aggregate, process, clean, label,
 and version data Infrastructure & Tooling - Overview
  • 3. Full Stack Deep Learning - UC Berkeley Spring 2021 3 Goal: add data, see model improve Andrej Karpathy at PyTorch Devcon 2019 - https://www.youtube.com/watch?v=oBklltKXtDE Infrastructure & Tooling - Overview
  • 4. Full Stack Deep Learning - UC Berkeley Spring 2021 • Provide some labeled input-output pairs • Press a button • A prediction system that gets 100% accuracy on your data distribution is live (as an infinitely-scalable API, or running embedded or mobile) 4 SE4ML: Software Engineering for Machine Learning (NIPS 2014 Infrastructure & Tooling - Overview
  • 5. Full Stack Deep Learning - UC Berkeley Spring 2021 5 Infrastructure & Tooling - Overview
  • 6. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation 6 Deployment or or Data Infrastructure & Tooling - Overview Web Versioning Monitoring CI / Testing Exploration Sources Frameworks & Distributed Training Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store
  • 7. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Overview
  • 8. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 8
  • 9. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Software Engineering
  • 10. Full Stack Deep Learning - UC Berkeley Spring 2021 Programming Language • Python, because of the libraries • Clear winner in scientific and data computing 10
  • 11. Full Stack Deep Learning - UC Berkeley Spring 2021 Editors 11 VS Code Vim PyCharm Emacs Jupyter Infrastructure & Tooling - Software Engineering
  • 12. Full Stack Deep Learning - UC Berkeley Spring 2021 Visual Studio Code 12 Infrastructure & Tooling - Software Engineering • VS Code makes for a very nice Python experience Cvjmu.jo!hju!tubhjoh!boe!ejggjoh Pqfo!xipmf!qspkfdut!sfnpufmz! Qffl!epdvnfoubujpo Mjou!dpef!bt!zpvs!xsjuf
  • 13. Full Stack Deep Learning - UC Berkeley Spring 2021 Visual Studio Code 13 Cvjmu.jo!hju!tubhjoh!boe!ejggjoh Pqfo!qspkfdut!sfnpufmz! Vtf!uif!ufsnjobm Infrastructure & Tooling - Software Engineering Opufcppl!qpsu!gpsxbsejoh
  • 14. Full Stack Deep Learning - UC Berkeley Spring 2021 Linters and Type Hints • Whatever code style rules can be codified, should be • Static analysis can catch some bugs • Static type checking both documents code and catches bugs • Will see in Lab 7 14 Infrastructure & Tooling - Software Engineering
  • 15. Full Stack Deep Learning - UC Berkeley Spring 2021 Jupyter Notebooks • Notebooks have become fundamental to data science • Great as the "first draft" of a project • Jeremy Howard from fast.ai good to 
 learn from (course.fast.ai videos) • Difficult to make scalable, reproducible, well-tested 15 Infrastructure & Tooling - Software Engineering
  • 16. Full Stack Deep Learning - UC Berkeley Spring 2021 Problems with notebooks • Hard to version • Notebook "IDE" is primitive • Very hard to test • Out-of-order execution artifacts • Hard to run long or distributed tasks 16 https://towardsdatascience.com/5-reasons-why-jupyter-notebooks-suck-4dc201e27086 Infrastructure & Tooling - Software Engineering
  • 17. Full Stack Deep Learning - UC Berkeley Spring 2021 Jupyter Notebooks • Counter-points: • Netflix bases all ML workflows on them 17 Infrastructure & Tooling - Software Engineering https://medium.com/netflix-techblog/notebook-innovation-591ee3221233
  • 18. Full Stack Deep Learning - UC Berkeley Spring 2021 NBDev • Counter-points: • Jeremy Howard from fast.ai uses them for everything, with nbdev 18 Infrastructure & Tooling - Software Engineering https://github.com/fastai/nbdev
  • 19. Full Stack Deep Learning - UC Berkeley Spring 2021 Streamlit • New, but great at fulfilling a common ML need: interactive applets • Decorate normal Python code • Smart data caching, quick re-rendering • In the works: sharing as easy as pushing a web app to Heroku 19 https://streamlit.io Infrastructure & Tooling - Software Engineering
  • 20. Full Stack Deep Learning - UC Berkeley Spring 2021 Setting up environment 20 https://github.com/full-stack-deep-learning/conda-piptools (How we do it in lab)
  • 21. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 21
  • 22. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Compute
  • 23. Full Stack Deep Learning - UC Berkeley Spring 2021 23 Development Training/Evaluation • Function • Writing code • Debugging models • Looking at results • Desiderata • Quickly compile models and run training • Nice-to-have: use GUI • Solutions • Desktop with 1-4 GPUs • Cloud instance with 1-4 GPUs • Function • Model architecture / hyperparam search • Training large models • Desiderata • Easy to launch experiments and review results • Solutions • Desktop with 4 GPUs • Private cluster of GPU machines • Cloud cluster of GPU instances or or Infrastructure & Tooling - Compute Compute needs
  • 24. Full Stack Deep Learning - UC Berkeley Spring 2021 Why compute matters 24 https://openai.com/blog/ai-and-compute/ Infrastructure & Tooling - Compute
  • 25. Full Stack Deep Learning - UC Berkeley Spring 2021 25
  • 26. Full Stack Deep Learning - UC Berkeley Spring 2021 So, or ? • GPU Basics • Cloud Options • On-prem Options • Analysis and Recommendations 26 Infrastructure & Tooling - Compute
  • 27. Full Stack Deep Learning - UC Berkeley Spring 2021 GPU Basics • NVIDIA has been the only game in town • Google TPUs are the fastest (on GCP only) 27 Infrastructure & Tooling - Compute
  • 28. Full Stack Deep Learning - UC Berkeley Spring 2021 28 GPU Comparison Table Infrastructure & Tooling - Compute https://www.microway.com/knowledge-center-articles/comparison-of-nvidia-geforce-gpus-and-nvidia-tesla-gpus/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 29. Full Stack Deep Learning - UC Berkeley Spring 2021 29 GPU Comparison Table Infrastructure & Tooling - Compute • New NVIDIA architecture every year • Kepler —> Pascal —> Volta -> Turing -> Ampere • Server version first, then “enthusiast”, then consumer. • For business, only supposed to use server cards. Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 30. Full Stack Deep Learning - UC Berkeley Spring 2021 30 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 31. Full Stack Deep Learning - UC Berkeley Spring 2021 31 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model • 32bit vs Tensor TFlops • Tensor Cores are specifically for deep learning operations (mixed precision) • Good for convolutional/transformer models http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 32. Full Stack Deep Learning - UC Berkeley Spring 2021 32 GPU Comparison Table Infrastructure & Tooling - Compute • RAM: should fit meaningful batches of your model • 32bit vs Tensor TFlops • Tensor Cores are specifically for deep learning operations (mixed precision) • Good for convolutional/transformer models • Great speedups and bigger batches from 16bit mixed-precision http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 33. Full Stack Deep Learning - UC Berkeley Spring 2021 33 Kepler/Maxwell • 2-4x slower than Pascal/Volta • Hardware: don’t buy, too old • Cloud: K80's are cheap (providers are stuck with what they bought) Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 34. Full Stack Deep Learning - UC Berkeley Spring 2021 34 Pascal • Hardware: 1080 Ti still good if buying used, especially for recurrent • Cloud: P100 is a mid-range option Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 35. Full Stack Deep Learning - UC Berkeley Spring 2021 35 Volta/Turing • Preferred choice right now due to 16bit mixed precision support • Hardware: • 2080 Ti is ~1.3x as fast as 1080 Ti in 32bit, but ~2x faster in 16bit • Titan RTX is 10-20% faster yet. Titan V is just as good (but less RAM), if find used. • Cloud: V100 is the ultimate for speed http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/ Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 36. Full Stack Deep Learning - UC Berkeley Spring 2021 36 Ampere • Latest hardware, with the most Tensor cores. • At least 30% speedup over Turing http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ https://lambdalabs.com/blog/titan-rtx-tensorflow-benchmarks/ Infrastructure & Tooling - Compute Card Release Arch Use-case RAM (Gb) 32bit TFlops Tensor TFlops 16bit Cost Cloud K80 2014H2 Kepler Server 24 5 N/A No - AWS, GCP, MS P100 2016H1 Pascal Server 16 10 N/A Yes - GCP, MS 1080 Ti 2017H1 Pascal Consumer 11 13 N/A No $ used V100 2017H1 Volta Server 16 14 120 Yes $$$$ AWS, GCP, MS 2080 Ti 2018H2 Turing Consumer 11 13 107 Yes $$ Titan RTX 2018H2 Turing Enthusiast 24 16 130 Yes $$ 3090 2021H1 Ampere Enthusiast 24 ? 285 Yes $$ A100 00A10000 2020H1 Ampere Server 40 19.5 312 Yes $$$$ AWS, GCP
  • 37. Full Stack Deep Learning - UC Berkeley Spring 2021 37 https://timdettmers.com/2019/04/03/which-gpu-for-deep-learning/ Infrastructure & Tooling - Compute
  • 38. Full Stack Deep Learning - UC Berkeley Spring 2021 A100 vs V100 38 https://lambdalabs.com/deep-learning/servers/hyperplane-a100 https://lambdalabs.com/blog/nvidia-a100-vs-v100-benchmarks/
  • 39. Full Stack Deep Learning - UC Berkeley Spring 2021 Great resource 39 https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/#GPU_Recommendations
  • 40. Full Stack Deep Learning - UC Berkeley Spring 2021 Cloud Providers • Amazon Web Services, Google Cloud Platform, Microsoft Azure are the heavyweights. • Heavyweights are largely similar in function and price. • AWS most expensive • GCP just about as expensive, and has TPUs • Azure reportedly has bad user experience • Startups are Coreweave, Lambda Labs, and more 40 Infrastructure & Tooling - Compute
  • 41. Full Stack Deep Learning - UC Berkeley Spring 2021 41 Amazon Web Services Name GPU GPUs GPU RAM vCPU RAM On-demand p2.16xlarge K80 8 12 64 732 $14.40 p3.16xlarge V100 8 16 64 488 $24.48 p3dn.24xlarge V100 8 32 96 768 $31.22 p4dn.16xlarge A100 8 40 96 1152 $32.78 • Three generations of GPUs • Prices haven't moved in years Infrastructure & Tooling - Compute
  • 42. Full Stack Deep Learning - UC Berkeley Spring 2021 42 Google Cloud Platform • Same lineup: K80, V100 (also P100), and A100 • In general, a little cheaper than AWS. • Also has “Tensor Processing Units” (TPUs), the fastest option today. Infrastructure & Tooling - Compute
  • 43. Full Stack Deep Learning - UC Berkeley Spring 2021 43 Microsoft Azure https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/ • Same lineup: K80, P100s, V100s (no A100s yet) • Similar pricing Infrastructure & Tooling - Compute
  • 44. Full Stack Deep Learning - UC Berkeley Spring 2021 Lambda Labs Cloud 44 https://lambdalabs.com/service/gpu-cloud
  • 45. Full Stack Deep Learning - UC Berkeley Spring 2021 Coreweave 45 https://www.coreweave.com/pricing
  • 46. Full Stack Deep Learning - UC Berkeley Spring 2021 On-prem Options • Build your own • Up to 4 Turing or 2 Ampere GPUs is easy • Buy pre-built • Lambda Labs, NVIDIA, and builders like Supermicro, Cirrascale, etc. 46 Infrastructure & Tooling - Compute
  • 47. Full Stack Deep Learning - UC Berkeley Spring 2021 Building your own • Quiet PC with 128GB RAM and 2x RTX 3090’s: $8000 • (If you can find them) • One day to build and set up • Going beyond 4 2000-series or 2 3000-series is painful • All you need to know: http:// timdettmers.com/2018/12/16/ deep-learning-hardware-guide/ 47 Infrastructure & Tooling - Compute
  • 48. Full Stack Deep Learning - UC Berkeley Spring 2021 Pre-built: Lambda Labs 48 Infrastructure & Tooling - Compute  31&!npsf!fyqfotjwf! uibo!cvjmejoh!zpvstfmg
  • 49. Full Stack Deep Learning - UC Berkeley Spring 2021 Pre-built: Lambda Labs 49 Infrastructure & Tooling - Compute
  • 50. Full Stack Deep Learning - UC Berkeley Spring 2021 Cost Analysis • Let’s first compare on-prem and equivalent cloud machines • Then let’s also consider spot instances for experiment scaling 50 Infrastructure & Tooling - Compute
  • 51. Full Stack Deep Learning - UC Berkeley Spring 2021 Quad PC vs Quad Cloud Verdict: not worth it. PC pays for itself in 5-10 weeks. 51 GPU Arch RAM Build Price Cloud Price Hours = Build 24/7 Weeks = Build 16/5 Weeks = Build 4x RTX 2080 Ti Volta 12 $10000.00 0 4x V100 Volta 16 $12.00 833 5 10 Gvmm.ujnf!xpslmpbe Xpsl!xffl!mpbe Infrastructure & Tooling - Compute
  • 52. Full Stack Deep Learning - UC Berkeley Spring 2021 Quad PC vs Quad Cloud 52 https://l7.curtisnorthcutt.com/build-pro-deep-learning-workstation
  • 53. Full Stack Deep Learning - UC Berkeley Spring 2021 Quad PC vs. Spot Instances How to think about it: cloud enables quicker experiments. 53 ipvst ipvst Length of trial in experiment (hours) 6 Number of trials in experiment 16 Total GPU hours for experiment 96 Cost of 4x RTX 2080 Ti machine $10,000.00 Time to run experiment on 4x machine 24 Time to run experiment on V100 spot instances 6 Cost of provisioning enough pre-emptible V100s $96.00 Number of experiments that equal cost of 4x machine 104 Infrastructure & Tooling - Compute
  • 54. Full Stack Deep Learning - UC Berkeley Spring 2021 Quad PC vs. Spot Instances But at a pretty steep price. 54 ipvst ipvst Length of trial in experiment (hours) 6 Number of trials in experiment 16 Total GPU hours for experiment 96 Cost of 4x RTX 2080 Ti machine $10,000.00 Time to run experiment on 4x machine 24 Time to run experiment on V100 spot instances 6 Cost of provisioning enough pre-emptible V100s $96.00 Number of experiments that equal cost of 4x machine 104 Infrastructure & Tooling - Compute
  • 55. Full Stack Deep Learning - UC Berkeley Spring 2021 In Practice • Even though cloud is expensive, it's hard to make on-prem scale past a certain point • Dev-ops (declarative infra, repeatable processes) definitely easier in the cloud • Maintenance is also a big factor 55 Infrastructure & Tooling - Compute
  • 56. Full Stack Deep Learning - UC Berkeley Spring 2021 Recommendation for hobbyist • Development • Build a 4x Turing or 2x Ampere PC • Training/Evaluation • Use the same PC, just always keep it running • To scale out, use Lambda or Coreweave cloud instances. 56 Infrastructure & Tooling - Compute
  • 57. Full Stack Deep Learning - UC Berkeley Spring 2021 Recommendation for startup • Development • Buy a 4x Turing or 2x Ampere PC per ML Scientist • Training/Evaluation • To scale out, buy shared server machines or use cloud instances. 57 Infrastructure & Tooling - Compute
  • 58. Full Stack Deep Learning - UC Berkeley Spring 2021 Recommendation for larger company • Development • Buy 8x Turing or Ampere server rack per ML scientist • Or, go straight to the cloud with V100 instances • Training/Evaluation • To scale out, use cloud instances with proper provisioning and handling of failures 58 Infrastructure & Tooling - Compute
  • 59. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 59
  • 60. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Resource Management
  • 61. Full Stack Deep Learning - UC Berkeley Spring 2021 61 Resource Management • Function • Multiple people… • using multiple GPUs/machines… • running different environments • Goal • Easy to launch a batch of experiments, with proper dependencies and resource allocations • Solutions • Python scripts • SLURM • Docker + Kubernetes • Software specialized for ML use cases Infrastructure & Tooling - Resource Management
  • 62. Full Stack Deep Learning - UC Berkeley Spring 2021 62 • Problem we’re solving: allocate free resources to programs • Can be scripted pretty easily • Even better, use old-school cluster job scheduler • Job defines necessary resources, gets queued Scripts or Infrastructure & Tooling - Resource Management
  • 63. Full Stack Deep Learning - UC Berkeley Spring 2021 63 • Docker is a way to package up an entire dependency stack in a lighter-than-a-VM package • We will talk more about Docker in the Deployment lecture, and use it in lab Docker + Kubernetes Infrastructure & Tooling - Resource Management
  • 64. Full Stack Deep Learning - UC Berkeley Spring 2021 64 • Kubernetes is a way to run many docker containers on top of a cluster Docker + Kubernetes Infrastructure & Tooling - Resource Management
  • 65. Full Stack Deep Learning - UC Berkeley Spring 2021 • Open source project from Google • Spawn and manage Jupyter notebooks • Manage multi-step ML workflows • Plug-ins for hyperparameter tuning, model deployment 65 https://github.com/Langhalsdino/Kubernetes-GPU-Guide Infrastructure & Tooling - Resource Management
  • 66. Full Stack Deep Learning - UC Berkeley Spring 2021 This is an active area • All-in-one solutions like SageMaker and Paperspace Gradient do this • And some recently-announced startups have it as their goal... 66
  • 67. Full Stack Deep Learning - UC Berkeley Spring 2021 Anyscale (makers of Ray, from Berkeley) 67
  • 68. Full Stack Deep Learning - UC Berkeley Spring 2021 Grid.ai 68
  • 69. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 69
  • 70. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Frameworks
  • 71. Full Stack Deep Learning - UC Berkeley Spring 2021 71 Deep Learning Frameworks • Unless you have a good reason not to, use Tensorflow/Keras or PyTorch • Both have converged to the same point: • easy development via define-by-run • multi-platform optimized execution graph • Today, most new projects use PyTorch, because of its more Python dev-friendly experience • fast.ai library builds on PyTorch with best practices • PyTorch-Lightning adds a powerful training loop Good for production Good for development Tensorflow 2.0 + eager execution + TorchScript PyTorch 1.0 Infrastructure & Tooling - Frameworks 3126 3126 3128 3129 312:
  • 72. Full Stack Deep Learning - UC Berkeley Spring 2021 PyTorch dominates new development 72 https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/ Infrastructure & Tooling - Frameworks
  • 73. Full Stack Deep Learning - UC Berkeley Spring 2021 PyTorch and Tensorflow 73
  • 74. Full Stack Deep Learning - UC Berkeley Spring 2021 Why do we need frameworks? • Deep Learning is not lot of code with a matrix math library (Numpy) • BUT: auto-differentiation and CUDA are a lot of work • ...Also all the layer types, optimizers, data interfaces, etc. 74
  • 75. Full Stack Deep Learning - UC Berkeley Spring 2021 • Recent project from Google that's gaining steam • Numpy + auto-differentiation and compilation to GPU/TPU code • Not just for deep learning! 75 https://github.com/google/jax#quickstart-colab-in-the-cloud
  • 76. Full Stack Deep Learning - UC Berkeley Spring 2021 Hugging face • Tons of NLP-focused model architectures (and pre-trained weights) for both PyTorch and Tensorflow 76
  • 77. Full Stack Deep Learning - UC Berkeley Spring 2021 Distributed Training • Using multiple GPUs and/or machines to train a single model. • More complex than simply running different experiments on different GPUs • A must-do on big datasets and large models 77 Infrastructure & Tooling - Distributed Training
  • 78. Full Stack Deep Learning - UC Berkeley Spring 2021 Data Parallelism • If iteration time is too long, try training in data parallel regime • "For convolution, expect 1.9x/3.5x speedup for 2/4 GPUs.
 http://timdettmers.com/2018/11/05/which-gpu-for-deep-learning/ 78 http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf https://lambdalabs.com/blog/titan-v-deep-learning-benchmarks/ Uijt!jt!bwfsbhf!pwfs! ejggfsfou!dpowofut Infrastructure & Tooling - Distributed Training
  • 79. Full Stack Deep Learning - UC Berkeley Spring 2021 Model Parallelism 79 http://www.cs.cmu.edu/~pengtaox/papers/petuum_15.pdf • Model parallelism is necessary when model does not fit on a single GPU • Introduces a lot of complexity and is usually not worth it. (But this is changing.) • Better to buy the largest GPU you can, and/or use gradient checkpointing Infrastructure & Tooling - Distributed Training
  • 80. Full Stack Deep Learning - UC Berkeley Spring 2021 Data-parallel PyTorch • Can be quite easy: 80 Infrastructure & Tooling - Distributed Training
  • 81. Full Stack Deep Learning - UC Berkeley Spring 2021 Data-parallel PyTorch-Lightning • Single node:
 python training/ run_experiment.py --gpus=8 -- accelerator=ddp • And using SLURM to run on multiple nodes:
 python training/ run_experiment.py --gpus=8 -- accelerator=ddp --nodes=4 81 Infrastructure & Tooling - Distributed Training
  • 82. Full Stack Deep Learning - UC Berkeley Spring 2021 Horovod • Distributed training framework for Tensorflow, Keras, and PyTorch • Uses MPI (standard multi-process communication framework) instead of Tensorflow parameter servers • Could be an easier experience for multi-node training 82 Infrastructure & Tooling - Distributed Training
  • 83. Full Stack Deep Learning - UC Berkeley Spring 2021 Ray (from Anyscale) • Ray is open-source project for effortless, stateful distributed computing in Python • From Berkeley! 83 https://ray.readthedocs.io/en/latest/walkthrough.html
  • 84. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 84
  • 85. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Experiment Management
  • 86. Full Stack Deep Learning - UC Berkeley Spring 2021 Experiment Management • Even running one experiment at a time, can lose track of which code, parameters, and dataset generated which trained model. • When running multiple experiments, problem is much worse. 86 Infrastructure & Tooling - Experiment Management https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb
  • 87. Full Stack Deep Learning - UC Berkeley Spring 2021 87 Tensorboard • A fine solution for single experiments • Gets unwieldy to manage many experiments, and to properly store past work Infrastructure & Tooling - Experiment Management
  • 88. Full Stack Deep Learning - UC Berkeley Spring 2021 88 MLFlow tracking Infrastructure & Tooling - Experiment Management https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb • Self-hosted solution from DataBricks
  • 89. Full Stack Deep Learning - UC Berkeley Spring 2021 89 Comet.ml Infrastructure & Tooling - Experiment Management
  • 90. Full Stack Deep Learning - UC Berkeley Spring 2021 90 • What we use in lab Infrastructure & Tooling - Experiment Management
  • 91. Full Stack Deep Learning - UC Berkeley Spring 2021 91 • Publish reports with embedded charts, figures, etc
  • 92. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - Hyperparameter Tuning
  • 93. Full Stack Deep Learning - UC Berkeley Spring 2021 Hyperparameter Optimization • Useful to have software that helps you search over hyper parameter settings. • Could be as simple as being able to provide `—lr=(0.0001, 0.1) -- num_layers=[128, 256, 512]` to training script • Would be even better if settings were selected intelligently, and underperforming runs stopped early. 93 Infrastructure & Tooling - Hyperparameter Tuning
  • 94. Full Stack Deep Learning - UC Berkeley Spring 2021 94 Infrastructure & Tooling - Hyperparameter Tuning
  • 95. Full Stack Deep Learning - UC Berkeley Spring 2021 Ray Tune • "Choose among scalable SOTA algorithms such as Population Based Training (PBT), Vizier’s Median Stopping Rule, HyperBand/ASHA." • These redirect compute resources toward promising areas of search space 95 Infrastructure & Tooling - Hyperparameter Tuning
  • 96. Full Stack Deep Learning - UC Berkeley Spring 2021 96 Infrastructure & Tooling - Hyperparameter Tuning • What we use in Lab
  • 97. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 97
  • 98. Full Stack Deep Learning - UC Berkeley Spring 2021 “All-in-one” Training/Evaluation Deployment or or Data Web Versioning Monitoring CI / Testing Exploration Sources Hyperparameter Tuning Processing Compute Labeling Software Engineering Edge Resource Management Experiment Management Data Lake / Warehouse Feature Store Frameworks & Distributed Training Infrastructure & Tooling - All-in-one
  • 99. Full Stack Deep Learning - UC Berkeley Spring 2021 All-in-one Solutions • Single system for everything • development (hosted notebook) • scaling experiments to many machines (sometimes even provisioning) • tracking experiments and versioning models • deploying models • monitoring performance 99 Infrastructure & Tooling - All-in-one
  • 100. Full Stack Deep Learning - UC Berkeley Spring 2021 100 Infrastructure & Tooling - All-in-one
  • 101. Full Stack Deep Learning - UC Berkeley Spring 2021 101 Infrastructure & Tooling - All-in-one
  • 102. Full Stack Deep Learning - UC Berkeley Spring 2021 102 https://www.slideshare.net/AmazonWebServices/build-train-and-deploy-machine-learning-models-at-scale Infrastructure & Tooling - All-in-one
  • 103. Full Stack Deep Learning - UC Berkeley Spring 2021 103 Infrastructure & Tooling - All-in-one
  • 104. Full Stack Deep Learning - UC Berkeley Spring 2021 104 Infrastructure & Tooling - All-in-one
  • 105. Full Stack Deep Learning - UC Berkeley Spring 2021 105 Infrastructure & Tooling - All-in-one
  • 106. Full Stack Deep Learning - UC Berkeley Spring 2021 106 Infrastructure & Tooling - All-in-one • Open source! • In-house distributed training module • HyperBand based hyperparameter tuning • Smart GPU scheduling, including spot instances • Experiment tracking
  • 107. Full Stack Deep Learning - UC Berkeley Spring 2021 Domino Data Lab 107 Provision compute Track experiments Infrastructure & Tooling - All-in-one
  • 108. Full Stack Deep Learning - UC Berkeley Spring 2021 108 Domino Data Lab Deploy REST API Monitor predictions Publish applets Infrastructure & Tooling - All-in-one
  • 109. Full Stack Deep Learning - UC Berkeley Spring 2021 109 Domino Data Lab Monitor spend All projects in one place Infrastructure & Tooling - All-in-one
  • 110. Full Stack Deep Learning - UC Berkeley Spring 2021 Natural place to go for most MLOps companies 110
  • 111. Full Stack Deep Learning - UC Berkeley Spring 2021 111 W&B Paperspace Gradient Floyd Determined.ai Domino Data Lab Amazon SageMaker GC ML Engine Hardware N/A Paperspace GCP Agnostic Agnostic AWS GCP Resource Management No Yes Yes Yes Yes Yes Yes Hyperparam Optimization Yes No No Yes Yes Yes Yes Storing Models Yes Yes Yes Yes Yes Yes Yes Reviewing Experiments Yes Yes Yes Yes Yes Yes Yes Deploying Models as REST API No No Yes No Yes Yes Yes Monitoring No No No No Yes Yes Yes Infrastructure & Tooling - All-in-one
  • 112. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 112
  • 113. Full Stack Deep Learning - UC Berkeley Spring 2021 Tooling Tuesdays 113 https://twitter.com/full_stack_dl
  • 114. Full Stack Deep Learning - UC Berkeley Spring 2021 Thank you! 114