Machine Learning Infrastructure

SigOpt. Confidential.
Alexandra Johnson
alexandra@sigopt.com @alexandraj777

Alexandra Johnson
Tech Lead, Platform Team

Let's Build a Data Science Team!
• Who do we hire?
• What do we ask them to do?
• What does success look like?

Let's Build a Data Science Team!
Call out your answers!
• Who do we hire?
  • Statisticians
  • PhDs in science / math related fields
  • People interested in building models!
• What do we ask them to do?
  • Gather data
  • Build models
  • Extract insights
• What does success look like?
  • ML models driving business decisions

A Data Scientist Wants to Build a Model
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Train model
5. Analyze results
A typical model-building workflow for a data scientist working in a local development environment, such as their work laptop

A Data Scientist Wants to Build a Model
1. Gather data
4. Train model
5. Analyze results
Out of memory!
Out of memory errors could occur for a number of reasons, including:
• data set too large
• features too large
• model too large

Beyond memory concerns, a data scientist might not be able to train their model in their local development environment for other reasons:
• High degree of parallelism
• Specialized hardware (GPUs)
• Don't want to monopolize laptop resources

New Model Building Workflow
1. Gather data
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results

1. Gather data
5. Set up machine
7. Analyze results
Data Science work
In the new workflow, only half of the work relates to the data scientist's specialty

1. Gather data
5. Set up machine
7. Analyze results
Infrastructure work
Half of the work here is infrastructure work, which is a separate field of engineering
Writing code to spin up AWS EC2 instances is very different from the team's original goal of "ML models driving business decisions"

When the Need for Infrastructure Scales Up
i.e. Is it really a big deal that a data scientist is ssh'ing into one EC2 instance?
• Want to close your laptop without accidentally stopping your model training
• Large datasets / features / models
• Specialized hardware (GPUs)
• High degree of parallelism helps projects finish faster
• Large teams pool access to compute resources to save money

Who is responsible for spinning up and managing the data scientist's infrastructure?

Traditional Infrastructure Teams
• Who do we hire?

Traditional Infrastructure Teams
Call out your answers!
• Who do we hire?
  • Systems experts
  • Backend engineers
  • People who love reliability and scalability!
• What do we ask them to do?
  • Reliability
  • Scalability
  • Performance
• What does success look like?
  • 99.99% uptime of API
  • 99.99% uptime of website
  • No data loss

The data science team feels the pain, but the infrastructure team has pre-existing objectives

Machine Learning Infrastructure
Data science users / workloads + Infrastructure / devops tools = Machine learning infrastructure

Case Studies

Example: Hyperparameter Optimization
What is hyperparameter optimization?
• Every model has hyperparameters, aka configurations that you set
before you train the model
• Different settings of hyperparameters produce different levels of model
performance
• Hyperparameter optimization (HPO) is the search for the set of
hyperparameters that produces the best model performance
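
As a minimal sketch of the search described above, the following runs random search over two hyperparameters. The `train_and_score` function is a hypothetical stand-in (not from the talk) for training a real model and returning a validation metric, and random search is only one simple HPO strategy among many.

```python
import random


def train_and_score(learning_rate, num_layers):
    """Hypothetical stand-in for training a model and returning a score.

    A real version would train a model with these settings and return a
    validation metric. This toy function peaks at learning_rate=0.1 and
    num_layers=3.
    """
    return -((learning_rate - 0.1) ** 2) - 0.01 * (num_layers - 3) ** 2


def random_search(num_configs, seed=0):
    """Try num_configs random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_configs):
        config = {
            "learning_rate": rng.uniform(0.001, 1.0),
            "num_layers": rng.randint(1, 8),
        }
        score = train_and_score(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(num_configs=100)
print(best_config, best_score)
```

Each call to `train_and_score` is independent of the others, which is what makes this kind of search easy to spread across machines.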

Example hyperparameters
• Random Forest (sklearn)
  • Number of trees in a forest
  • Maximum depth per tree
• Elastic Net (sklearn)
  • Regularization coefficient
  • Weight of the l1 norm term
• Deep Learning Models (MXNet, TensorFlow, PyTorch)
  • Learning rate
  • Number of hidden layers
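
For illustration, the hyperparameters listed above could be written down as a search space in plain Python. The scikit-learn names (`n_estimators`, `max_depth`, `alpha`, `l1_ratio`) are constructor arguments of `RandomForestClassifier` and `ElasticNet`; the deep learning names and every range below are assumptions chosen for this sketch.

```python
# Each entry maps a model family to hyperparameter names and (low, high) ranges.
# All ranges are illustrative assumptions, not recommendations.
SEARCH_SPACES = {
    "random_forest": {
        "n_estimators": (10, 500),  # number of trees in the forest
        "max_depth": (2, 32),       # maximum depth per tree
    },
    "elastic_net": {
        "alpha": (1e-4, 10.0),      # regularization coefficient
        "l1_ratio": (0.0, 1.0),     # weight of the l1 norm term
    },
    "deep_learning": {
        "learning_rate": (1e-5, 1e-1),  # hypothetical name
        "num_hidden_layers": (1, 10),   # hypothetical name
    },
}


def count_hyperparameters(spaces):
    """Return how many hyperparameters each model family exposes."""
    return {model: len(params) for model, params in spaces.items()}


print(count_hyperparameters(SEARCH_SPACES))
```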

Parallelism reduces wall clock time
One machine:
• 100 configurations of hyperparameters x 1 hour of training time ≈ 4 days
• Start job Monday at noon, check results Friday at noon
• On the order of one week
Six machines:
• 100 configurations of hyperparameters / 6 machines x 1 hour of training time ≈ 17 hours
• Start job Monday at noon, check results Tuesday morning
• On the order of one day
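
The arithmetic above can be checked with a short sketch. It assumes every training run takes the same time and that each machine trains one configuration at a time; the function name is hypothetical.

```python
import math


def wall_clock_hours(num_configs, hours_per_config, num_machines=1):
    """Estimate wall-clock hours when configurations run in parallel batches.

    Assumes uniform training time and one configuration per machine at a time.
    """
    batches = math.ceil(num_configs / num_machines)
    return batches * hours_per_config


sequential = wall_clock_hours(100, 1)   # one machine
parallel = wall_clock_hours(100, 1, 6)  # six machines
print(sequential / 24, "days on one machine")
print(parallel, "hours on six machines")
```

With six machines the last batch is only partly full (100 is not a multiple of 6), which is why the estimate rounds up to 17 batches.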

Case Study: Data Scientists Build Incrementally
• In 2017, every new machine learning project at SigOpt produced a new machine learning infrastructure tool
• Code to launch HPO projects was never the primary focus of the project
• Case studies here cover common architecture choices seen among at least four tools

Data Scientist: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use scp to send data, code, and setup script from local environment to every remote machine
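
A minimal sketch of the scp approach, assuming hypothetical host addresses and file names. It only builds the commands; each one could then be run from the local environment with `subprocess.run(cmd, check=True)`.

```python
import shlex


def build_scp_commands(hosts, local_paths, remote_dir="job"):
    """Build one `scp -r` command per remote machine.

    Assumes remote_dir already exists in the remote user's home directory.
    Nothing is executed here.
    """
    return [["scp", "-r", *local_paths, f"{host}:{remote_dir}"] for host in hosts]


hosts = ["ubuntu@10.0.0.1", "ubuntu@10.0.0.2"]  # hypothetical machines
commands = build_scp_commands(hosts, ["data/", "train.py", "setup.sh"])
for cmd in commands:
    print(shlex.join(cmd))
```

Because the copy is repeated once per machine, the amount of work grows with the number of machines.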

Data Scientist: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use ssh to run commands on remote instances
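
One way the ssh step could look, sketched with hypothetical hosts, script name, and working directory. The commands are built but not executed.

```python
import shlex


def build_ssh_command(host, remote_command):
    """Build an `ssh` invocation that runs remote_command on host."""
    return ["ssh", host, remote_command]


def build_launch_commands(hosts, script="train.py", workdir="job"):
    """Build one ssh command per machine to start the training script."""
    remote = f"cd {shlex.quote(workdir)} && python {shlex.quote(script)}"
    return [build_ssh_command(host, remote) for host in hosts]


hosts = ["ubuntu@10.0.0.1", "ubuntu@10.0.0.2"]  # hypothetical machines
launch_commands = build_launch_commands(hosts)
for cmd in launch_commands:
    print(cmd)
```

In this form the remote process is tied to the ssh session, so a dropped connection would stop training unless the process is detached on the remote side.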

Data Scientist: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Rely on third-party APIs to track metadata, run ML training processes in tmux so logs can be viewed later
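
A rough sketch of the tmux part, assuming a hypothetical session name, host, and training command. `tmux new-session -d -s` starts a detached session and `tmux capture-pane -p -t` prints the pane's current contents.

```python
import shlex


def tmux_start(session, command):
    """Remote shell command that starts `command` in a detached tmux session."""
    return f"tmux new-session -d -s {shlex.quote(session)} {shlex.quote(command)}"


def tmux_peek(session):
    """Remote shell command that prints the session's current pane contents."""
    return f"tmux capture-pane -p -t {shlex.quote(session)}"


host = "ubuntu@10.0.0.1"  # hypothetical machine
start = ["ssh", host, tmux_start("hpo", "python train.py")]
peek = ["ssh", host, tmux_peek("hpo")]
print(start)
print(peek)
```

The session ends when the command exits, so a real setup might also redirect output to a file. Logs still live on each machine separately, so checking progress means visiting every host.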

Data Science Solution: Pros and Cons
Pros
• Simple design
• Data scientist has full understanding of their tool
• Data scientist has full control over their tool
• No external dependencies to build features or fix bugs
Cons
• Few debugging tools
• Decentralized logs
• Not scalable
• Closing laptop during long-running commands loses progress
• Difficult to set organization-level standards

"Creating shared services also creates dependencies and can impinge on autonomy"
- Marty Cagan, Inspired

Case Study: Infrastructure Engineer Overhaul
• Infrastructure engineers started a dedicated effort to build tools for launching HPO jobs in 2018
• Viewed as an overhaul of previous infrastructure management tools
• Resulting product was SigOpt Orchestrate

Infrastructure Engineer: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use Docker to containerize model development environment
[Diagram: Registry]
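
As a hedged sketch of the Docker step, the following builds the two commands typically involved: `docker build -t` to create an image from a directory containing a Dockerfile, and `docker push` to upload it to a registry. The image name is hypothetical, and this is not necessarily how SigOpt Orchestrate does it.

```python
def docker_build_and_push(image, context_dir="."):
    """Return the docker commands to build an image and push it to a registry.

    Assumes context_dir contains a Dockerfile. Nothing is executed here.
    """
    return [
        ["docker", "build", "-t", image, context_dir],
        ["docker", "push", image],
    ]


image = "registry.example.com/team/model:v1"  # hypothetical image name
docker_commands = docker_build_and_push(image)
for cmd in docker_commands:
    print(" ".join(cmd))
```

Once the image is in a registry, every machine can pull the same environment instead of being set up individually.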

Infrastructure Engineer: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use Kubernetes to provide a uniform interface to the cluster
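
To make the "uniform interface" concrete, here is a sketch of a Kubernetes `batch/v1` Job manifest built as a Python dictionary. The job name, image, and counts are assumptions for illustration; the talk does not show the manifests Orchestrate generates.

```python
import json


def training_job_manifest(name, image, parallelism, completions, gpus=0):
    """Build a Kubernetes Job manifest that runs the training container.

    parallelism is how many pods run at once; completions is how many
    successful pod runs the Job needs in total.
    """
    container = {"name": "train", "image": image}
    if gpus:
        # Request GPUs through the NVIDIA device plugin resource name.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": parallelism,
            "completions": completions,
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            },
        },
    }


manifest = training_job_manifest(
    "hpo-job", "registry.example.com/team/model:v1", parallelism=6, completions=100
)
print(json.dumps(manifest, indent=2))
```

The JSON output could be saved to a file and submitted with `kubectl apply -f`, after which the cluster decides which machines run the pods.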

Infrastructure Engineer: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Build a command line interface (CLI) that abstracts away infrastructure tools
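
A small sketch of what such a CLI might look like, using `argparse` to map friendly subcommands onto `kubectl` invocations. The tool name `mlrun` and its subcommands are hypothetical and are not the actual SigOpt Orchestrate interface.

```python
import argparse


def build_parser():
    """Build a parser with run, status, and logs subcommands."""
    parser = argparse.ArgumentParser(prog="mlrun")  # hypothetical tool name
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="launch a training job")
    run.add_argument("manifest", help="path to a job manifest file")
    for name in ("status", "logs"):
        cmd = sub.add_parser(name, help=f"show job {name}")
        cmd.add_argument("job", help="job name")
    return parser


def to_kubectl(args):
    """Translate parsed CLI arguments into the underlying kubectl command."""
    if args.command == "run":
        return ["kubectl", "apply", "-f", args.manifest]
    if args.command == "status":
        return ["kubectl", "get", "job", args.job]
    return ["kubectl", "logs", f"job/{args.job}"]


args = build_parser().parse_args(["status", "hpo-job"])
print(to_kubectl(args))
```

The user only needs to learn the three subcommands; the trade-off is that the infrastructure tools underneath stay hidden when something goes wrong.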

Infrastructure Engineer Solution: Pros and Cons
Pros
• Pre-existing APIs lead to rapid feature development
• Debugging tools
• Highly scalable
• User can close laptop and job still runs
• Easy to install
Cons
• Data scientist may not understand underlying technologies (Docker and Kubernetes)
• External dependency on infrastructure team to build new features and fix bugs
• Difficult to onboard

Looking Forwards

Machine Learning Infrastructure requires a tight user feedback loop

ML Infrastructure Within Large Companies
• Google's Borg
• Uber's Michelangelo
• Airbnb's Bighead
• Lyft's ML Platform

Open Source ML Infrastructure Projects
• Polyaxon
• Kubeflow
• MLflow

Further Reading
• Paper: Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization https://arxiv.org/abs/1812.07751
• Blog Post: Machine Learning Infrastructure Tools for Hyperparameter Optimization https://sigopt.com/blog/machine-learning-infrastructure-tools-for-hyperparameter-optimization/
• Talk: Reducing Operational Barriers to Model Training https://mlconf.com/sessions/reducing-operational-barriers-to-model-training/

Takeaways
• Data scientists built tools that were brittle, but allowed for great freedom
• Infrastructure engineers built tools that suffered usability issues
• Successful teams will have a tight feedback loop between infrastructure engineers and data science users

I Want to Learn From You!
I'm around Ann Arbor until about 5pm tomorrow!
I'd love to stop by your office and learn about your work in data science / ML
Email alexandra@sigopt.com or talk to me right after this to set up a time

Thank you!
Any questions?
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
