SigOpt. Confidential.
Machine Learning Infrastructure
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
Alexandra Johnson
Tech Lead, Platform Team
Let's Build a Data Science Team!
• Who do we hire?
• What do we ask them to do?
• What does success look like?
Let's Build a Data Science Team!
Call out your answers!
• Who do we hire?
  • Statisticians
  • PhDs in science / math related fields
  • People interested in building models!
• What do we ask them to do?
  • Gather data
  • Build models
  • Extract insights
• What does success look like?
  • ML models driving business decisions
A Data Scientist Wants to Build a Model
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Train model
5. Analyze results
A typical model-building workflow for a data scientist working in a local development environment, such as their work laptop
A Data Scientist Wants to Build a Model
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Train model
5. Analyze results
Out of memory!
Out of memory errors could occur for a number of reasons, including:
• data set too large
• features too large
• model too large
Beyond memory concerns, a data scientist might not be able to train their model in their local development environment for other reasons:
• High degree of parallelism
• Specialized hardware (GPUs)
• Don't want to monopolize laptop resources
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
Data science work
In the new workflow, only half of the work relates to the data scientist's specialty
New Model Building Workflow
1. Gather data
2. Feature extraction
3. Pick ML framework
4. Spin up AWS EC2 instance
5. Set up machine
6. Launch training job
7. Analyze results
Infrastructure work
Half of the work here is infrastructure work, which is a separate field of engineering
Writing code to spin up AWS EC2 instances is very different from the team's original goal of "ML models driving business decisions"
When the Need for Infrastructure Scales Up
i.e. Is it really a big deal that a data scientist is ssh'ing into one EC2 instance?
• Want to close your laptop without accidentally stopping your model training
• Large datasets / features / models
• Specialized hardware (GPUs)
• High degree of parallelism helps projects finish faster
• Large teams pool access to compute resources to save money
Who is responsible for spinning up and managing
the data scientist's infrastructure?
Traditional Infrastructure Teams
• Who do we hire?
• What do we ask them to do?
• What does success look like?
Traditional Infrastructure Teams
Call out your answers!
• Who do we hire?
  • Systems experts
  • Backend engineers
  • People who love reliability and scalability!
• What do we ask them to do?
  • Reliability
  • Scalability
  • Performance
• What does success look like?
  • 99.99% uptime of API
  • 99.99% uptime of website
  • No data loss
The data science team feels the pain, but the
infrastructure team has pre-existing objectives
Machine Learning Infrastructure
Data science users / workloads + Infrastructure / devops tools = Machine learning infrastructure
Case Studies
Example: Hyperparameter Optimization
What is hyperparameter optimization?
• Every model has hyperparameters, aka configurations that you set before you train the model
• Different settings of hyperparameters produce different levels of model performance
• Hyperparameter optimization (HPO) is the search for the set of hyperparameters that produces the best model performance
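As a rough illustration of that search, here is a minimal random-search sketch in Python. The search space, the `toy_score` function, and every name in it are assumptions made up for this example; in a real project the scoring function would train a model and measure its performance.

```python
import random

# Hypothetical search space: (low, high) bounds for each hyperparameter.
SEARCH_SPACE = {
    "learning_rate": (0.001, 0.1),
    "num_hidden_layers": (1, 5),
}


def sample_configuration(rng):
    """Draw one random configuration from the search space."""
    lr_low, lr_high = SEARCH_SPACE["learning_rate"]
    layers_low, layers_high = SEARCH_SPACE["num_hidden_layers"]
    return {
        "learning_rate": rng.uniform(lr_low, lr_high),
        "num_hidden_layers": rng.randint(layers_low, layers_high),
    }


def toy_score(config):
    """Stand-in for 'train the model and measure performance'.

    Assumed for illustration only: the score is highest (zero) at a
    learning rate of 0.01 with 3 hidden layers, and negative elsewhere.
    """
    lr_penalty = abs(config["learning_rate"] - 0.01)
    layer_penalty = 0.01 * abs(config["num_hidden_layers"] - 3)
    return -(lr_penalty + layer_penalty)


def random_search(num_configurations, seed=0):
    """Score random configurations and return the best one found."""
    rng = random.Random(seed)
    configs = [sample_configuration(rng) for _ in range(num_configurations)]
    return max(configs, key=toy_score)


if __name__ == "__main__":
    print(random_search(100))
```

Random search is only one possible strategy; it is used here because it is short, not because the slides prescribe it.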
Example: Hyperparameter Optimization
Example hyperparameters
• Random Forest (sklearn)
  • Number of trees in a forest
  • Maximum depth per tree
• Elastic Net (sklearn)
  • Regularization coefficient
  • Weight of the l1 norm term
• Deep Learning Models (MXNet, TensorFlow, PyTorch)
  • Learning rate
  • Number of hidden layers
Example: Hyperparameter Optimization
• 100 configurations of hyperparameters x 1 hour of training time ≈ 4 days
• Start job Monday at noon, check results Friday at noon
• On the order of one week
Parallelism reduces wall clock time
• 100 configurations of hyperparameters / 6 machines x 1 hour training time ≈ 17 hours
• Start job Monday at noon, check results Tuesday morning
• On the order of one day
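The arithmetic behind these estimates can be sketched as below. The function name and the simplifying assumptions (every training run takes the same time, and each machine trains one configuration at a time) are mine, not from the slides.

```python
import math


def wall_clock_hours(num_configurations, hours_per_training, num_machines=1):
    """Estimate wall-clock time when configurations are spread across machines.

    Assumes equal-length training runs and one run per machine at a time,
    so the job finishes after ceil(configurations / machines) rounds.
    """
    rounds = math.ceil(num_configurations / num_machines)
    return rounds * hours_per_training


serial_hours = wall_clock_hours(100, 1)       # 100 hours, a little over 4 days
parallel_hours = wall_clock_hours(100, 1, 6)  # 17 rounds of 1 hour = 17 hours
```

With 6 machines, the last round trains only 4 configurations, which is why the estimate rounds 100 / 6 up to 17.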
Case Study: Data Scientists Build Incrementally
• In 2017, every new machine learning project at SigOpt produced a new machine learning infrastructure tool
• Code to launch HPO projects was never the primary focus of the project
• Case studies here cover common architecture choices seen among at least four tools
Data Scientist: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use scp to send data, code, and a setup script from the local environment to every remote machine
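A minimal sketch of what such a script might look like, assuming the tool simply shells out to `scp -r` once per machine. The host addresses, the `ubuntu` user, and the file names in the usage note are placeholders, not details from the original tools.

```python
import subprocess


def build_scp_commands(hosts, local_paths, remote_dir, user="ubuntu"):
    """Return one scp argument list per host, copying every local path recursively."""
    return [
        ["scp", "-r", *local_paths, f"{user}@{host}:{remote_dir}"]
        for host in hosts
    ]


def copy_to_all(hosts, local_paths, remote_dir, user="ubuntu"):
    """Run the scp commands one host at a time, stopping on the first failure."""
    for command in build_scp_commands(hosts, local_paths, remote_dir, user):
        subprocess.run(command, check=True)
```

For example, `copy_to_all(["10.0.0.1", "10.0.0.2"], ["data", "train.py", "setup.sh"], "/home/ubuntu/project")` would send the same three paths to both machines, assuming the remote directory already exists.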
Data Scientist: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use ssh to run commands on remote instances
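One way this could look in practice, sketched under the assumption that the launcher runs `ssh user@host command` as a local subprocess for each instance. The user name and the remote command in the usage note are hypothetical.

```python
import subprocess


def build_ssh_command(host, remote_command, user="ubuntu"):
    """Return the argument list that runs remote_command on host over ssh."""
    return ["ssh", f"{user}@{host}", remote_command]


def launch_on_all(hosts, remote_command, user="ubuntu"):
    """Start the same command on every host, then wait for all of them.

    Returns the exit code reported for each host, in order.
    """
    processes = [
        subprocess.Popen(build_ssh_command(host, remote_command, user))
        for host in hosts
    ]
    return [process.wait() for process in processes]
```

A call such as `launch_on_all(hosts, "bash setup.sh && python train.py")` keeps one local ssh process open per machine, so the training runs are tied to the laptop that launched them.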
Data Scientist: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Rely on third-party APIs to track metadata, and run ML training processes in tmux so logs can be viewed later
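The tmux half of this solution might be sketched as follows. The session name, training command, and log file are illustrative assumptions; the metadata-tracking API calls are not shown because the slides do not name the service.

```python
import shlex


def build_tmux_launch(session_name, training_command, log_path="train.log"):
    """Return a shell command that starts training in a detached tmux session.

    The training output is also written to log_path so it can be read later.
    """
    logged_command = f"{training_command} 2>&1 | tee {shlex.quote(log_path)}"
    return (
        f"tmux new-session -d -s {shlex.quote(session_name)} "
        f"{shlex.quote(logged_command)}"
    )
```

The returned string is meant to be run on the remote machine, for example as the command passed to ssh. The session can be inspected later with `tmux attach -t <session name>`.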
Data Science Solution: Pros and Cons
Pros
• Simple design
• Data scientist has full understanding of their tool
• Data scientist has full control over their tool
• No external dependencies to build features or fix bugs
Cons
• Few debugging tools
• Decentralized logs
• Not scalable
• Closing laptop during long-running commands loses progress
• Difficult to set organization-level standards
"Creating shared services also creates
dependencies and can impinge on autonomy"
- Marty Cagan, Inspired
Case Study: Infrastructure Engineer Overhaul
• Infrastructure engineers started a dedicated effort to build tools for launching HPO jobs in 2018
• Viewed as an overhaul of previous infrastructure management tools
• Resulting product was SigOpt Orchestrate
Infrastructure Engineer: Set Up Machines
Problem: Set up code and dependencies on each remote machine
Solution: Use Docker to containerize model development environment
[Diagram label: Registry]
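A hedged sketch of the build-and-publish step, assuming the image is built locally with `docker build` and shared with the remote machines through a registry via `docker push`. The registry host and image name in the usage note are placeholders, and this is not a description of how SigOpt Orchestrate works internally.

```python
import subprocess


def image_tag(registry, name, version):
    """Compose a full image reference such as registry/name:version."""
    return f"{registry}/{name}:{version}"


def build_docker_commands(tag, context_dir="."):
    """Return the commands that build an image and push it to its registry."""
    return [
        ["docker", "build", "-t", tag, context_dir],
        ["docker", "push", tag],
    ]


def build_and_push(tag, context_dir="."):
    """Run the build and push commands in order, stopping on the first failure."""
    for command in build_docker_commands(tag, context_dir):
        subprocess.run(command, check=True)
```

For example, `build_and_push(image_tag("registry.example.com", "hpo-model", "v1"))` would package the directory containing the Dockerfile once, so every machine pulls the same environment instead of running its own setup script.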
Infrastructure Engineer: Launch Job
Problem: Start training ML model on each remote instance
Solution: Use Kubernetes to provide a uniform interface to the cluster
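To make the "uniform interface" concrete, the sketch below builds a Kubernetes `batch/v1` Job manifest as a Python dictionary. The job name, image, and command are hypothetical, and this is an illustration of the general approach rather than the manifest Orchestrate actually generates.

```python
import json


def build_job_manifest(name, image, command, num_configurations, parallelism):
    """Build a Kubernetes Job that runs num_configurations pods to completion,
    with at most `parallelism` pods running at the same time."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": num_configurations,
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                    # Let the Job controller replace failed pods rather than
                    # restarting containers in place.
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = build_job_manifest(
    name="hpo-example",
    image="registry.example.com/hpo-model:v1",
    command=["python", "train.py"],
    num_configurations=100,
    parallelism=6,
)
manifest_json = json.dumps(manifest, indent=2)
```

Saved to a file, the JSON could be submitted with `kubectl apply -f <file>`. How each pod learns which hyperparameter configuration to train is left out of this sketch.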
Infrastructure Engineer: View Progress and Debug
Problem: View the status of a job at a glance
Solution: Build a command line interface (CLI) that abstracts away infrastructure tools
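A small sketch of what such an abstraction could look like, using Python's `argparse`. The program name `mljobs` and its subcommands are invented for this example and are not the real Orchestrate CLI; the point is only that the user types job-level commands while the tool translates them into `kubectl` calls.

```python
import argparse


def build_parser():
    """Create a CLI with job-level subcommands that hide the cluster tooling."""
    parser = argparse.ArgumentParser(
        prog="mljobs", description="Hypothetical CLI for training jobs"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    run = subparsers.add_parser("run", help="submit a job from a manifest file")
    run.add_argument("manifest", help="path to the job manifest")

    status = subparsers.add_parser("status", help="show the status of a job")
    status.add_argument("job_name")

    logs = subparsers.add_parser("logs", help="print the logs of a job")
    logs.add_argument("job_name")
    return parser


def to_kubectl(args):
    """Translate parsed arguments into the underlying kubectl command."""
    if args.command == "run":
        return ["kubectl", "apply", "-f", args.manifest]
    if args.command == "status":
        return ["kubectl", "get", "job", args.job_name]
    return ["kubectl", "logs", f"job/{args.job_name}"]
```

A user would type something like `mljobs status hpo-example` and never see the `kubectl` command behind it, which is also why the underlying technologies can stay unfamiliar to them.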
Infrastructure Engineer Solution: Pros and Cons
Pros
• Pre-existing APIs lead to rapid feature development
• Debugging tools
• Highly scalable
• User can close laptop and job still runs
• Easy to install
Cons
• Data scientist may not understand underlying technologies (Docker and Kubernetes)
• External dependency on infrastructure team to build new features and fix bugs
• Difficult to onboard
Looking Forwards
Machine Learning Infrastructure requires a tight user
feedback loop
ML Infrastructure Within Large Companies
• Google's Borg
• Uber's Michelangelo
• Airbnb's Bighead
• Lyft's ML Platform
Open Source ML Infrastructure Projects
• Polyaxon
• Kubeflow
• MLflow
Further Reading
• Paper: Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization
  https://arxiv.org/abs/1812.07751
• Blog Post: Machine Learning Infrastructure Tools for Hyperparameter Optimization
  https://sigopt.com/blog/machine-learning-infrastructure-tools-for-hyperparameter-optimization/
• Talk: Reducing Operational Barriers to Model Training
  https://mlconf.com/sessions/reducing-operational-barriers-to-model-training/
Takeaways
• Data scientists built tools that were brittle, but allowed for great freedom
• Infrastructure engineers built tools that suffered usability issues
• Successful teams will have a tight feedback loop between infrastructure engineers and data science users
I Want to Learn From You!
I'm around Ann Arbor until about 5pm tomorrow!
I'd love to stop by your office and learn about your work in data science / ML
Email alexandra@sigopt.com or talk to me right after this to set up a time
Thank you!
Any questions?
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
