Machine Learning Infrastructure

ML Infrastructure
alexandra@sigopt.com | @alexandraj777

About Me
● Alum of Carnegie Mellon SCS
● Joined SigOpt in 2015
● Tech lead for the Platform Team, handling
frontend, backend, infrastructure and
testing
● Recent project in ML Infrastructure: SigOpt
Orchestrate
● Co-organizer for Bay Area chapter of
Women in Machine Learning and Data
Science (join us!)

ML Infrastructure
solves data scientists'
problems using
infrastructure tools

Challenge:
● Data scientists want to maximize the
performance of their models
● SigOpt provides an API for
hyperparameter optimization (HPO)
● SigOpt HPO helps data scientists
maximize the performance of their models!
● Data scientists need to use clusters to
properly perform HPO
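
To make the workload concrete: an HPO loop repeatedly proposes a set of hyperparameters, trains and evaluates the model, and keeps track of the best result. The sketch below uses plain random search as a stand-in for the optimizer. It is not SigOpt's API; `suggest`, `evaluate`, `optimize`, and the parameter bounds are hypothetical names chosen for illustration.

```python
import random

# Hypothetical search space: parameter name -> (low, high) bounds.
BOUNDS = {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}


def suggest(rng):
    """Stand-in optimizer: draw each parameter uniformly from its bounds."""
    return {name: rng.uniform(low, high) for name, (low, high) in BOUNDS.items()}


def evaluate(params):
    """Toy objective standing in for 'train the model and return its score'."""
    return -(params["learning_rate"] - 0.01) ** 2 - (params["dropout"] - 0.2) ** 2


def optimize(budget, seed=0):
    """Run `budget` suggest/evaluate rounds and keep the best observation."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(budget):
        params = suggest(rng)
        value = evaluate(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


best_params, best_value = optimize(budget=30)
print(best_params, best_value)
```

Each evaluation in this sketch is an independent training run, so several can execute at once on separate machines, which is where a cluster comes in.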

Data scientists specialize in:
● Gathering data
● Building models
● Extracting business insights
Infrastructure engineers specialize in:
● Building shared tools
● Application scalability and performance
● Keeping track of interactions between
large distributed systems

Case Study: Building SigOpt Orchestrate
● Project started in 2018 to bridge ML and
infrastructure
● What problems did our customers ask us to
solve?
● How did a challenge for the user turn into a
technical problem?
● Which tools / technologies did we use?

Challenge #1: Can't Train Model on Laptop
Problem: Set up each remote machine
Initial Solution:
● Write a setup script to install
dependencies
● SCP data, code, and setup script to every
remote machine

Solution #1: Containerize!
Problem: Set up each remote machine
New Solution:
● Containerize code and dependencies on
the user's local environment
● Push the container to a registry
● Each machine pulls the container from a
registry
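
The build, push, and pull cycle above can be sketched as the container CLI commands involved. This assumes Docker is the container tool, which the slides do not state; the image name `registry.example.com/team/hpo-model:v1` and both helper functions are hypothetical.

```python
import shlex


def local_commands(image, context_dir="."):
    """Commands run once on the user's machine: build the image, then push it."""
    return [
        ["docker", "build", "-t", image, context_dir],
        ["docker", "push", image],
    ]


def remote_commands(image):
    """Commands run on every remote machine: pull the image, then run it."""
    return [
        ["docker", "pull", image],
        ["docker", "run", "--rm", image],
    ]


# Hypothetical image reference; substitute a real registry and tag.
IMAGE = "registry.example.com/team/hpo-model:v1"

for cmd in local_commands(IMAGE) + remote_commands(IMAGE):
    # Printed rather than executed, so the sketch runs without Docker installed.
    print(shlex.join(cmd))
```

Compared with the setup-script approach, dependencies are installed once at build time, and every machine runs the same image.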

Challenge #2: Start Training in Parallel
Problem: Kick off the hyperparameter
optimization job on six machines at once
Initial Solution:
● Open a tmux window on every remote
instance
● Send the command to run the setup script
over SSH into each tmux window
● Send the command to train the model
over SSH into each tmux window

Solution #2: Kubernetes!
Problem: Kick off the hyperparameter
optimization job on six machines at once
New Solution:
● Spin up AWS EKS (Kubernetes) cluster
● Create a job spec
○ "run 6 copies of this container at the same
time"
● Submit job spec to Kubernetes API
● Kubernetes starts the job on the cluster
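
A job spec of the kind described above might look like the following Kubernetes `batch/v1` Job manifest, built here as a Python dictionary and printed as JSON. The job name, container name, and image are hypothetical; the spec that Orchestrate actually generates is not shown in the slides.

```python
import json


def make_job_manifest(name, image, copies):
    """Build a Job that runs `copies` pods of the same container in parallel."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": copies,  # pods allowed to run at the same time
            "completions": copies,  # pods that must finish successfully
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    "restartPolicy": "Never",  # Jobs require Never or OnFailure
                }
            },
        },
    }


# Hypothetical job name and image reference.
manifest = make_job_manifest(
    "hpo-job", "registry.example.com/team/hpo-model:v1", copies=6
)
print(json.dumps(manifest, indent=2))
```

Saving the output to a file such as `job.json` and running `kubectl apply -f job.json` is one way to submit it, since `kubectl` accepts JSON as well as YAML.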

Challenge #3: View Progress and Debug
Problem: View the status of a hyperparameter
optimization job at a glance
Initial Solution:
● Save hostname and error information as
metadata in calls to external API
● SSH into machines and view the logs
directly (pre-Kubernetes)
● Use Kubernetes CLI to view logs

Solution #3: Build a CLI!
Problem: View the status of a hyperparameter
optimization job at a glance
New Solution:
● Write an interface for the data scientist to
interact with the infrastructure tool
● We chose a command line interface
● Serves as an abstraction on top of
Kubernetes APIs + external APIs
● Commands shown in the screenshots:
○ sigopt logs <experiment_id>
○ sigopt status <experiment_id>
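
A command line interface with these two subcommands could be laid out as below using Python's `argparse`. Only the `logs <experiment_id>` and `status <experiment_id>` shapes come from the slide; the handler bodies are placeholders, and this is not the actual SigOpt CLI implementation.

```python
import argparse


def build_parser():
    """Create a parser with `logs` and `status` subcommands."""
    parser = argparse.ArgumentParser(prog="sigopt")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in [
        ("logs", "show logs for an experiment"),
        ("status", "show job status for an experiment"),
    ]:
        sub = subparsers.add_parser(name, help=help_text)
        sub.add_argument("experiment_id")
    return parser


def run(argv):
    """Dispatch to a placeholder handler and return its message."""
    args = build_parser().parse_args(argv)
    if args.command == "logs":
        # A real tool would query the Kubernetes API for pod logs here.
        return f"would fetch logs for experiment {args.experiment_id}"
    # A real tool would combine Kubernetes and external API data here.
    return f"would summarize status for experiment {args.experiment_id}"


print(run(["status", "12345"]))
```

Keeping the Kubernetes calls behind two verbs means the data scientist never has to look up pod names or run cluster commands directly.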

Final Thoughts...
Paper: Orchestrate: Infrastructure for Enabling
Parallelism during Hyperparameter Optimization,
Alexandra Johnson and Michael McCourt
SigOpt is free for academics!
We're hiring research engineers/interns and
software engineers/interns!
