In this talk at MLconf NYC, Alexandra Johnson, platform engineering lead at SigOpt, discusses common operational challenges with scaling model training and how solutions are designed to
3. SigOpt. Confidential.3
Operational Barriers
Machine learning experts specialize in:
• Gathering data
• Building models
• Extracting insights
Infrastructure engineers specialize in:
• Building shared tools
• Application scalability and performance
• Keeping track of interactions between large distributed systems
The Challenge:
• Machine learning experts want to maximize the performance of their models
• SigOpt provides an API for hyperparameter optimization (HPO)
• SigOpt HPO helps ML experts maximize the performance of their models!
• ML experts need to use clusters to properly perform HPO
Case Study: Building SigOpt Orchestrate
• Project started in 2018 to bridge ML and infrastructure
• What problems did our customers ask us to solve?
• How did a challenge for the user turn into a technical problem?
• Which tools / technologies did we use?
Challenge #1: Can't Train Model on Laptop
Problem: Set up each remote machine
Initial Solution:
• Write a setup script to install dependencies
• SCP data, code, and setup script to every remote machine
Solution #1: Containerize!
Problem: Set up each remote machine
New Solution:
• Containerize code and dependencies on the user's local environment
• Push the image to a registry
• Each machine pulls the image from the registry
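The build, push, and pull steps above can be sketched as the Docker CLI commands they typically map to. This is a minimal illustration, not the tooling shown in the talk: the image name and registry host are hypothetical, and the script only constructs and prints the commands rather than running them.

```python
import shlex


def container_workflow(image, context_dir="."):
    """Return the docker commands for each step as argument lists."""
    return {
        # Run on the user's machine: package code and dependencies.
        "build": ["docker", "build", "-t", image, context_dir],
        # Run on the user's machine: upload the image to the registry.
        "push": ["docker", "push", image],
        # Run on every remote machine: download the same image.
        "pull": ["docker", "pull", image],
    }


# Hypothetical image name; substitute a registry you control.
IMAGE = "registry.example.com/team/hpo-model:v1"

commands = container_workflow(IMAGE)
for step, argv in commands.items():
    print(f"{step}: {shlex.join(argv)}")
```

Because every machine pulls the same image, there is no per-machine setup script to keep in sync.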
Challenge #2: Start Training in Parallel
Problem: Kick off the hyperparameter optimization job on six machines at once
Initial Solution:
• Open a tmux window on every remote instance
• SSH over command to run setup script into each tmux window
• SSH over command to train model into each tmux window
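To make the manual workflow concrete, the following sketch builds one SSH command per machine that starts a detached tmux session running the setup script and then the training script. It approximates the steps above rather than reproducing them: the hostnames, session name, and script names are hypothetical, and the commands are printed, not executed.

```python
import shlex

# Hypothetical hostnames for the six remote machines.
HOSTS = [f"worker-{i}.example.com" for i in range(1, 7)]


def remote_tmux_command(host, session, command):
    """Build an ssh command that runs `command` in a detached tmux session."""
    # The remote shell receives a single string, so the inner command is quoted.
    remote = f"tmux new-session -d -s {session} {shlex.quote(command)}"
    return ["ssh", host, remote]


# Hypothetical script names: run setup first, then training.
TRAIN = "bash setup.sh && python train.py"

launch_commands = [remote_tmux_command(h, "hpo", TRAIN) for h in HOSTS]
for argv in launch_commands:
    print(shlex.join(argv))
```

Every additional machine means another SSH command and another session to watch, which is the overhead the next slide addresses.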
Solution #2: Kubernetes!
Problem: Kick off the hyperparameter optimization job on six machines at once
New Solution:
• Spin up AWS EKS (Kubernetes) cluster
• Create a job spec
○ "run 6 copies of this container at the same time"
• Submit job spec to Kubernetes API
• Kubernetes starts the job on the cluster
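A job spec that says "run 6 copies of this container at the same time" might look like the Kubernetes `batch/v1` Job manifest below, built here as a Python dictionary and printed as JSON. This is a sketch under stated assumptions: the job name, container name, and image are hypothetical, and setting `completions` equal to `parallelism` is one possible choice, not necessarily what Orchestrate does.

```python
import json


def make_job_spec(name, image, copies):
    """Build a Kubernetes Job manifest that runs `copies` pods in parallel."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of pods allowed to run at the same time.
            "parallelism": copies,
            # Assumption: the job is done once each copy has succeeded once.
            "completions": copies,
            "template": {
                "spec": {
                    "containers": [{"name": "trainer", "image": image}],
                    # Job pods must use Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical job and image names.
job_spec = make_job_spec("hpo-job", "registry.example.com/team/hpo-model:v1", 6)
print(json.dumps(job_spec, indent=2))
```

Saved to a file, a manifest like this can be submitted with `kubectl apply -f <file>`, since kubectl accepts JSON as well as YAML.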
Challenge #3: View Progress and Debug
Problem: View the status of a hyperparameter optimization job at a glance
Initial Solution:
• Save hostname and error information as metadata in calls to external API
• SSH into machines and view the logs directly (pre-Kubernetes)
• Use Kubernetes CLI to view logs
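For the Kubernetes CLI step, a minimal sketch of the kubectl commands involved is shown below. It relies on the `job-name` label that Kubernetes attaches to pods created by a Job; the job and pod names are hypothetical, and the commands are only constructed and printed.

```python
import shlex


def list_pods_command(job_name):
    """List the pods of a Job; `-o wide` also shows the node each pod runs on."""
    return ["kubectl", "get", "pods", "-l", f"job-name={job_name}", "-o", "wide"]


def logs_command(pod_name, tail=None):
    """Fetch logs for one pod, optionally limited to the last `tail` lines."""
    argv = ["kubectl", "logs", pod_name]
    if tail is not None:
        argv.append(f"--tail={tail}")
    return argv


# Hypothetical job and pod names.
pods_cmd = list_pods_command("hpo-job")
logs_cmd = logs_command("hpo-job-abcde", tail=50)
print(shlex.join(pods_cmd))
print(shlex.join(logs_cmd))
```

This still requires the data scientist to know pod names and Kubernetes concepts, which motivates the higher-level interface on the next slide.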
Solution #3: Build a CLI!
Problem: View the status of a hyperparameter optimization job at a glance
New Solution:
• Write an interface for the data scientist to interact with the infrastructure tool
• We chose a command line interface
• Serves as an abstraction on top of Kubernetes APIs + external APIs
• Screenshots (top and bottom)
○ sigopt logs <experiment_id>
○ sigopt status <experiment_id>
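The shape of such a command line interface can be sketched with Python's `argparse`. This is a hypothetical illustration, not SigOpt's implementation: the program name `hpo-cli` and the placeholder handler are assumptions, and a real tool would call the Kubernetes and external APIs where the handler only returns a description.

```python
import argparse


def build_parser():
    """Build a parser with `logs` and `status` subcommands."""
    parser = argparse.ArgumentParser(prog="hpo-cli")  # hypothetical name
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in (
        ("logs", "show logs for an experiment"),
        ("status", "show the status of an experiment"),
    ):
        sub = subparsers.add_parser(name, help=help_text)
        sub.add_argument("experiment_id")
    return parser


def describe(args):
    """Placeholder handler: a real CLI would query the cluster and APIs here."""
    return f"would fetch {args.command} for experiment {args.experiment_id}"


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(["status", "12345"])
print(describe(args))
```

The data scientist only supplies an experiment ID; mapping that ID to jobs, pods, and logs stays inside the tool.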
Final Thoughts...
• We're hiring!
• Connect with us
• Paper: "Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization," Alexandra Johnson and Michael McCourt