SigOpt. Confidential.
Reducing Operational Barriers
to Model Training
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
Alexandra Johnson
Software Engineer
Operational Barriers
Machine learning experts specialize in:
• Gathering data
• Building models
• Extracting insights
Infrastructure engineers specialize in:
• Building shared tools
• Application scalability and performance
• Keeping track of interactions between large
distributed systems
The Challenge:
• Machine learning experts want to maximize
the performance of their models
• SigOpt provides an API for hyperparameter
optimization (HPO)
• SigOpt HPO helps ML experts maximize the
performance of their models!
• ML experts need to use clusters to properly
perform HPO
Machine Learning Infrastructure
Model building workflow problems
+ Infrastructure / devops solutions
= Machine learning infrastructure
Case Study: Building SigOpt Orchestrate
• Project started in 2018 to bridge ML and infrastructure
• What problems did our customers ask us to solve?
• How did a challenge for the user turn into a technical problem?
• Which tools / technologies did we use?
Challenge #1: Can't Train Model on Laptop
Problem: Set up each remote machine
Initial Solution:
• Write a setup script to install dependencies
• SCP data, code, and setup script to every
remote machine
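
The manual copy step above can be sketched as follows. This is a minimal illustration, not the workflow's actual script: the host names, file names, and the `job` remote directory are hypothetical, and the sketch only builds the `scp` commands rather than running them.

```python
import shlex


def build_copy_commands(hosts, paths, remote_dir="job"):
    """Return one scp command (as an argument list) per remote host.

    remote_dir is a hypothetical directory, assumed to already exist
    under the remote user's home directory.
    """
    commands = []
    for host in hosts:
        # -r copies directories recursively; all paths go in one scp call.
        commands.append(["scp", "-r", *paths, f"{host}:{remote_dir}"])
    return commands


if __name__ == "__main__":
    # Hypothetical hosts and files, for illustration only.
    hosts = ["user@host-1", "user@host-2"]
    for cmd in build_copy_commands(hosts, ["data/", "train.py", "setup.sh"]):
        print(shlex.join(cmd))
```

Every new machine adds another copy and another run of the setup script, which is the overhead the next slide addresses.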
Solution #1: Containerize!
Problem: Set up each remote machine
New Solution:
• Containerize code and dependencies
on the user's local environment
• Push the image to a registry
• Each machine pulls the image from a
registry
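
The build, push, and pull steps above map onto three standard Docker commands. The sketch below only assembles them as argument lists; the registry host, image name, and tag are hypothetical, and the slides do not say which registry was used.

```python
def image_workflow(registry, name, tag, context="."):
    """Return the docker commands for the build / push / pull steps.

    registry, name, and tag are placeholders chosen for illustration.
    """
    image = f"{registry}/{name}:{tag}"
    return {
        # Run on the user's local environment.
        "build": ["docker", "build", "-t", image, context],
        "push": ["docker", "push", image],
        # Run on each remote machine.
        "pull": ["docker", "pull", image],
    }


if __name__ == "__main__":
    steps = image_workflow("registry.example.com", "my-model", "v1")
    for step, cmd in steps.items():
        print(step, "->", " ".join(cmd))
```

Because the dependencies are baked into the image, the remote machines no longer need a separate setup script.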
Challenge #2: Start Training in Parallel
Problem: Kick off the hyperparameter
optimization job on six machines at once
Initial Solution:
• Open a tmux window on every
remote instance
• Send the command to run the setup
script over SSH into each tmux window
• Send the command to train the model
over SSH into each tmux window
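
A rough sketch of the tmux-over-SSH approach above, assuming one detached tmux session per host. The session name and the two step commands are hypothetical; `tmux new-session -d -s` and `tmux send-keys` are standard tmux commands. The sketch builds the commands without executing them.

```python
import shlex


def tmux_commands(host, session, steps):
    """Build ssh commands that open a detached tmux session on `host`
    and type each step into it, followed by Enter."""
    name = shlex.quote(session)
    cmds = [["ssh", host, f"tmux new-session -d -s {name}"]]
    for step in steps:
        # send-keys types the quoted command into the session's window.
        remote = f"tmux send-keys -t {name} {shlex.quote(step)} Enter"
        cmds.append(["ssh", host, remote])
    return cmds


if __name__ == "__main__":
    # Hypothetical hosts; the slide describes six machines.
    hosts = [f"user@host-{i}" for i in range(1, 7)]
    for host in hosts:
        for cmd in tmux_commands(host, "hpo", ["bash setup.sh", "python train.py"]):
            print(cmd)
```

With six machines and three commands each, this is eighteen SSH invocations to start one job.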
Solution #2: Kubernetes!
Problem: Kick off the hyperparameter
optimization job on six machines at once
New Solution:
• Spin up AWS EKS (Kubernetes) cluster
• Create a job spec
• "run 6 copies of this container at the
same time"
• Submit job spec to Kubernetes API
• Kubernetes starts the job on the cluster
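
As an illustration of "run 6 copies of this container at the same time", here is a minimal Kubernetes Job manifest built as a Python dictionary. This is a sketch rather than the spec Orchestrate generates: the job name and image are hypothetical, while `parallelism`, `completions`, and `restartPolicy` are standard fields of a `batch/v1` Job.

```python
import json


def make_job_spec(name, image, copies=6):
    """Return a batch/v1 Job manifest that runs `copies` pods in parallel."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": copies,   # pods allowed to run at the same time
            "completions": copies,   # successful pods needed to finish the job
            "template": {
                "spec": {
                    "containers": [{"name": name, "image": image}],
                    # Jobs require Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job name and image.
    spec = make_job_spec("hpo-job", "registry.example.com/my-model:v1")
    print(json.dumps(spec, indent=2))
```

Saved to a file, the JSON output can be submitted with `kubectl apply -f`, since the Kubernetes API accepts JSON manifests as well as YAML.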
Challenge #3: View Progress and Debug
Problem: View the status of a hyperparameter
optimization job at a glance
Initial Solution:
• Save hostname and error information as
metadata in calls to external API
• SSH into machines and view the logs
directly (pre-Kubernetes)
• Use Kubernetes CLI to view logs
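
The first bullet above, attaching hostname and error information as metadata, might look like the sketch below. The payload shape (`value`, `failed`, `metadata`) is an assumption made for illustration and is not taken from the slides; the sketch builds the payload and leaves the actual API call out.

```python
import socket


def run_with_metadata(train_fn):
    """Run one training function and return a payload that records
    which machine ran it and, on failure, what went wrong.

    The payload keys are hypothetical, not a documented API schema.
    """
    metadata = {"hostname": socket.gethostname()}
    try:
        value = train_fn()
    except Exception as exc:
        # Keep a short error summary next to the hostname.
        metadata["error"] = f"{type(exc).__name__}: {exc}"
        return {"failed": True, "metadata": metadata}
    return {"value": value, "failed": False, "metadata": metadata}


if __name__ == "__main__":
    def broken_training():
        raise ValueError("loss became NaN")

    print(run_with_metadata(lambda: 0.93))
    print(run_with_metadata(broken_training))
```

This records which machine failed, but reading the full logs still means going to that machine or to the Kubernetes CLI.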
Solution #3: Build a CLI!
Problem: View the status of a hyperparameter
optimization job at a glance
New Solution:
• Write an interface for the data scientist to
interact with the infrastructure tool
• We chose a command line interface
• Serves as an abstraction on top of
Kubernetes APIs + external APIs
• Example commands (screenshots in the original slides):
○ sigopt logs <experiment_id>
○ sigopt status <experiment_id>
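
A minimal sketch of a command line interface with the two subcommands named above, using Python's standard `argparse`. This is not the actual SigOpt CLI: the help text and the returned message are placeholders, and a real tool would query the Kubernetes API and external APIs where the comment indicates.

```python
import argparse


def build_parser():
    """Parser with `logs` and `status` subcommands, each taking an experiment id."""
    parser = argparse.ArgumentParser(prog="sigopt")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in [
        ("logs", "show logs for an experiment"),
        ("status", "show the status of an experiment"),
    ]:
        command = sub.add_parser(name, help=help_text)
        command.add_argument("experiment_id")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: a real tool would call Kubernetes and external APIs here.
    return f"{args.command} requested for experiment {args.experiment_id}"


if __name__ == "__main__":
    print(main())
```

The data scientist sees only the experiment ID; which cluster, pod, or API holds the answer stays behind the interface.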
Final Thoughts...
• We're hiring!
• Connect with us
• Paper: "Orchestrate: Infrastructure for Enabling Parallelism
during Hyperparameter Optimization," Alexandra Johnson
and Michael McCourt
Thank you!
Any questions?
Alexandra Johnson
alexandra@sigopt.com @alexandraj777
