Alexandra Johnson: Reducing Operational Barriers to Model Training

Reducing Operational Barriers to Model Training

Published in: Technology

  1. SigOpt. Confidential. Reducing Operational Barriers to Model Training. Alexandra Johnson, alexandra@sigopt.com, @alexandraj777
  2. Alexandra Johnson, Software Engineer
  3. Operational Barriers
     Machine learning experts specialize in:
       • Gathering data
       • Building models
       • Extracting insights
     Infrastructure engineers specialize in:
       • Building shared tools
       • Application scalability and performance
       • Keeping track of interactions between large distributed systems
     The Challenge:
       • Machine learning experts want to maximize the performance of their models
       • SigOpt provides an API for hyperparameter optimization (HPO)
       • SigOpt HPO helps ML experts maximize the performance of their models!
       • ML experts need to use clusters to properly perform HPO
  4. Machine Learning Infrastructure
     Model building workflow problems + Infrastructure / devops solutions = Machine learning infrastructure
  5. Case Study: Building SigOpt Orchestrate
       • Project started in 2018 to bridge ML and infrastructure
       • What problems did our customers ask us to solve?
       • How did a challenge for the user turn into a technical problem?
       • Which tools / technologies did we use?
  6. Challenge #1: Can't Train Model on Laptop
     Problem: Set up each remote machine
     Initial Solution:
       • Write a setup script to install dependencies
       • SCP data, code, and setup script to every remote machine
  7. Solution #1: Containerize!
     Problem: Set up each remote machine
     New Solution:
       • Containerize code and dependencies on the user's local environment
       • Push the image to a registry
       • Each machine pulls the image from the registry
     (Diagram label: Registry)
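The build, push, and pull steps on slide 7 can be sketched as the three `docker` commands involved. This is a minimal illustration, not SigOpt's implementation: the function name `container_workflow`, the image name `hpo-model`, and the registry host `registry.example.com/team` are all hypothetical. The code only assembles the command lines and does not run Docker.

```python
import shlex


def container_workflow(image_name, tag, registry):
    """Return the docker commands for build -> push -> pull as argument lists.

    All three commands share one image reference, which is what lets every
    remote machine pull exactly what was built locally.
    """
    ref = f"{registry}/{image_name}:{tag}"
    return {
        # Run on the user's machine, in the directory holding the Dockerfile.
        "build": ["docker", "build", "-t", ref, "."],
        # Run on the user's machine after the build succeeds.
        "push": ["docker", "push", ref],
        # Run on each remote machine instead of a setup script.
        "pull": ["docker", "pull", ref],
    }


if __name__ == "__main__":
    # Hypothetical names, chosen only for illustration.
    commands = container_workflow("hpo-model", "v1", "registry.example.com/team")
    for step, argv in commands.items():
        print(f"{step}: {shlex.join(argv)}")
```

Compared with the initial solution, nothing is copied to the remote machines by hand: the registry is the only thing they need to reach.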
  8. Challenge #2: Start Training in Parallel
     Problem: Kick off the hyperparameter optimization job on six machines at once
     Initial Solution:
       • Open a tmux window on every remote instance
       • Send the command to run the setup script over SSH into each tmux window
       • Send the command to train the model over SSH into each tmux window
  9. Solution #2: Kubernetes!
     Problem: Kick off the hyperparameter optimization job on six machines at once
     New Solution:
       • Spin up AWS EKS (Kubernetes) cluster
       • Create a job spec
           ○ "run 6 copies of this container at the same time"
       • Submit job spec to Kubernetes API
       • Kubernetes starts the job on the cluster
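A job spec that says "run 6 copies of this container at the same time" might look roughly like the manifest built below. This is a sketch under assumptions, not the spec Orchestrate generates: the job name, container name, and image reference are hypothetical, and setting `completions` equal to `parallelism` is one possible choice among several. The field names themselves (`parallelism`, `completions`, `restartPolicy`) come from the Kubernetes `batch/v1` Job API.

```python
import json


def make_job_manifest(name, image, copies=6):
    """Build a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Up to `copies` pods run at the same time.
            "parallelism": copies,
            # Assumption: the job is done once `copies` pods have succeeded.
            "completions": copies,
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    # Job pod templates must use Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical job and image names, for illustration only.
    manifest = make_job_manifest("hpo-job", "registry.example.com/team/hpo-model:v1")
    print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be saved to a file and submitted with `kubectl apply -f`.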
  10. Challenge #3: View Progress and Debug
      Problem: View the status of a hyperparameter optimization job at a glance
      Initial Solution:
        • Save hostname and error information as metadata in calls to external API
        • SSH into machines and view the logs directly (pre-Kubernetes)
        • Use Kubernetes CLI to view logs
  11. Solution #3: Build a CLI!
      Problem: View the status of a hyperparameter optimization job at a glance
      New Solution:
        • Write an interface for the data scientist to interact with the infrastructure tool
        • We chose a command line interface
        • Serves as an abstraction on top of Kubernetes APIs + external APIs
        • Screenshots (top and bottom)
            ○ sigopt logs <experiment_id>
            ○ sigopt status <experiment_id>
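The idea of a CLI as an abstraction layer can be illustrated with a small `argparse` sketch that exposes `logs` and `status` subcommands, each taking an experiment ID. This is not the real `sigopt` CLI: the program name `hpo-sketch`, the `InMemoryBackend` class, and the output format are all hypothetical, and the backend simply returns canned data where a real tool would call the Kubernetes and external APIs.

```python
import argparse


class InMemoryBackend:
    """Hypothetical stand-in for the Kubernetes and external API calls."""

    def __init__(self, statuses, logs):
        self._statuses = statuses
        self._logs = logs

    def status(self, experiment_id):
        return self._statuses.get(experiment_id, "unknown")

    def logs(self, experiment_id):
        return self._logs.get(experiment_id, [])


def build_parser():
    parser = argparse.ArgumentParser(prog="hpo-sketch")
    subcommands = parser.add_subparsers(dest="command", required=True)
    # Both subcommands take the same single positional argument.
    for name in ("logs", "status"):
        subcommands.add_parser(name).add_argument("experiment_id")
    return parser


def run(argv, backend):
    """Parse argv and return the text the CLI would print."""
    args = build_parser().parse_args(argv)
    if args.command == "status":
        return f"experiment {args.experiment_id}: {backend.status(args.experiment_id)}"
    return "\n".join(backend.logs(args.experiment_id))


if __name__ == "__main__":
    # Canned data, for illustration only.
    backend = InMemoryBackend(
        statuses={"42": "running"},
        logs={"42": ["worker-0: started", "worker-1: started"]},
    )
    print(run(["status", "42"], backend))
    print(run(["logs", "42"], backend))
```

The data scientist only ever sees experiment IDs; which pods, hosts, or API calls sit behind them stays inside the backend.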
  12. Final Thoughts...
        • We're hiring!
        • Connect with us
        • Paper: Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization, Alexandra Johnson and Michael McCourt
  13. Thank you! Any questions? Alexandra Johnson, alexandra@sigopt.com, @alexandraj777