ML Infrastructure
alexandra@sigopt.com | @alexandraj777
About Me
● Alum of Carnegie Mellon SCS
● Joined SigOpt in 2015
● Tech lead for the Platform Team, handling
frontend, backend, infrastructure and
testing
● Recent project in ML Infrastructure: SigOpt
Orchestrate
● Co-organizer for Bay Area chapter of
Women in Machine Learning and Data
Science (join us!)
ML Infrastructure
solves data scientists'
problems using
infrastructure tools
Challenge:
● Data scientists want to maximize the
performance of their models
● SigOpt provides an API for
hyperparameter optimization (HPO)
● SigOpt HPO helps data scientists
maximize the performance of their models!
● Data scientists need to use clusters to
properly perform HPO
Machine Learning Infrastructure
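To make the HPO loop concrete, here is a minimal sketch of the suggest, train, report cycle. It uses plain random search as a stand-in; it is not the SigOpt API, and `evaluate_model`, `suggest`, the parameter names, and their ranges are all hypothetical.

```python
import random


def evaluate_model(params):
    # Hypothetical stand-in for "train the model and return a validation metric".
    # A real job would train on data; this toy score peaks at lr=0.1, depth=6.
    lr = params["learning_rate"]
    depth = params["max_depth"]
    return -((lr - 0.1) ** 2) - 0.01 * (depth - 6) ** 2


def suggest(rng):
    # Random search: draw each hyperparameter from an assumed range.
    return {
        "learning_rate": rng.uniform(0.001, 0.5),
        "max_depth": rng.randint(2, 12),
    }


def optimize(budget, seed=0):
    # Repeat suggest -> evaluate -> record, keeping the best result seen so far.
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(budget):
        params = suggest(rng)
        value = evaluate_model(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value
```

Each call to `evaluate_model` is independent of the others, which is why the loop benefits from running on several machines at once.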
Data scientists specialize in:
● Gathering data
● Building models
● Extracting business insights
Infrastructure engineers specialize in:
● Building shared tools
● Application scalability and performance
● Keeping track of interactions between
large distributed systems
Case Study: Building SigOpt Orchestrate
● Project started in 2018 to bridge ML and
infrastructure
● What problems did our customers ask us to
solve?
● How did a challenge for the user turn into a
technical problem?
● Which tools / technologies did we use?
Challenge #1: Can't Train Model on Laptop
Problem: Set up each remote machine
Initial Solution:
● Write a setup script to install
dependencies
● SCP data, code, and setup script to every
remote machine
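A rough sketch of what the initial approach amounts to: one `scp` invocation per remote machine. The host names, user, file names, and destination path are hypothetical; the function only builds the commands and does not run them.

```python
def copy_commands(hosts, files, user="ubuntu", dest="~/job/"):
    # One scp command per host; -r lets directories (e.g. a data folder) be copied too.
    return [["scp", "-r", *files, f"{user}@{host}:{dest}"] for host in hosts]


# Hypothetical hosts and files, for illustration only.
commands = copy_commands(
    hosts=["host-1", "host-2"],
    files=["data", "train.py", "setup.sh"],
)
```

Every new machine adds another copy step and another run of the setup script, which is the repetition that containerizing removes.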
Solution #1: Containerize!
Problem: Set up each remote machine
New Solution:
● Containerize code and dependencies on
the user's local environment
● Push the container to a registry
● Each machine pulls the container from a
registry
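The build, push, pull flow can be sketched as follows, assuming Docker as the container tool (the slides do not name one). The base image, file names, and registry address are hypothetical; the functions only generate the Dockerfile text and the commands.

```python
def dockerfile_text(base_image="python:3.10-slim", entrypoint="train.py"):
    # Package code and dependencies together; file names are assumptions.
    lines = [
        f"FROM {base_image}",
        "WORKDIR /app",
        "COPY requirements.txt .",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",
        f'CMD ["python", "{entrypoint}"]',
    ]
    return "\n".join(lines) + "\n"


def container_commands(image):
    # build and push run on the user's machine; pull runs on each remote machine.
    return {
        "build": ["docker", "build", "-t", image, "."],
        "push": ["docker", "push", image],
        "pull": ["docker", "pull", image],
    }


# Hypothetical registry and image name.
commands = container_commands("registry.example.com/team/model:v1")
```

The setup work now happens once, at build time, instead of once per machine.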
Challenge #2: Start Training in Parallel
Problem: Kick off the hyperparameter
optimization job on six machines at once
Initial Solution:
● Open a tmux window on every remote
instance
● Send the command to run the setup script
over SSH into each tmux window
● Send the command to train the model
over SSH into each tmux window
Solution #2: Kubernetes!
Problem: Kick off the hyperparameter
optimization job on six machines at once
New Solution:
● Spin up AWS EKS (Kubernetes) cluster
● Create a job spec
○ "run 6 copies of this container at the same
time"
● Submit job spec to Kubernetes API
● Kubernetes starts the job on the cluster
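A job spec that says "run 6 copies of this container at the same time" might look like the sketch below, built as a Python dictionary and serialized to JSON. The `batch/v1` Job fields are standard Kubernetes; the job name, container name, and image are hypothetical, and this is not necessarily the spec Orchestrate generates.

```python
import json


def job_manifest(name, image, copies=6):
    # parallelism: how many pods run at once; completions: how many must finish.
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": copies,
            "completions": copies,
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical job name and image.
manifest = job_manifest("hpo-job", "registry.example.com/team/model:v1")
manifest_json = json.dumps(manifest, indent=2)
```

Saved to a file, the JSON can be submitted with `kubectl apply -f <file>`; Kubernetes then schedules the six pods across the cluster.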
Challenge #3: View Progress and Debug
Problem: View the status of a hyperparameter
optimization job at a glance
Initial Solution:
● Save hostname and error information as
metadata in calls to external API
● SSH into machines and view the logs
directly (pre-Kubernetes)
● Use Kubernetes CLI to view logs
Solution #3: Build a CLI!
Problem: View the status of a hyperparameter
optimization job at a glance
New Solution:
● Write an interface for the data scientist to
interact with the infrastructure tool
● We chose a command line interface
● Serves as an abstraction on top of
Kubernetes APIs + external APIs
● Screenshots (top and bottom)
○ sigopt logs <experiment_id>
○ sigopt status <experiment_id>
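The shape of such a CLI can be sketched with Python's `argparse`. Only the two command names and the `experiment_id` argument come from the slide; the handler is a placeholder, and this is not the actual Orchestrate implementation.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="sigopt")
    subparsers = parser.add_subparsers(dest="command", required=True)
    commands = (
        ("logs", "show logs for an experiment"),
        ("status", "show the status of an experiment"),
    )
    for name, help_text in commands:
        sub = subparsers.add_parser(name, help=help_text)
        sub.add_argument("experiment_id")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: a real tool would query the Kubernetes API and the
    # external API here, then format the combined result for the user.
    return f"{args.command} requested for experiment {args.experiment_id}"
```

The data scientist only needs the experiment ID; the cluster details stay behind the command.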
Final Thoughts...
Paper: Orchestrate: Infrastructure for Enabling
Parallelism during Hyperparameter Optimization,
Alexandra Johnson and Michael McCourt
SigOpt is free for academics!
We're hiring research engineers/interns and
software engineers/interns!
Thank You! Questions?

More Related Content

What's hot

Multitenant SaaS Apps In Rails By Iqbal Hasnan
Multitenant SaaS Apps In Rails By Iqbal HasnanMultitenant SaaS Apps In Rails By Iqbal Hasnan
Multitenant SaaS Apps In Rails By Iqbal Hasnan
iqbal hasnan
 
Trailblazer Rails Architecture
Trailblazer Rails ArchitectureTrailblazer Rails Architecture
Trailblazer Rails Architecture
iqbal hasnan
 
Next Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessNext Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessDavid Ko
 
Rethinking HTTP Apps using Ratpack
Rethinking HTTP Apps using RatpackRethinking HTTP Apps using Ratpack
Rethinking HTTP Apps using Ratpack
Naresha K
 
So you want to write a cloud function
So you want to write a cloud functionSo you want to write a cloud function
So you want to write a cloud function
Elad Hirsch
 
Deployment Strategies
Deployment StrategiesDeployment Strategies
Deployment Strategies
Piotr Perzyna
 
Charles_Qian_Resume
Charles_Qian_ResumeCharles_Qian_Resume
Charles_Qian_ResumeCharles Qian
 
11 CLI tools every developer should know | DevNation Tech Talk
11 CLI tools every developer should know | DevNation Tech Talk11 CLI tools every developer should know | DevNation Tech Talk
11 CLI tools every developer should know | DevNation Tech Talk
Red Hat Developers
 
Gophercon 2018: Kubernetes api golang
Gophercon 2018: Kubernetes api golangGophercon 2018: Kubernetes api golang
Gophercon 2018: Kubernetes api golang
Vishal Biyani
 
From zero to test in 60 seconds
From zero to test in 60  secondsFrom zero to test in 60  seconds
From zero to test in 60 seconds
Hugh McCamphill
 
Ray distributed python framework
Ray distributed python framework Ray distributed python framework
Ray distributed python framework
AryanJadon3
 
Concurrency in Swift
Concurrency in SwiftConcurrency in Swift
Concurrency in Swift
Seven Peaks Speaks
 
Secret Deployment Events API features for mabl
Secret Deployment Events API features for mablSecret Deployment Events API features for mabl
Secret Deployment Events API features for mabl
Matthew Stein
 
Event driven-arch
Event driven-archEvent driven-arch
Event driven-arch
Mohammed Shoaib
 
Aoyagi Lab Colloquium - 2015-06-01
Aoyagi Lab Colloquium - 2015-06-01Aoyagi Lab Colloquium - 2015-06-01
Aoyagi Lab Colloquium - 2015-06-01Michele Bianchi
 
Prometheus - Utah Software Architecture Meetup - Clint Checketts
Prometheus - Utah Software Architecture Meetup - Clint CheckettsPrometheus - Utah Software Architecture Meetup - Clint Checketts
Prometheus - Utah Software Architecture Meetup - Clint Checketts
clintchecketts
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegasPeter Mounce
 
Resume_Akash_Mehta_Mechanical_Software
Resume_Akash_Mehta_Mechanical_SoftwareResume_Akash_Mehta_Mechanical_Software
Resume_Akash_Mehta_Mechanical_SoftwareAkash Mehta
 
Spark Pitfalls meetup UnderscoreIL
Spark Pitfalls meetup UnderscoreILSpark Pitfalls meetup UnderscoreIL
Spark Pitfalls meetup UnderscoreIL
lioron22
 

What's hot (20)

Multitenant SaaS Apps In Rails By Iqbal Hasnan
Multitenant SaaS Apps In Rails By Iqbal HasnanMultitenant SaaS Apps In Rails By Iqbal Hasnan
Multitenant SaaS Apps In Rails By Iqbal Hasnan
 
Trailblazer Rails Architecture
Trailblazer Rails ArchitectureTrailblazer Rails Architecture
Trailblazer Rails Architecture
 
Next Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus WirelessNext Generation Automation in Ruckus Wireless
Next Generation Automation in Ruckus Wireless
 
Rethinking HTTP Apps using Ratpack
Rethinking HTTP Apps using RatpackRethinking HTTP Apps using Ratpack
Rethinking HTTP Apps using Ratpack
 
So you want to write a cloud function
So you want to write a cloud functionSo you want to write a cloud function
So you want to write a cloud function
 
Deployment Strategies
Deployment StrategiesDeployment Strategies
Deployment Strategies
 
Charles_Qian_Resume
Charles_Qian_ResumeCharles_Qian_Resume
Charles_Qian_Resume
 
11 CLI tools every developer should know | DevNation Tech Talk
11 CLI tools every developer should know | DevNation Tech Talk11 CLI tools every developer should know | DevNation Tech Talk
11 CLI tools every developer should know | DevNation Tech Talk
 
Gophercon 2018: Kubernetes api golang
Gophercon 2018: Kubernetes api golangGophercon 2018: Kubernetes api golang
Gophercon 2018: Kubernetes api golang
 
From zero to test in 60 seconds
From zero to test in 60  secondsFrom zero to test in 60  seconds
From zero to test in 60 seconds
 
Ray distributed python framework
Ray distributed python framework Ray distributed python framework
Ray distributed python framework
 
Concurrency in Swift
Concurrency in SwiftConcurrency in Swift
Concurrency in Swift
 
Migrator.net
Migrator.net Migrator.net
Migrator.net
 
Secret Deployment Events API features for mabl
Secret Deployment Events API features for mablSecret Deployment Events API features for mabl
Secret Deployment Events API features for mabl
 
Event driven-arch
Event driven-archEvent driven-arch
Event driven-arch
 
Aoyagi Lab Colloquium - 2015-06-01
Aoyagi Lab Colloquium - 2015-06-01Aoyagi Lab Colloquium - 2015-06-01
Aoyagi Lab Colloquium - 2015-06-01
 
Prometheus - Utah Software Architecture Meetup - Clint Checketts
Prometheus - Utah Software Architecture Meetup - Clint CheckettsPrometheus - Utah Software Architecture Meetup - Clint Checketts
Prometheus - Utah Software Architecture Meetup - Clint Checketts
 
Aws uk ug #8 not everything that happens in vegas stay in vegas
Aws uk ug #8   not everything that happens in vegas stay in vegasAws uk ug #8   not everything that happens in vegas stay in vegas
Aws uk ug #8 not everything that happens in vegas stay in vegas
 
Resume_Akash_Mehta_Mechanical_Software
Resume_Akash_Mehta_Mechanical_SoftwareResume_Akash_Mehta_Mechanical_Software
Resume_Akash_Mehta_Mechanical_Software
 
Spark Pitfalls meetup UnderscoreIL
Spark Pitfalls meetup UnderscoreILSpark Pitfalls meetup UnderscoreIL
Spark Pitfalls meetup UnderscoreIL
 

Similar to Machine Learning Infrastructure

SigOpt at MLconf - Reducing Operational Barriers to Model Training
SigOpt at MLconf - Reducing Operational Barriers to Model TrainingSigOpt at MLconf - Reducing Operational Barriers to Model Training
SigOpt at MLconf - Reducing Operational Barriers to Model Training
SigOpt
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
SigOpt
 
Cloud operations with streaming analytics using big data tools
Cloud operations with streaming analytics using big data toolsCloud operations with streaming analytics using big data tools
Cloud operations with streaming analytics using big data tools
Miguel Pérez Colino
 
Path to continuous delivery
Path to continuous deliveryPath to continuous delivery
Path to continuous delivery
Anirudh Bhatnagar
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
DataWorks Summit
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
vitm11
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Embarcados
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
Fei Chen
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
doppenhe
 
Py data scikit-production
Py data scikit-productionPy data scikit-production
Py data scikit-productionTuri, Inc.
 
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
StormForge .io
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
Seldon
 
Heroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success storyHeroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success story
Jérémy Wimsingues
 
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Codemotion
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
PyData
 
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Codemotion
 
Large scale virtual Machine log collector (Project-Report)
Large scale virtual Machine log collector (Project-Report)Large scale virtual Machine log collector (Project-Report)
Large scale virtual Machine log collector (Project-Report)Gaurav Bhardwaj
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
Tulipp. Eu
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Iulian Pintoiu
 

Similar to Machine Learning Infrastructure (20)

SigOpt at MLconf - Reducing Operational Barriers to Model Training
SigOpt at MLconf - Reducing Operational Barriers to Model TrainingSigOpt at MLconf - Reducing Operational Barriers to Model Training
SigOpt at MLconf - Reducing Operational Barriers to Model Training
 
Machine Learning Infrastructure
Machine Learning InfrastructureMachine Learning Infrastructure
Machine Learning Infrastructure
 
Cloud operations with streaming analytics using big data tools
Cloud operations with streaming analytics using big data toolsCloud operations with streaming analytics using big data tools
Cloud operations with streaming analytics using big data tools
 
Path to continuous delivery
Path to continuous deliveryPath to continuous delivery
Path to continuous delivery
 
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache FlinkCloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
Cloud Operations with Streaming Analytics using Apache NiFi and Apache Flink
 
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdfSlides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
Slides-Артем Коваль-Cloud-Native MLOps Framework - DataFest 2021.pdf
 
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
Webinar: Começando seus trabalhos com Machine Learning utilizando ferramentas...
 
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning InfrastructureML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
ML Platform Q1 Meetup: Airbnb's End-to-End Machine Learning Infrastructure
 
Deploying ML models in the enterprise
Deploying ML models in the enterpriseDeploying ML models in the enterprise
Deploying ML models in the enterprise
 
Py data scikit-production
Py data scikit-productionPy data scikit-production
Py data scikit-production
 
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
Your Testing Is Flawed: Introducing A New Open Source Tool For Accurate Kuber...
 
TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform TensorFlow 16: Building a Data Science Platform
TensorFlow 16: Building a Data Science Platform
 
Heroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success storyHeroku to Kubernetes & Gihub to Gitlab success story
Heroku to Kubernetes & Gihub to Gitlab success story
 
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Amsterdam 2018
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018 Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
Deep learning beyond the learning - Jörg Schad - Codemotion Rome 2018
 
Large scale virtual Machine log collector (Project-Report)
Large scale virtual Machine log collector (Project-Report)Large scale virtual Machine log collector (Project-Report)
Large scale virtual Machine log collector (Project-Report)
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019Productionizing Machine Learning - Bigdata meetup 5-06-2019
Productionizing Machine Learning - Bigdata meetup 5-06-2019
 

More from SigOpt

Optimizing BERT and Natural Language Models with SigOpt Experiment Management
Optimizing BERT and Natural Language Models with SigOpt Experiment ManagementOptimizing BERT and Natural Language Models with SigOpt Experiment Management
Optimizing BERT and Natural Language Models with SigOpt Experiment Management
SigOpt
 
Experiment Management for the Enterprise
Experiment Management for the EnterpriseExperiment Management for the Enterprise
Experiment Management for the Enterprise
SigOpt
 
Efficient NLP by Distilling BERT and Multimetric Optimization
Efficient NLP by Distilling BERT and Multimetric OptimizationEfficient NLP by Distilling BERT and Multimetric Optimization
Efficient NLP by Distilling BERT and Multimetric Optimization
SigOpt
 
Detecting COVID-19 Cases with Deep Learning
Detecting COVID-19 Cases with Deep LearningDetecting COVID-19 Cases with Deep Learning
Detecting COVID-19 Cases with Deep Learning
SigOpt
 
Metric Management: a SigOpt Applied Use Case
Metric Management: a SigOpt Applied Use CaseMetric Management: a SigOpt Applied Use Case
Metric Management: a SigOpt Applied Use Case
SigOpt
 
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric StrategyTuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
SigOpt
 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep Learning
SigOpt
 
Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1
SigOpt
 
Tuning Data Augmentation to Boost Model Performance
Tuning Data Augmentation to Boost Model PerformanceTuning Data Augmentation to Boost Model Performance
Tuning Data Augmentation to Boost Model Performance
SigOpt
 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise Webinar
SigOpt
 
Modeling at Scale: SigOpt at TWIMLcon 2019
Modeling at Scale: SigOpt at TWIMLcon 2019Modeling at Scale: SigOpt at TWIMLcon 2019
Modeling at Scale: SigOpt at TWIMLcon 2019
SigOpt
 
Tuning 2.0: Advanced Optimization Techniques Webinar
Tuning 2.0: Advanced Optimization Techniques WebinarTuning 2.0: Advanced Optimization Techniques Webinar
Tuning 2.0: Advanced Optimization Techniques Webinar
SigOpt
 
SigOpt at Ai4 Finance—Modeling at Scale
SigOpt at Ai4 Finance—Modeling at Scale SigOpt at Ai4 Finance—Modeling at Scale
SigOpt at Ai4 Finance—Modeling at Scale
SigOpt
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
SigOpt
 
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
SigOpt
 
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling PlatformsSigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt
 
SigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the UntunableSigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the Untunable
SigOpt
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt
 
Lessons for an enterprise approach to modeling at scale
Lessons for an enterprise approach to modeling at scaleLessons for an enterprise approach to modeling at scale
Lessons for an enterprise approach to modeling at scale
SigOpt
 
Modeling at scale in systematic trading
Modeling at scale in systematic tradingModeling at scale in systematic trading
Modeling at scale in systematic trading
SigOpt
 

More from SigOpt (20)

Optimizing BERT and Natural Language Models with SigOpt Experiment Management
Optimizing BERT and Natural Language Models with SigOpt Experiment ManagementOptimizing BERT and Natural Language Models with SigOpt Experiment Management
Optimizing BERT and Natural Language Models with SigOpt Experiment Management
 
Experiment Management for the Enterprise
Experiment Management for the EnterpriseExperiment Management for the Enterprise
Experiment Management for the Enterprise
 
Efficient NLP by Distilling BERT and Multimetric Optimization
Efficient NLP by Distilling BERT and Multimetric OptimizationEfficient NLP by Distilling BERT and Multimetric Optimization
Efficient NLP by Distilling BERT and Multimetric Optimization
 
Detecting COVID-19 Cases with Deep Learning
Detecting COVID-19 Cases with Deep LearningDetecting COVID-19 Cases with Deep Learning
Detecting COVID-19 Cases with Deep Learning
 
Metric Management: a SigOpt Applied Use Case
Metric Management: a SigOpt Applied Use CaseMetric Management: a SigOpt Applied Use Case
Metric Management: a SigOpt Applied Use Case
 
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric StrategyTuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
Tuning for Systematic Trading: Talk 3: Training, Tuning, and Metric Strategy
 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep Learning
 
Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1Tuning for Systematic Trading: Talk 1
Tuning for Systematic Trading: Talk 1
 
Tuning Data Augmentation to Boost Model Performance
Tuning Data Augmentation to Boost Model PerformanceTuning Data Augmentation to Boost Model Performance
Tuning Data Augmentation to Boost Model Performance
 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise Webinar
 
Modeling at Scale: SigOpt at TWIMLcon 2019
Modeling at Scale: SigOpt at TWIMLcon 2019Modeling at Scale: SigOpt at TWIMLcon 2019
Modeling at Scale: SigOpt at TWIMLcon 2019
 
Tuning 2.0: Advanced Optimization Techniques Webinar
Tuning 2.0: Advanced Optimization Techniques WebinarTuning 2.0: Advanced Optimization Techniques Webinar
Tuning 2.0: Advanced Optimization Techniques Webinar
 
SigOpt at Ai4 Finance—Modeling at Scale
SigOpt at Ai4 Finance—Modeling at Scale SigOpt at Ai4 Finance—Modeling at Scale
SigOpt at Ai4 Finance—Modeling at Scale
 
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
Interactive Tradeoffs Between Competing Offline Metrics with Bayesian Optimiz...
 
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
SigOpt at Uber Science Symposium - Exploring the spectrum of black-box optimi...
 
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling PlatformsSigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
SigOpt at O'Reilly - Best Practices for Scaling Modeling Platforms
 
SigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the UntunableSigOpt at GTC - Tuning the Untunable
SigOpt at GTC - Tuning the Untunable
 
SigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimizationSigOpt at GTC - Reducing operational barriers to optimization
SigOpt at GTC - Reducing operational barriers to optimization
 
Lessons for an enterprise approach to modeling at scale
Lessons for an enterprise approach to modeling at scaleLessons for an enterprise approach to modeling at scale
Lessons for an enterprise approach to modeling at scale
 
Modeling at scale in systematic trading
Modeling at scale in systematic tradingModeling at scale in systematic trading
Modeling at scale in systematic trading
 

Recently uploaded

Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSETECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
DuvanRamosGarzon1
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
Machine Learning Infrastructure

  • 2. About Me
    ● Alum of Carnegie Mellon SCS
    ● Joined SigOpt in 2015
    ● Tech lead for the Platform Team, handling frontend, backend, infrastructure and testing
    ● Recent project in ML Infrastructure: SigOpt Orchestrate
    ● Co-organizer for Bay Area chapter of Women in Machine Learning and Data Science (join us!)
  • 3. ML Infrastructure solves data scientists' problems using infrastructure tools
  • 4. Challenge:
    ● Data scientists want to maximize the performance of their models
    ● SigOpt provides an API for hyperparameter optimization (HPO)
    ● SigOpt HPO helps data scientists maximize the performance of their models!
    ● Data scientists need to use clusters to properly perform HPO
    Machine Learning Infrastructure
  • 5. Challenge:
    ● Data scientists want to maximize the performance of their models
    ● SigOpt provides an API for hyperparameter optimization (HPO)
    ● SigOpt HPO helps data scientists maximize the performance of their models!
    ● Data scientists need to use clusters to properly perform HPO
    Machine Learning Infrastructure
    Data scientists specialize in:
    ● Gathering data
    ● Building models
    ● Extracting business insights
    Infrastructure engineers specialize in:
    ● Building shared tools
    ● Application scalability and performance
    ● Keeping track of interactions between large distributed systems
  • 6. Case Study: Building SigOpt Orchestrate
    ● Project started in 2018 to bridge ML and infrastructure
    ● What problems did our customers ask us to solve?
    ● How did a challenge for the user turn into a technical problem?
    ● Which tools / technologies did we use?
  • 7. Challenge #1: Can't Train Model on Laptop
    Problem: Set up each remote machine
    Initial Solution:
    ● Write a setup script to install dependencies
    ● SCP data, code, and setup script to every remote machine
  • 8. Solution #1: Containerize!
    Problem: Set up each remote machine
    New Solution:
    ● Containerize code and dependencies on the user's local environment
    ● Push the container to a registry
    ● Each machine pulls the container from a registry
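The build, push, and pull steps on this slide can be sketched as command lists. This is only an illustration: the slide does not name a container tool, so the use of the Docker CLI is an assumption, and the registry host, image name, and tag passed in are hypothetical placeholders.

```python
from typing import List


def image_reference(registry: str, name: str, tag: str) -> str:
    """Build the full image reference, e.g. registry/name:tag."""
    return f"{registry}/{name}:{tag}"


def local_commands(image: str, context: str = ".") -> List[List[str]]:
    """Commands run once on the user's machine: build, then push."""
    return [
        ["docker", "build", "-t", image, context],  # package code + dependencies
        ["docker", "push", image],                  # upload to the registry
    ]


def remote_commands(image: str) -> List[List[str]]:
    """Command each remote machine runs: pull the prebuilt image."""
    return [["docker", "pull", image]]
```

Each list could be passed to `subprocess.run(cmd, check=True)`. The point of the design is that setup happens once, locally, rather than once per remote machine.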
  • 9. Challenge #2: Start Training in Parallel
    Problem: Kick off the hyperparameter optimization job on six machines at once
    Initial Solution:
    ● Open a tmux window on every remote instance
    ● SSH over command to run setup script into each tmux window
    ● SSH over command to train model into each tmux window
  • 10. Solution #2: Kubernetes!
    Problem: Kick off the hyperparameter optimization job on six machines at once
    New Solution:
    ● Spin up AWS EKS (Kubernetes) cluster
    ● Create a job spec
      ○ "run 6 copies of this container at the same time"
    ● Submit job spec to Kubernetes API
    ● Kubernetes starts the job on the cluster
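A job spec of the kind quoted on this slide ("run 6 copies of this container at the same time") might look roughly like the manifest built below. This is a minimal sketch, not the spec Orchestrate actually generates: the job name, container name, and image are hypothetical, and setting `completions` equal to `parallelism` is an assumption about how the workers finish.

```python
import json
from typing import Any, Dict


def build_job_spec(name: str, image: str, copies: int = 6) -> Dict[str, Any]:
    """Return a Kubernetes batch/v1 Job manifest as a plain dict."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": copies,   # pods allowed to run at the same time
            "completions": copies,   # assumption: each copy finishes once
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    # Job pod templates must use Never or OnFailure
                    "restartPolicy": "Never",
                }
            },
        },
    }


def to_json(spec: Dict[str, Any]) -> str:
    """Serialize the manifest so it can be written to a file."""
    return json.dumps(spec, indent=2)
```

Writing the output of `to_json(...)` to a file such as `job.json` and running `kubectl apply -f job.json` is one way to submit it to the Kubernetes API.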
  • 11. Challenge #3: View Progress and Debug
    Problem: View the status of a hyperparameter optimization job at a glance
    Initial Solution:
    ● Save hostname and error information as metadata in calls to external API
    ● SSH into machines and view the logs directly (pre-Kubernetes)
    ● Use Kubernetes CLI to view logs
  • 12. Solution #3: Build a CLI!
    Problem: View the status of a hyperparameter optimization job at a glance
    New Solution:
    ● Write an interface for the data scientist to interact with the infrastructure tool
    ● We chose a command line interface
    ● Serves as an abstraction on top of Kubernetes APIs + external APIs
    ● Screenshots (top and bottom)
      ○ sigopt logs <experiment_id>
      ○ sigopt status <experiment_id>
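The shape of a CLI with the two commands named on this slide can be sketched with Python's standard `argparse` module. This is an illustrative skeleton only and not SigOpt's implementation: the handler is a placeholder that echoes the request, where a real tool would call the Kubernetes and external APIs.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser with `logs` and `status` subcommands, each taking an experiment id."""
    parser = argparse.ArgumentParser(prog="sigopt")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, help_text in (
        ("logs", "show logs for an experiment"),
        ("status", "show the status of an experiment"),
    ):
        sub = subparsers.add_parser(name, help=help_text)
        sub.add_argument("experiment_id")
    return parser


def main(argv: Optional[List[str]] = None) -> str:
    """Parse arguments and dispatch; the body here is a placeholder."""
    args = build_parser().parse_args(argv)
    # Placeholder: a real implementation would query the cluster here.
    message = f"{args.command} requested for experiment {args.experiment_id}"
    print(message)
    return message
```

Calling `main(["status", "1234"])` exercises the same path as typing `sigopt status 1234`, which keeps the data scientist's view independent of whatever runs underneath.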
  • 13. Final Thoughts...
    Paper: Orchestrate: Infrastructure for Enabling Parallelism during Hyperparameter Optimization, Alexandra Johnson and Michael McCourt
    SigOpt is free for academics!
    We're hiring research engineers/interns and software engineers/interns!