Advanced Optimization for the Enterprise Webinar

SigOpt. Conﬁdential.
Advanced Optimization for the Enterprise
Considerations and use cases
Scott Clark — Co-Founder and CEO, SigOpt
Tuesday, November 5, 2019

SigOpt. Conﬁdential. 2
Accelerate and amplify the
impact of modelers everywhere

Abstract
SigOpt provides an extensive set of advanced features,
which help you, the expert, save time while
increasing performance.
Today, we will be sharing some of the intuition behind
these features, while combining and applying them to
tackle real-world problems

How experimentation impacts your modeling
4
Notebook & Model Framework
Hardware Environment
Data
Preparation
Experimentation, Training, Evaluation
Model
Productionalization
Validation
Serving
Deploying
Monitoring
Managing
Inference
Online Testing
Transformation
Labeling
Pre-Processing
Pipeline Dev.
Feature Eng.
Feature Stores
On-Premise Hybrid Multi-Cloud
Experimentation & Model Optimization
Insights, Tracking,
Collaboration
Model Search,
Hyperparameter Tuning
Resource Scheduler,
Management
...and more

Motivation
1. How to solve a black box optimization problem
2. Why you should optimize using
multiple competing metrics
3. How to continuously and eﬃciently employ your
project’s dedicated compute infrastructure
4. How to tune models with expensive training costs

SigOpt Features
Enterprise
Platform
Optimization
Engine
Experiment
Insights
Reproducibility
Intuitive web dashboards
Cross-team permissions
and collaboration
Advanced experiment
visualizations
Usage insights
Parameter importance
analysis
Multimetric optimization
Continuous, categorical, or
integer parameters
Constraints and
failure regions
Up to 10k observations,
100 parameters
Multitask optimization and
high parallelism
Conditional
parameters
Infrastructure agnostic
REST API
Parallel Resource Scheduler
Black-Box Interface
Tunes without
accessing any data
Libraries for Python,
Java, R, and MATLAB
6

How to solve a
black box optimization problem1

Why black box optimization?
SigOpt was designed to empower you, the practitioner, to re-define most machine
learning problems as black box optimization problems with the added side benefits:
• Amplified performance — incremental gains in accuracy or other success metrics
• Productivity gains — a consistent platform across tasks that facilitates sharing
• Accelerated modeling — early elimination of non-scalable tasks
• Compute efficiency — continuous, full utilization of infrastructure
SigOpt uses an ensemble of Bayesian and Global Optimization methods to solve
these black box optimization problems.
8
Black Box Optimization

Your firewall
Training
Data
AI, ML, DL,
Simulation Model
Model Evaluation
or Backtest
Testing
Data
New
Configurations
Objective
Metric
Better
Results
EXPERIMENT INSIGHTS
Track, organize, analyze and
reproduce any model
ENTERPRISE PLATFORM
Built to fit any stack and scale
with your needs
OPTIMIZATION ENGINE
Explore and exploit with a
variety of techniques
RESTAPI
Configuration
Parameters or
Hyperparameters

EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
Built to ﬁt any stack and scale
with your needs
OPTIMIZATION ENGINE
Explore and exploit with a
variety of techniques
RESTAPI
Better
Results

A graphical depiction of the iterative process
11
Build a statistical model Build a statistical model
Choose the next point
to maximize the acquisition function
Choose the next point
to maximize the acquisition function

Gaussian processes: a powerful tool for modeling in spatial statistics
A standard tool for building statistical models is the Gaussian process
[Fasshauer et al, 2015, Fraizer, 2018].
• Assume that function values are jointly normally distributed.
• Apply prior beliefs about mean behavior and covariance between observations.
• Posterior beliefs about unobserved locations can be computed rather easily.
Diﬀerent prior assumptions produce diﬀerent statistical models:
12

Acquisition function: given a model, how should we choose the next point?
An acquisition function is a strategy for defining the utility of a future sample, given the current samples, while
balancing exploration and exploitation [Shahriari et al, 2016].
Different acquisition functions choose different points (EI, PI, KG, etc.).
13
Exploration:
Learning about the whole function f
Exploitation:
Further resolving regions where good f values
have already been observed

SigOpt Blog Posts: Intuition Behind Bayesian Optimization
Some Relevant Blog Posts
● Intuition Behind Covariance Kernels
● Approximation of Data
● Likelihood for Gaussian Processes
● Proﬁle Likelihood vs. Kriging Variance
● Intuition behind Gaussian Processes
● Dealing with Troublesome Metrics
Find more blog posts visit:
https://sigopt.com/blog/

Why you should optimize using
multiple competing metrics
2

Why optimize against multiple competing metrics?
SigOpt allows the user to specify multiple competing metrics for either optimization or tracking
to better align success of the modeling with business value and has the additional benefit of:
• Multiple metrics — The option to defining multiple metrics, which can yield new and
interesting results
• Insights, metric storage — Insights through tracking of optimized and unoptimized
metrics
• Thresholds — The ability to define thresholds for success to better guide the optimizer
We believe this process gives models that deliver more reliable business outcomes and which
are better tied to real world applications.
16
Multiple Competing Metrics

Your firewall
Training
Data
AI, ML, DL,
Simulation Models
Model Evaluation
or Backtest
Testing
Data
New Configurations
Objective
Metric
Better
Results
EXPERIMENT INSIGHTS
Track, organize, analyze and reproduce
any model
ENTERPRISE PLATFORM
Built to fit any stack and scale with your
needs
OPTIMIZATION ENGINE
Explore and exploit with a variety of
techniques
RESTAPI
Configuration
Parameters or
Hyperparameters

Better
Results
EXPERIMENT INSIGHTS
Track, organize, analyze and reproduce
any model
ENTERPRISE PLATFORM
Built to ﬁt any stack and scale with your
needs
OPTIMIZATION ENGINE
Explore and exploit with a variety of
techniques
RESTAPI
Multiple Optimized
and Unoptimized
Metrics

Balancing competing metrics to ﬁnd the Pareto frontier
Most problems of practical relevance involve 2 or more competing metrics.
• Neural networks — Balancing accuracy and inference time
• Materials design — Balancing performance and maintenance cost
• Control systems — Balancing performance and safety
In a situation with Competing Metrics, the set of all eﬃcient points (the Pareto frontier) is the solution.
19
Pareto
Frontier
Feasible
Region

Balancing competing metrics to find the Pareto frontier
As shown before, the goal in multi objective or multi criteria optimization the goal is to find
the optimal set of solution across a set of function [Knowles, 2006].
• This is formulated as finding the maximum of the set functions f1
to fn
over
the same domain x
• No single point exist as the solution, but we are actively trying to maximize the size of the
efficient frontier, which represent the set of solutions
• The solution is found through scalarization methods such as convex combination and
epsilon-constraint
20

Intuition: Convex Combination Scalarization
Idea: If we can convert the multimetric problem into a scalar problem, we can solve this problem using
Bayesian optimization.
One possible scalarization is through a convex combination of the objectives.
21

Balancing competing metrics to find the Pareto frontier with threshold
As shown before, the goal in multi objective or multi criteria optimization the goal is to find
the optimal set of solution across a set of function [Knowles, 2006].
• This is formulated as finding the maximum of the set functions f1
to fn
over the same
domain x
• No single point exist as the solution, but we are actively trying to maximize the size of the
efficient frontier, which represent the set of solutions
• The solution is found through constrained scalarization methods such as convex
combination and epsilon-constraint
• Allow users to change constraints as the search progresses [Letham et al, 2019]
22

Constrained Scalarization
1. Model all metrics independently.
• Requires no prior beliefs of how metrics interact.
• Missing data removed on a per metric basis if unrecorded.
2. Expose the efficient frontier through constrained scalar optimization.
• Enforce user constraints when given.
• Iterate through sub constraints to better resolve efficient frontier, if desired.
• Consider different regions of the frontier when parallelism is possible.
3. Allow users to change constraints as the search progresses.
• Allow the problems/goals to evolve as the user’s understanding changes.
Constraints give customers more control over the circumstances and more ability to understand our actions.
23
Variation on
Expected
Improvement
[Letham et al, 2019]

Intuition: Scalarization and Epsilon Constraints
24

Intuition: Constrained Scalarization and Epsilon Constraints
25

Multimetric Use Case 1
● Category: Time Series
● Task: Sequence Classification
● Model: CNN
● Data: Diatom Images
● Analysis: Accuracy-Time Tradeoff
● Result: Similar accuracy, 33% the inference time
Multimetric Use Case 2
● Category: NLP
● Task: Sentiment Analysis
● Model: CNN
● Data: Rotten Tomatoes Movie Reviews
● Analysis: Accuracy-Time Tradeoff
● Result: ~2% in accuracy versus 50% of training time
Learn more
https://devblogs.nvidia.com/sigopt-deep-learning-
hyperparameter-optimization/
Use Case: Balancing Speed & Accuracy in Deep Learning

Design: Question answering data and memory networks
Data Model
Sources:
Facebook AI Research (FAIR) bAbI dataset: https://research.fb.com/downloads/babi/
Sukhbaatar et al.: https://arxiv.org/abs/1503.08895

Comparison of Bayesian Optimization and Random Search
Setup: Hyperparameter Optimization
Standard Parameters Conditional Parameters

Result: Signiﬁcant boost in consistency, accuracy
Comparison across random search versus Bayesian optimization with conditionals

Result: Highly cost eﬃcient accuracy gains
Comparison across random search versus Bayesian optimization with conditionals
SigOpt is
18.5x as
efficient

How to continuously and eﬃciently
utilize your project’s allotted compute
infrastructure
3

Utilize compute by asynchronous parallel optimization
SigOpt natively handles Parallel Function Evaluation with the primary goal of
minimizing the Overall Wall-Clock Time. Parallelism also provides:
• Faster time-to-results — minimized overall wall-clock time
• Full resource utilization — asynchronous parallel optimization
• Scaling with infrastructure — optimize across the number of available compute resources
We believe this is essential to increase Research Productivity by lowering the time-to-results and
scaling with available infrastructure.
32
Continuously and eﬃciently utilize infrastructure

Your firewall
Training
Data
AI, ML, DL, Simulation
Model
Model Evaluation
or Backtest
Testing
Data
New
Configurations
Objective
Metric
Better
Results
EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
Built to fit any stack and scale with
your needs
OPTIMIZATION ENGINE
Explore and exploit with a variety
of techniques
RESTAPI
Configuration
Parameters or
Hyperparameters

Better
Results
EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
your needs
OPTIMIZATION ENGINE
of techniques
RESTAPI
Worker

Better
Results
EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
your needs
OPTIMIZATION ENGINE
of techniques
RESTAPI
Worker Worker
Worker Worker

Parallel function evaluations
36
Parallel function evaluations are a way of eﬃciently maximizing a function
while using all available compute resources [Ginsbourger et al, 2008, Garcia-Barcos et al. 2019].
• Choosing points by jointly maximizing criteria over the entire set
• Asynchronously evaluating over a collection of points
• Fixing points which are currently being evaluated while sampling new ones
1D - Acquisition Function 2D - Acquisition Function

Parallel optimization: multiple worker nodes jointly optimizes over a given function
37
Parallel bandwidth = 1 Parallel bandwidth = 2 Parallel bandwidth = 3
Parallel bandwidth = 4 Parallel bandwidth = 5
Next point(s) to
evaluate:
Parallel bandwidth
represent the # of
available compute
resources
Statistical Model

Parallelism Use Case
● Category: NLP
● Task: Sentiment Analysis
● Model: CNN
● Data: Rotten Tomatoes Movie Reviews
● Analysis: Predicting Positive vs. Negative Sentiment
● Result: 400x speedup
Learn more
https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuni
ng-with-aws-gpu-instances-and-sigopt/
Use Case: Fast CNN Tuning with AWS GPU Instances

Variables we tested in this experiment
Axes of exploration:
• 6 hyperparameters versus 10 parameters (including SGD and alternate architectures)
• CPU versus GPU compute
• Grid search versus random search versus Bayesian optimization
Results that we considered:
• Accuracy
• Compute cost
• Wall clock time

The parameters to tune
A deep dive

Results
Speed and accuracy
SigOpt helps you train your model faster
and achieve higher accuracy
This results in higher practitioner
productivity and better business
outcomes
While random and grid search for
hyperparameters do yield an accuracy
improvement, SigOpt achieves better
results on both dimensions

Results: 8x more cost eﬃcient performance boost
Detailed performance across diﬀerent optimization processes
Experiment Type Accuracy Trials Epochs CPU Time CPU Cost GPU Time GPU
Cost
Link Percent
Change
% Δ per
Comp $
Default (No Tuning) 75.70 1 50 2 hours $1.82 0 hours $0.04 NA 0 0
Grid Search (SGD
Only)
79.30 729 38394 64 days $1401.38 32 hours $27.58 here 4.60 0.13
Random Search (SGD
Only)
79.94 2400 127092 214 days $4638.86 106 hours $91.29 here 4.24 0.05
SigOpt Search (SGD
Only)
80.40 240 15803 27 days $576.81 13 hours $11.35 here 4.70 0.42
Grid Search (SGD +
Architecture)
Not
Feasible
59049 3109914 5255 days $113511.86 107 days $2233.95 NA NA NA
Random Search (SGD
+ Architecture)
80.12 4000 208729 353 days $7618.61 174 hours $149.94 here 4.42 0.03
SigOpt Search (SGD
+ Architecture)
81.00 400 30060 51 days $1097.19 25 hours $21.59 here 5.30 0.25
% Δ per Comp $ is calculated using GPU compute

Results: the experimentation loop
The training pipeline: from data to optimization and evaluation

How to tune models with
expensive training costs
4

How to efficiently minimize time to optimize any function
SigOpt’s multitask feature is an efficient way for modelers to tune model with an expensive training cost
with the benefit of:
• Faster time-to-market — The ability to bring expensive models into production faster
• Reduction in infrastructure cost — Intelligently leverage infrastructure while reducing cost
Through novel research SigOpt helps the user lower the overall time-to-market,
while reducing the overall compute budget.
45
Expensive Training Cost

Your firewall
Training
Data
AI, ML, DL, Simulation
Model
Model Evaluation or
Backtest
Testing
Data
New
Configurations
Objective
Metric
Better
Results
EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
your needs
OPTIMIZATION ENGINE
of techniques
RESTAPI
Configuration
Parameters or
Hyperparameters

Better
Results
EXPERIMENT INSIGHTS
reproduce any model
ENTERPRISE PLATFORM
your needs
OPTIMIZATION ENGINE
of techniques
RESTAPI

Using cheap or free information to speed learning
48
Sources:
Aaron Klein, Frank Hutter, et al.: https://arxiv.org/abs/1605.07079
SigOpt allows to the user to deﬁne lower-cost functions in order to quickly optimize expensive functions
• Cheaper-cost functions can be ﬂexible (fewer epochs, subsampled data, other custom features)
• Use cheaper tasks earlier in the tuning process to explore
• Inform more expensive tasks later by exploiting what we learn
• In the process, reduce the full time required to tune an expensive model

Using cheap or free information to speed learning
We can build better models using inaccurate data to help point the
actual optimization in the right direction with less cost.
• Using a warm start through multi-task learning logic [Swersky et al, 2014]
• Combining good anytime performance with active learning [Klein et al, 2018]
• Accepting data from multiple sources without priors [Poloczek et al, 2017]
49

Use Case: Image Classification on a Budget
Use Case
● Category: Computer Vision
● Task: Image Classification
● Model: CNN
● Data: Stanford Cars Dataset
● Analysis: Architecture Comparison
● Result: 2.4% accuracy gain with a much shallower
model
Learn more
https://mlconf.com/blog/insights-for-building-high-performing-
image-classification-models/

Architecture: Classifying images of cars using ResNet
51
Convolutions Classiﬁcation
ResNet
Input
Acura
TLX
2015
Output
Label
Sources:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: https://arxiv.org/abs/1512.03385

Training setup comparison
ImageNet Pretrained
Convolutional Layers
Fully Connected Layer
ImageNet Pretrained
Convolutional Layers
Fully Connected Layer
Input
Convolutional Features
Classification
Input
Convolutional Features
Classification
Fine Tuning Feature Extractor
Tuning
Tuning

Hyperparameter setup
53
Hyperparameter Lower Bound Upper Bound
Log Learning Rate 1.2e-4 1.0
Learning Rate Scheduler 0 0.99
Batch Size (powers of 2) 16 256
Nesterov False True
Log Weight Decay 1.2e-5 1.0
Momentum 0 0.9
Scheduler Step 1 20

SigOpt. Conﬁdential.54
Insight: Multitask eﬃciency at the hyperparameter level
Example: Learning rate accuracy and values by cost of task over time
Progression of observations over time Accuracy and value for each observation Parameter importance analysis

Fine-tuning the smaller
network signiﬁcantly
outperforms feature
extraction on a bigger
network
Results: Optimizing and tuning the full network outperforms
55
Multitask optimization
drives signiﬁcant
performance gains
+3.92%
+1.58%

Implication: Fine-tuning significantly outperforms
Cost Breakdown for Multitask Optimization
Cost efficiency Feature Extractor ResNet 50 Fine-Tuning ResNet 18
Hours per training 4.08 4.2
Observations 220 220
Number of Runs 1 1
Total compute hours 898 924
Cost per GPU-hour $0.90 $0.90
% Improvement 1.58% 3.92%
Total compute cost $808 $832
cost ($) per %
improvement $509 $212
Similar Compute Cost
Similar Wall-Clock Time
Fine-Tuning Significantly
More Efficient and Effective

Thank you

Advanced Optimization for the Enterprise Webinar

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Advanced Optimization for the Enterprise Webinar

Similar to Advanced Optimization for the Enterprise Webinar (20)

More from SigOpt

More from SigOpt (15)

Recently uploaded

Recently uploaded (20)

Advanced Optimization for the Enterprise Webinar