Advanced Optimization for the Enterprise Webinar

  1. SigOpt. Confidential. Advanced Optimization for the Enterprise: Considerations and use cases. Scott Clark — Co-Founder and CEO, SigOpt. Tuesday, November 5, 2019
  2. SigOpt. Confidential. 2 Accelerate and amplify the impact of modelers everywhere
  3. SigOpt. Confidential. 3 Abstract SigOpt provides an extensive set of advanced features that help you, the expert, save time while increasing performance. Today, we will share some of the intuition behind these features while combining and applying them to tackle real-world problems.
  4. SigOpt. Confidential. How experimentation impacts your modeling 4 Notebook & Model Framework Hardware Environment Data Preparation Experimentation, Training, Evaluation Model Productionalization Validation Serving Deploying Monitoring Managing Inference Online Testing Transformation Labeling Pre-Processing Pipeline Dev. Feature Eng. Feature Stores On-Premise Hybrid Multi-Cloud Experimentation & Model Optimization Insights, Tracking, Collaboration Model Search, Hyperparameter Tuning Resource Scheduler, Management ...and more
  5. SigOpt. Confidential. 5 Motivation 1. How to solve a black box optimization problem 2. Why you should optimize using multiple competing metrics 3. How to continuously and efficiently employ your project’s dedicated compute infrastructure 4. How to tune models with expensive training costs
  6. SigOpt. Confidential. SigOpt Features Enterprise Platform Optimization Engine Experiment Insights Reproducibility Intuitive web dashboards Cross-team permissions and collaboration Advanced experiment visualizations Usage insights Parameter importance analysis Multimetric optimization Continuous, categorical, or integer parameters Constraints and failure regions Up to 10k observations, 100 parameters Multitask optimization and high parallelism Conditional parameters Infrastructure agnostic REST API Parallel Resource Scheduler Black-Box Interface Tunes without accessing any data Libraries for Python, Java, R, and MATLAB 6
  7. SigOpt. Confidential. How to solve a black box optimization problem1
  8. SigOpt. Confidential. Why black box optimization? SigOpt was designed to empower you, the practitioner, to re-define most machine learning problems as black box optimization problems, with added benefits: • Amplified performance — incremental gains in accuracy or other success metrics • Productivity gains — a consistent platform across tasks that facilitates sharing • Accelerated modeling — early elimination of non-scalable tasks • Compute efficiency — continuous, full utilization of infrastructure SigOpt uses an ensemble of Bayesian and global optimization methods to solve these black box optimization problems. 8 Black Box Optimization
  9. SigOpt. Confidential. Your firewall Training Data AI, ML, DL, Simulation Model Model Evaluation or Backtest Testing Data New Configurations Objective Metric Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Configuration Parameters or Hyperparameters Black Box Optimization
  10. SigOpt. Confidential. EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Black Box Optimization Better Results
  11. SigOpt. Confidential. A graphical depiction of the iterative process 11 Black Box Optimization: build a statistical model, then choose the next point to maximize the acquisition function, and repeat.
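The loop above can be sketched in a few lines of code. The following is a minimal illustration, assuming the classic SigOpt Python client (the `sigopt` package) and an API token; `evaluate_model` is a hypothetical stand-in for your own training and evaluation code.

```python
# Minimal black box optimization loop (sketch): SigOpt proposes configurations,
# the model reports back a single objective metric, and the loop repeats.
from sigopt import Connection


def evaluate_model(assignments):
    """Hypothetical black box: train/evaluate a model with these hyperparameters."""
    return -((assignments["learning_rate"] - 0.01) ** 2)  # placeholder objective


conn = Connection(client_token="YOUR_API_TOKEN")  # placeholder token
experiment = conn.experiments().create(
    name="Black box optimization sketch",
    parameters=[
        dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1.0)),
        dict(name="hidden_units", type="int", bounds=dict(min=16, max=256)),
    ],
    metrics=[dict(name="objective", objective="maximize")],
    observation_budget=30,
)

for _ in range(experiment.observation_budget):
    suggestion = conn.experiments(experiment.id).suggestions().create()
    value = evaluate_model(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        value=value,
    )
```

Everything inside `evaluate_model` stays behind your firewall; only parameter values and the objective metric cross the REST API.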
  12. SigOpt. Confidential. Gaussian processes: a powerful tool for modeling in spatial statistics A standard tool for building statistical models is the Gaussian process [Fasshauer et al, 2015, Frazier, 2018]. • Assume that function values are jointly normally distributed. • Apply prior beliefs about mean behavior and covariance between observations. • Posterior beliefs about unobserved locations can be computed rather easily. Different prior assumptions produce different statistical models. 12 Black Box Optimization
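As a rough illustration of the "jointly normally distributed" assumption, here is a numpy-only sketch that conditions a Gaussian process prior (zero mean and a squared-exponential covariance, both assumptions of this example) on a few observations to obtain posterior beliefs at unobserved locations.

```python
# Gaussian process posterior inference, sketched with numpy only.
import numpy as np


def se_kernel(a, b, length_scale=0.5, variance=1.0):
    """Squared-exponential (RBF) prior covariance between 1D point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale**2)


# Observed data (the "current samples") and unobserved query locations.
x_obs = np.array([0.1, 0.4, 0.7])
y_obs = np.sin(2 * np.pi * x_obs)      # any black box values
x_new = np.linspace(0.0, 1.0, 100)

noise = 1e-6                            # jitter / observation noise
K = se_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
K_s = se_kernel(x_new, x_obs)

# Posterior beliefs about unobserved locations (standard multivariate-normal conditioning).
alpha = np.linalg.solve(K, y_obs)
post_mean = K_s @ alpha
post_var = se_kernel(x_new, x_new).diagonal() - np.einsum(
    "ij,ji->i", K_s, np.linalg.solve(K, K_s.T)
)
```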
  13. SigOpt. Confidential. Acquisition function: given a model, how should we choose the next point? An acquisition function is a strategy for defining the utility of a future sample, given the current samples, while balancing exploration and exploitation [Shahriari et al, 2016]. Different acquisition functions choose different points (EI, PI, KG, etc.). 13 Black Box Optimization Exploration: Learning about the whole function f Exploitation: Further resolving regions where good f values have already been observed
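For concreteness, here is a short sketch of one such criterion, Expected Improvement, computed from a GP posterior mean and standard deviation (maximization convention assumed; candidate points and incumbent value are placeholders).

```python
# Expected Improvement (EI): balances exploitation (high posterior mean)
# against exploration (high posterior uncertainty).
import numpy as np
from scipy.stats import norm


def expected_improvement(post_mean, post_std, best_observed):
    """EI for maximization: expected gain over the incumbent best value."""
    post_std = np.maximum(post_std, 1e-12)          # guard against zero variance
    z = (post_mean - best_observed) / post_std
    return (post_mean - best_observed) * norm.cdf(z) + post_std * norm.pdf(z)


# The next point to evaluate is the candidate that maximizes EI, e.g.:
# next_x = x_candidates[np.argmax(expected_improvement(mean, std, best))]
```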
  14. SigOpt Blog Posts: Intuition Behind Bayesian Optimization Some Relevant Blog Posts ● Intuition Behind Covariance Kernels ● Approximation of Data ● Likelihood for Gaussian Processes ● Profile Likelihood vs. Kriging Variance ● Intuition behind Gaussian Processes ● Dealing with Troublesome Metrics To find more blog posts, visit: https://sigopt.com/blog/
  15. SigOpt. Confidential. Why you should optimize using multiple competing metrics 2
  16. SigOpt. Confidential. Why optimize against multiple competing metrics? SigOpt allows the user to specify multiple competing metrics for either optimization or tracking to better align modeling success with business value, with the additional benefits of: • Multiple metrics — The option to define multiple metrics, which can yield new and interesting results • Insights, metric storage — Insights through tracking of optimized and unoptimized metrics • Thresholds — The ability to define thresholds for success to better guide the optimizer We believe this process yields models that deliver more reliable business outcomes and are better tied to real-world applications. 16 Multiple Competing Metrics
  17. SigOpt. Confidential. Your firewall Training Data AI, ML, DL, Simulation Models Model Evaluation or Backtest Testing Data New Configurations Objective Metric Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Configuration Parameters or Hyperparameters Multiple Competing Metrics
  18. SigOpt. Confidential. Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Multiple Competing Metrics Multiple Optimized and Unoptimized Metrics
  19. SigOpt. Confidential. Multiple Competing Metrics Balancing competing metrics to find the Pareto frontier Most problems of practical relevance involve 2 or more competing metrics. • Neural networks — Balancing accuracy and inference time • Materials design — Balancing performance and maintenance cost • Control systems — Balancing performance and safety In a situation with Competing Metrics, the set of all efficient points (the Pareto frontier) is the solution. 19 Pareto Frontier Feasible Region
  20. SigOpt. Confidential. Balancing competing metrics to find the Pareto frontier As shown before, the goal in multi-objective (multi-criteria) optimization is to find the optimal set of solutions across a set of functions [Knowles, 2006]. • This is formulated as finding the maximum of the set of functions f1 to fn over the same domain x • No single point exists as the solution; instead, we are actively trying to maximize the size of the efficient frontier, which represents the set of solutions • The solution is found through scalarization methods such as convex combination and epsilon-constraint 20 Multiple Competing Metrics
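A small sketch of the "efficient points" idea: given observed metric values (both maximized in this example), the Pareto frontier is the set of points not dominated by any other point.

```python
# Extract the efficient (Pareto) frontier from a set of observed metric values.
import numpy as np


def pareto_frontier(values):
    """values: (n_points, n_metrics) array, all metrics maximized.
    Returns a boolean mask of the non-dominated points."""
    n = values.shape[0]
    efficient = np.ones(n, dtype=bool)
    for i in range(n):
        others = np.delete(values, i, axis=0)
        # A point is dominated if some other point is at least as good on every
        # metric and strictly better on at least one.
        dominated = np.any(
            np.all(others >= values[i], axis=1) & np.any(others > values[i], axis=1)
        )
        efficient[i] = not dominated
    return efficient


points = np.array([[0.90, 0.2], [0.85, 0.5], [0.80, 0.4], [0.95, 0.1]])
print(points[pareto_frontier(points)])   # only the non-dominated points remain
```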
  21. SigOpt. Confidential. Multiple Competing Metrics Intuition: Convex Combination Scalarization Idea: If we can convert the multimetric problem into a scalar problem, we can solve this problem using Bayesian optimization. One possible scalarization is through a convex combination of the objectives. 21
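A minimal sketch of that idea: a convex weight turns two metrics into a single scalar objective that a standard (single-metric) Bayesian optimizer can maximize, and sweeping the weight traces out different parts of the frontier. The metric values below are made up for illustration.

```python
# Convex combination scalarization of two metrics (both assumed maximized).
def scalarize(metric_1, metric_2, lam):
    """Weighted combination with lam in [0, 1]."""
    return lam * metric_1 + (1.0 - lam) * metric_2


# Example: weigh accuracy against negated inference time.
accuracy, neg_inference_time = 0.91, -0.030
for lam in (0.25, 0.5, 0.75):
    print(lam, scalarize(accuracy, neg_inference_time, lam))
```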
  22. SigOpt. Confidential. Balancing competing metrics to find the Pareto frontier with thresholds As shown before, the goal in multi-objective (multi-criteria) optimization is to find the optimal set of solutions across a set of functions [Knowles, 2006]. • This is formulated as finding the maximum of the set of functions f1 to fn over the same domain x • No single point exists as the solution; instead, we are actively trying to maximize the size of the efficient frontier, which represents the set of solutions • The solution is found through constrained scalarization methods such as convex combination and epsilon-constraint • Allow users to change constraints as the search progresses [Letham et al, 2019] 22 Multiple Competing Metrics
  23. SigOpt. Confidential. Multiple Competing Metrics Constrained Scalarization 1. Model all metrics independently. • Requires no prior beliefs about how the metrics interact. • Missing data is removed on a per-metric basis if unrecorded. 2. Expose the efficient frontier through constrained scalar optimization. • Enforce user constraints when given. • Iterate through sub-constraints to better resolve the efficient frontier, if desired. • Consider different regions of the frontier when parallelism is possible. 3. Allow users to change constraints as the search progresses. • Allow the problems/goals to evolve as the user’s understanding changes. Constraints give customers more control over the circumstances and more ability to understand our actions. 23 Variation on Expected Improvement [Letham et al, 2019]
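Putting the pieces together, here is a hedged sketch of a multimetric experiment with metric thresholds, assuming the classic SigOpt Python client; field names such as `objective`, `threshold`, and the `values` list follow that API as described in SigOpt's documentation, and `train_and_time` is a hypothetical stand-in for your own code.

```python
# Multimetric experiment with thresholds (sketch, classic SigOpt client assumed).
from sigopt import Connection


def train_and_time(assignments):
    """Hypothetical: train the model, return (accuracy, inference time in ms)."""
    return 0.87, 42.0  # placeholder values


conn = Connection(client_token="YOUR_API_TOKEN")
experiment = conn.experiments().create(
    name="Accuracy vs. inference time",
    parameters=[
        dict(name="log_learning_rate", type="double", bounds=dict(min=-9.0, max=0.0)),
        dict(name="num_filters", type="int", bounds=dict(min=32, max=512)),
    ],
    metrics=[
        # Thresholds express "success" constraints that guide the optimizer.
        dict(name="accuracy", objective="maximize", threshold=0.80),
        dict(name="inference_time_ms", objective="minimize", threshold=50.0),
    ],
    observation_budget=100,
)

suggestion = conn.experiments(experiment.id).suggestions().create()
accuracy, inference_time_ms = train_and_time(suggestion.assignments)
conn.experiments(experiment.id).observations().create(
    suggestion=suggestion.id,
    values=[
        dict(name="accuracy", value=accuracy),
        dict(name="inference_time_ms", value=inference_time_ms),
    ],
)
```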
  24. SigOpt. Confidential. Intuition: Scalarization and Epsilon Constraints 24 Multiple Competing Metrics
  25. SigOpt. Confidential. Intuition: Constrained Scalarization and Epsilon Constraints 25 Multiple Competing Metrics
  26. Multimetric Use Case 1 ● Category: Time Series ● Task: Sequence Classification ● Model: CNN ● Data: Diatom Images ● Analysis: Accuracy-Time Tradeoff ● Result: Similar accuracy, 33% of the inference time Multimetric Use Case 2 ● Category: NLP ● Task: Sentiment Analysis ● Model: CNN ● Data: Rotten Tomatoes Movie Reviews ● Analysis: Accuracy-Time Tradeoff ● Result: ~2% in accuracy versus 50% of the training time Learn more https://devblogs.nvidia.com/sigopt-deep-learning-hyperparameter-optimization/ Use Case: Balancing Speed & Accuracy in Deep Learning
  27. Design: Question answering data and memory networks Data Model Sources: Facebook AI Research (FAIR) bAbI dataset: https://research.fb.com/downloads/babi/ Sukhbaatar et al.: https://arxiv.org/abs/1503.08895
  28. Comparison of Bayesian Optimization and Random Search Setup: Hyperparameter Optimization Standard Parameters Conditional Parameters
  29. Result: Significant boost in consistency, accuracy Comparison across random search versus Bayesian optimization with conditionals
  30. Result: Highly cost efficient accuracy gains Comparison across random search versus Bayesian optimization with conditionals SigOpt is 18.5x as efficient
  31. SigOpt. Confidential. How to continuously and efficiently utilize your project’s allotted compute infrastructure 3
  32. SigOpt. Confidential. Utilize compute through asynchronous parallel optimization SigOpt natively handles parallel function evaluation with the primary goal of minimizing overall wall-clock time. Parallelism also provides: • Faster time-to-results — minimized overall wall-clock time • Full resource utilization — asynchronous parallel optimization • Scaling with infrastructure — optimize across the number of available compute resources We believe this is essential to increasing research productivity by lowering the time-to-results and scaling with available infrastructure. 32 Continuously and efficiently utilize infrastructure
  33. SigOpt. Confidential. Your firewall Training Data AI, ML, DL, Simulation Model Model Evaluation or Backtest Testing Data New Configurations Objective Metric Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Configuration Parameters or Hyperparameters Continuously and efficiently utilize infrastructure
  34. SigOpt. Confidential. Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Worker Continuously and efficiently utilize infrastructure
  35. SigOpt. Confidential. Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Worker Worker Worker Worker Continuously and efficiently utilize infrastructure
  36. SigOpt. Confidential. Parallel function evaluations 36 Parallel function evaluations are a way of efficiently maximizing a function while using all available compute resources [Ginsbourger et al, 2008, Garcia-Barcos et al. 2019]. • Choosing points by jointly maximizing criteria over the entire set • Asynchronously evaluating over a collection of points • Fixing points which are currently being evaluated while sampling new ones Continuously and efficiently utilize infrastructure 1D - Acquisition Function 2D - Acquisition Function
  37. SigOpt. Confidential. Parallel optimization: multiple worker nodes jointly optimize over a given function 37 Parallel bandwidth = 1 Parallel bandwidth = 2 Parallel bandwidth = 3 Parallel bandwidth = 4 Parallel bandwidth = 5 Next point(s) to evaluate: Parallel bandwidth represents the number of available compute resources Statistical Model Continuously and efficiently utilize infrastructure
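A hedged sketch of the asynchronous pattern, assuming the classic SigOpt Python client: the experiment is created once with `parallel_bandwidth` set to the number of workers, and each worker independently asks for a suggestion, evaluates it, and reports an observation (`evaluate_model` is a hypothetical stand-in for your training job).

```python
# Asynchronous parallel optimization (sketch, classic SigOpt client assumed).
from sigopt import Connection

API_TOKEN = "YOUR_API_TOKEN"  # placeholder


def evaluate_model(assignments):
    """Hypothetical training/evaluation job for one suggested configuration."""
    return -((assignments["learning_rate"] - 0.01) ** 2)


def worker(experiment_id, num_observations):
    """Run one of these per available compute resource (e.g. one per GPU)."""
    conn = Connection(client_token=API_TOKEN)
    for _ in range(num_observations):
        suggestion = conn.experiments(experiment_id).suggestions().create()
        value = evaluate_model(suggestion.assignments)
        conn.experiments(experiment_id).observations().create(
            suggestion=suggestion.id, value=value
        )


conn = Connection(client_token=API_TOKEN)
experiment = conn.experiments().create(
    name="Asynchronous parallel tuning",
    parameters=[dict(name="learning_rate", type="double", bounds=dict(min=1e-4, max=1.0))],
    metrics=[dict(name="accuracy", objective="maximize")],
    observation_budget=200,
    parallel_bandwidth=5,  # number of suggestions evaluated concurrently
)
# Launch worker(experiment.id, 40) on 5 machines or processes; no coordination
# between workers is needed beyond the shared experiment id.
```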
  38. Parallelism Use Case ● Category: NLP ● Task: Sentiment Analysis ● Model: CNN ● Data: Rotten Tomatoes Movie Reviews ● Analysis: Predicting Positive vs. Negative Sentiment ● Result: 400x speedup Learn more https://aws.amazon.com/blogs/machine-learning/fast-cnn-tuning-with-aws-gpu-instances-and-sigopt/ Use Case: Fast CNN Tuning with AWS GPU Instances
  39. Variables we tested in this experiment Axes of exploration: • 6 hyperparameters versus 10 parameters (including SGD and alternate architectures) • CPU versus GPU compute • Grid search versus random search versus Bayesian optimization Results that we considered: • Accuracy • Compute cost • Wall clock time
  40. The parameters to tune A deep dive
  41. Results Speed and accuracy SigOpt helps you train your model faster and achieve higher accuracy. This results in higher practitioner productivity and better business outcomes. While random and grid search for hyperparameters do yield an accuracy improvement, SigOpt achieves better results on both dimensions.
  42. Results: 8x more cost efficient performance boost Detailed performance across different optimization processes
      Experiment Type                    | Accuracy     | Trials | Epochs  | CPU Time  | CPU Cost   | GPU Time  | GPU Cost | Link | % Change | % Δ per Comp $
      Default (No Tuning)                | 75.70        | 1      | 50      | 2 hours   | $1.82      | 0 hours   | $0.04    | NA   | 0        | 0
      Grid Search (SGD Only)             | 79.30        | 729    | 38394   | 64 days   | $1401.38   | 32 hours  | $27.58   | here | 4.60     | 0.13
      Random Search (SGD Only)           | 79.94        | 2400   | 127092  | 214 days  | $4638.86   | 106 hours | $91.29   | here | 4.24     | 0.05
      SigOpt Search (SGD Only)           | 80.40        | 240    | 15803   | 27 days   | $576.81    | 13 hours  | $11.35   | here | 4.70     | 0.42
      Grid Search (SGD + Architecture)   | Not Feasible | 59049  | 3109914 | 5255 days | $113511.86 | 107 days  | $2233.95 | NA   | NA       | NA
      Random Search (SGD + Architecture) | 80.12        | 4000   | 208729  | 353 days  | $7618.61   | 174 hours | $149.94  | here | 4.42     | 0.03
      SigOpt Search (SGD + Architecture) | 81.00        | 400    | 30060   | 51 days   | $1097.19   | 25 hours  | $21.59   | here | 5.30     | 0.25
      % Δ per Comp $ is calculated using GPU compute
  43. Results: the experimentation loop The training pipeline: from data to optimization and evaluation
  44. SigOpt. Confidential. How to tune models with expensive training costs 4
  45. SigOpt. Confidential. How to efficiently minimize time to optimize any function SigOpt’s multitask feature is an efficient way for modelers to tune models with expensive training costs, with the benefits of: • Faster time-to-market — The ability to bring expensive models into production faster • Reduction in infrastructure cost — Intelligently leverage infrastructure while reducing cost Through novel research, SigOpt helps the user lower the overall time-to-market while reducing the overall compute budget. 45 Expensive Training Cost
  46. SigOpt. Confidential. Your firewall Training Data AI, ML, DL, Simulation Model Model Evaluation or Backtest Testing Data New Configurations Objective Metric Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Configuration Parameters or Hyperparameters Expensive Training Cost
  47. SigOpt. Confidential. Better Results EXPERIMENT INSIGHTS Track, organize, analyze and reproduce any model ENTERPRISE PLATFORM Built to fit any stack and scale with your needs OPTIMIZATION ENGINE Explore and exploit with a variety of techniques REST API Expensive Training Cost
  48. SigOpt. Confidential. Using cheap or free information to speed learning 48 Sources: Aaron Klein, Frank Hutter, et al.: https://arxiv.org/abs/1605.07079 SigOpt allows the user to define lower-cost functions in order to quickly optimize expensive functions • Cheaper-cost functions can be flexible (fewer epochs, subsampled data, other custom features) • Use cheaper tasks earlier in the tuning process to explore • Inform more expensive tasks later by exploiting what we learn • In the process, reduce the full time required to tune an expensive model Expensive Training Cost
  49. SigOpt. Confidential. Using cheap or free information to speed learning We can build better models using inaccurate data to help point the actual optimization in the right direction with less cost. • Using a warm start through multi-task learning logic [Swersky et al, 2014] • Combining good anytime performance with active learning [Klein et al, 2018] • Accepting data from multiple sources without priors [Poloczek et al, 2017] 49 Expensive Training Cost
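A hedged sketch of a multitask experiment: cheaper, lower-fidelity tasks (e.g. fewer epochs or subsampled data) are used early to explore, and the full-cost task is used later to exploit what was learned. This assumes the classic SigOpt Python client; the `tasks` and `suggestion.task` fields are written as I recall them, so check the current documentation before relying on them, and `train_for` is a hypothetical stand-in.

```python
# Multitask optimization for an expensive-to-train model (sketch).
from sigopt import Connection


def train_for(assignments, epochs):
    """Hypothetical: train with the suggested hyperparameters for `epochs` epochs."""
    return 0.80 + 0.01 * epochs / 50.0  # placeholder accuracy


conn = Connection(client_token="YOUR_API_TOKEN")
experiment = conn.experiments().create(
    name="Multitask CNN tuning",
    parameters=[
        dict(name="log_learning_rate", type="double", bounds=dict(min=-9.0, max=0.0)),
    ],
    metrics=[dict(name="accuracy", objective="maximize")],
    observation_budget=150,
    tasks=[
        dict(name="quarter_epochs", cost=0.25),  # cheap approximation of the metric
        dict(name="full_epochs", cost=1.00),     # the expensive target task
    ],
)

suggestion = conn.experiments(experiment.id).suggestions().create()
# The optimizer chooses the fidelity; scale the training run to match it.
epochs = 50 if suggestion.task.name == "full_epochs" else 12
accuracy = train_for(suggestion.assignments, epochs=epochs)
conn.experiments(experiment.id).observations().create(
    suggestion=suggestion.id, value=accuracy
)
```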
  50. Use Case: Image Classification on a Budget Use Case ● Category: Computer Vision ● Task: Image Classification ● Model: CNN ● Data: Stanford Cars Dataset ● Analysis: Architecture Comparison ● Result: 2.4% accuracy gain with a much shallower model Learn more https://mlconf.com/blog/insights-for-building-high-performing-image-classification-models/
  51. SigOpt. Confidential. Architecture: Classifying images of cars using ResNet 51 Convolutions Classification ResNet Input Acura TLX 2015 Output Label Sources: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: https://arxiv.org/abs/1512.03385
  52. SigOpt. Confidential. Training setup comparison: both setups map input to convolutional features to classification using ImageNet-pretrained convolutional layers plus a fully connected layer. In fine-tuning, the full network is tuned; in the feature-extractor setup, the pretrained convolutional layers are frozen and only the fully connected layer is tuned.
  53. SigOpt. Confidential. Hyperparameter setup 53
      Hyperparameter            | Lower Bound | Upper Bound
      Log Learning Rate         | 1.2e-4      | 1.0
      Learning Rate Scheduler   | 0           | 0.99
      Batch Size (powers of 2)  | 16          | 256
      Nesterov                  | False       | True
      Log Weight Decay          | 1.2e-5      | 1.0
      Momentum                  | 0           | 0.9
      Scheduler Step            | 1           | 20
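One reasonable way to hand these ranges to the optimizer is sketched below, assuming a SigOpt-style parameter schema; the parameter names are illustrative (not taken from the original deck), the log-scaled entries are searched in log space and exponentiated inside the training code, and the batch size is searched as a power of two.

```python
import math

# Sketch of the hyperparameter space from the table above.
parameters = [
    dict(name="log_learning_rate", type="double",
         bounds=dict(min=math.log(1.2e-4), max=math.log(1.0))),
    dict(name="lr_scheduler_gamma", type="double", bounds=dict(min=0.0, max=0.99)),
    dict(name="log2_batch_size", type="int", bounds=dict(min=4, max=8)),  # 16..256
    dict(name="nesterov", type="categorical",
         categorical_values=[dict(name="false"), dict(name="true")]),
    dict(name="log_weight_decay", type="double",
         bounds=dict(min=math.log(1.2e-5), max=math.log(1.0))),
    dict(name="momentum", type="double", bounds=dict(min=0.0, max=0.9)),
    dict(name="scheduler_step", type="int", bounds=dict(min=1, max=20)),
]

# Inside the training code: learning_rate = math.exp(assignments["log_learning_rate"]),
# batch_size = 2 ** assignments["log2_batch_size"], and so on.
```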
  54. SigOpt. Confidential. 54 Insight: Multitask efficiency at the hyperparameter level Example: learning rate accuracy and values by cost of task over time (progression of observations over time; accuracy and value for each observation; parameter importance analysis)
  55. SigOpt. Confidential. Results: Optimizing and tuning the full network outperforms 55 Fine-tuning the smaller network significantly outperforms feature extraction on a bigger network; multitask optimization drives significant performance gains (+3.92% for fine-tuning vs. +1.58% for feature extraction)
  56. SigOpt. Confidential. Implication: Fine-tuning significantly outperforms Cost Breakdown for Multitask Optimization Cost efficiency
                                  | Feature Extractor ResNet 50 | Fine-Tuning ResNet 18
      Hours per training          | 4.08                        | 4.2
      Observations                | 220                         | 220
      Number of Runs              | 1                           | 1
      Total compute hours         | 898                         | 924
      Cost per GPU-hour           | $0.90                       | $0.90
      % Improvement               | 1.58%                       | 3.92%
      Total compute cost          | $808                        | $832
      Cost ($) per % improvement  | $509                        | $212
      Similar compute cost and similar wall-clock time: fine-tuning is significantly more efficient and effective
  57. SigOpt. Confidential. Thank you