Nov. 14, 2016

Using Bayesian Optimization to Tune Machine Learning Models: In this talk we briefly introduce Bayesian Global Optimization as an efficient way to optimize machine learning model parameters, especially when evaluating different parameters is time-consuming or expensive. We motivate the problem and give example applications. We also describe our development of a robust benchmark suite for our algorithms, including test selection, metric design, infrastructure architecture, visualization, and comparison to other standard and open source methods, and discuss how this evaluation framework empowers our research engineers to confidently and quickly make changes to our core optimization engine. We end with an in-depth example of using these methods to tune the features and hyperparameters of a real-world problem and give several real-world applications.

MLconf

- BAYESIAN GLOBAL OPTIMIZATION: Using Optimal Learning to Tune ML Models. Scott Clark, scott@sigopt.com
- OUTLINE
  1. Why is Tuning ML Models Hard?
  2. Standard Tuning Methods
  3. Bayesian Global Optimization
  4. Comparing Optimizers
  5. Real World Examples
- Machine Learning is extremely powerful. Tuning Machine Learning systems is extremely non-intuitive.
- What is the most important unresolved problem in machine learning? “...we still don't really know why some configurations of deep neural networks work in some case and not others, let alone having a more or less automatic approach to determining the architectures and the hyperparameters.” Xavier Amatriain, VP Engineering at Quora (former Director of Research at Netflix). Source: https://www.quora.com/What-is-the-most-important-unresolved-problem-in-machine-learning-3
- Photo: Joe Ross
- TUNABLE PARAMETERS IN DEEP LEARNING
- Photo: Tammy Strobel
- STANDARD METHODS FOR HYPERPARAMETER SEARCH
- EXAMPLE: FRANKE FUNCTION
- [Figure panels: Grid Search, Random Search]
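
The two standard methods can be compared on the Franke function from the previous slide. The sketch below is illustrative only: the 25-evaluation budget, the unit-square domain, the seed, and the function names `grid_search` and `random_search` are assumptions, not details from the talk.

```python
import math
import random

def franke(x, y):
    """Franke's bivariate test function, usually evaluated on the unit square."""
    t1 = 0.75 * math.exp(-((9 * x - 2) ** 2) / 4 - ((9 * y - 2) ** 2) / 4)
    t2 = 0.75 * math.exp(-((9 * x + 1) ** 2) / 49 - (9 * y + 1) / 10)
    t3 = 0.5 * math.exp(-((9 * x - 7) ** 2) / 4 - ((9 * y - 3) ** 2) / 4)
    t4 = 0.2 * math.exp(-((9 * x - 4) ** 2) - (9 * y - 7) ** 2)
    return t1 + t2 + t3 - t4

def grid_search(f, points_per_axis):
    """Evaluate f on an evenly spaced grid over [0, 1]^2; return (best value, best point)."""
    axis = [i / (points_per_axis - 1) for i in range(points_per_axis)]
    return max((f(x, y), (x, y)) for x in axis for y in axis)

def random_search(f, budget, seed=0):
    """Evaluate f at uniformly random points in [0, 1]^2; return (best value, best point)."""
    rng = random.Random(seed)
    samples = [(rng.random(), rng.random()) for _ in range(budget)]
    return max((f(x, y), (x, y)) for x, y in samples)

# Same evaluation budget (25) for both methods.
grid_best, grid_point = grid_search(franke, points_per_axis=5)
rand_best, rand_point = random_search(franke, budget=25)
print(f"grid:   {grid_best:.4f} at {grid_point}")
print(f"random: {rand_best:.4f} at {rand_point}")
```

Neither method uses earlier evaluations to decide where to sample next, which is the gap Bayesian optimization addresses.
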
- TUNING MACHINE LEARNING MODELS (diagram labels: Big Data, Predictive Models, Objective Metric, New parameters, Better Models)
- BAYESIAN GLOBAL OPTIMIZATION
- OPTIMAL LEARNING
  - “… the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive.” Prof. Warren Powell, Princeton
  - “What is the most efficient way to collect information?” Prof. Peter Frazier, Cornell
  - “How do we make the most money, as fast as possible?” Scott Clark, CEO, SigOpt
- BAYESIAN GLOBAL OPTIMIZATION
  - Optimize objective function
    - Loss, Accuracy, Likelihood
  - Given parameters
    - Hyperparameters, feature parameters
  - Find the best hyperparameters
    - Sample function as few times as possible
    - Training on big data is expensive
- HOW DOES IT WORK?
  1. Build Gaussian Process (GP) with points sampled so far
  2. Optimize the fit of the GP (covariance hyperparameters)
  3. Find the point(s) of highest Expected Improvement within parameter domain
  4. Return optimal next best point(s) to sample
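
The four steps above can be sketched in one dimension with only the standard library. This is a minimal illustration under stated assumptions, not SigOpt's implementation: the zero-mean GP, the squared-exponential kernel, its fixed length scale (step 2, fitting the covariance hyperparameters, is skipped), the candidate grid, and the toy `objective` are all hypothetical choices.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential covariance with fixed (assumed) hyperparameters."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z

def solve_upper_t(L, z):
    """Back substitution: solve L^T x = z."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Step 1: posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, ys))
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    variance = max(rbf(x_new, x_new) - sum(vi * vi for vi in v), 1e-12)
    return mean, math.sqrt(variance)

def expected_improvement(mean, std, best):
    """Closed-form Expected Improvement for maximization."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf

def suggest(xs, ys, candidates):
    """Steps 3 and 4: return the unsampled candidate with the highest EI."""
    best = max(ys)
    unsampled = [c for c in candidates if c not in xs]
    return max(unsampled,
               key=lambda c: expected_improvement(*gp_posterior(xs, ys, c), best))

def objective(x):
    """Hypothetical expensive function to maximize (peak at x = 0.3)."""
    return math.exp(-((x - 0.3) ** 2) / 0.02)

xs = [0.0, 0.5, 1.0]                      # initial samples
ys = [objective(x) for x in xs]
candidates = [i / 100 for i in range(101)]
for _ in range(8):                        # sequential suggest / evaluate loop
    x_next = suggest(xs, ys, candidates)
    xs.append(x_next)
    ys.append(objective(x_next))
print(f"best value {max(ys):.4f} at x = {xs[ys.index(max(ys))]:.2f}")
```

A production optimizer would also fit the kernel hyperparameters at each iteration and search the domain continuously rather than over a fixed grid.
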
- GAUSSIAN PROCESSES
- GAUSSIAN PROCESSES (figure panels: overfit, good fit, underfit)
- EXPECTED IMPROVEMENT
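
The Expected Improvement slides are figures only. For reference, the standard closed form for a maximization problem is given below, where the symbols are notation assumed here rather than taken from the slides: $\mu(x)$ and $\sigma(x)$ are the GP posterior mean and standard deviation, $f^{*}$ is the best value observed so far, and $\Phi$, $\varphi$ are the standard normal CDF and PDF.

```latex
\mathrm{EI}(x) = \mathbb{E}\left[\max\bigl(f(x) - f^{*},\, 0\bigr)\right]
             = \bigl(\mu(x) - f^{*}\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)}, \quad \sigma(x) > 0
```

The first term rewards points whose predicted mean already beats the incumbent (exploitation); the second rewards points with high posterior uncertainty (exploration).
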
- EVALUATING THE OPTIMIZER
- METRIC: BEST FOUND. What is the best value found after optimization completes?

  | Metric | BLUE | RED |
  |---|---|---|
  | BEST_FOUND | 0.7225 | 0.8949 |
- METRIC: AUC. How quickly is the optimum found? (area under curve)

  | Metric | BLUE | RED |
  |---|---|---|
  | BEST_FOUND | 0.9439 | 0.9435 |
  | AUC | 0.8299 | 0.9358 |
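
Both metrics can be computed from the sequence of objective values seen during one optimization run. The sketch below is one plausible reading of the slides, not the benchmark's actual definition: the normalization of the best-seen trace to [0, 1] using known `lower` and `upper` bounds, and the function names, are assumptions.

```python
def best_seen_trace(values):
    """Running maximum of the objective values, one entry per evaluation."""
    trace, best = [], float("-inf")
    for v in values:
        best = max(best, v)
        trace.append(best)
    return trace

def best_found(values):
    """BEST_FOUND: the best value seen once the run completes."""
    return max(values)

def auc(values, lower, upper):
    """AUC: mean of the best-seen trace after rescaling to [0, 1] (assumed normalization)."""
    trace = best_seen_trace(values)
    return sum((t - lower) / (upper - lower) for t in trace) / len(trace)

# Two hypothetical runs that reach the same optimum at different speeds.
slow = [0.1, 0.2, 0.5, 0.9]
fast = [0.9, 0.1, 0.2, 0.5]
print(best_found(slow), best_found(fast))
print(auc(slow, 0.0, 1.0), auc(fast, 0.0, 1.0))
```

The two runs tie on BEST_FOUND, but the run that finds the optimum earlier gets the higher AUC, which is the distinction the second table illustrates.
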
- BENCHMARK SUITE
  - Optimization functions (e.g. Branin, Ackley, Rosenbrock)
  - ML datasets (LIBSVM)

  | Test function type | Count |
  |---|---|
  | Continuous Params | 184 |
  | Noisy Observations | 188 |
  | Parallel Observations | 45 |
  | Integer Params | 34 |
  | Categorical Params / ML | 47 |
  | Failure Observations | 30 |
  | TOTAL | 489 |
- INFRASTRUCTURE
  - On-demand cluster in AWS for parallel eval function optimization
  - Full eval consists of ~10,000 optimizations, taking ~4 hours
- VIZ TOOL: BEST SEEN TRACES
- METRICS: STOCHASTICITY
  - Run each 20 times
  - Mann-Whitney U test for significance
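
A minimal sketch of the significance test, assuming two lists of per-run results. The U statistic is computed by pairwise comparison; the two-sided p-value uses the normal approximation without tie or continuity correction, which is a simplification chosen here. In practice `scipy.stats.mannwhitneyu` would be the usual choice.

```python
import math

def mann_whitney_u(a, b):
    """Return (U statistic for sample a, approximate two-sided p-value)."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0      # a "wins" this pair
            elif x == y:
                u += 0.5      # ties count half
    n1, n2 = len(a), len(b)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided, normal approximation
    return u, p_value

# Hypothetical BEST_FOUND values from repeated runs of two optimizers.
method_a = [0.81, 0.84, 0.86, 0.88, 0.90]
method_b = [0.70, 0.72, 0.75, 0.79, 0.80]
u, p = mann_whitney_u(method_a, method_b)
print(f"U = {u}, p = {p:.4f}")
```
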
- RANKING OPTIMIZERS
  - Alternate methods exist for black box optimization: Spearmint, TPE, SMAC, PSO, random search, grid search
  - Important to understand and track method performance disparity on high-level categories of functions
  - For a given test function, want a partial ranking (allowing for ties) of method performance
- RANKING OPTIMIZERS
  - First, Mann-Whitney U tests using BEST_FOUND
  - Tied results are then partially ranked using AUC
  - Any remaining ties stay as ties in the final ranking
- RANKING AGGREGATION
  - Aggregate partial rankings across all eval functions using Borda count (sum of methods ranked lower)
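
The Borda aggregation can be sketched as follows, assuming each test function yields a partial ranking stored as a mapping from method name to rank (1 is best, tied methods share a rank). The method names and the data layout are hypothetical.

```python
from collections import defaultdict

def borda_scores(rankings):
    """Sum, over all test functions, the number of methods ranked strictly lower."""
    totals = defaultdict(int)
    for ranks in rankings:
        for method, rank in ranks.items():
            totals[method] += sum(1 for other in ranks.values() if other > rank)
    return dict(totals)

# Partial rankings on two hypothetical test functions.
rankings = [
    {"A": 1, "B": 1, "C": 3},   # A and B tie for first, C is last
    {"A": 1, "B": 2, "C": 2},   # A is first, B and C tie for second
]
scores = borda_scores(rankings)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Because tied methods earn no points from each other, a tie neither rewards nor penalizes either method.
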
- SHORT RESULTS SUMMARY
- SIGOPT SERVICE
- HOW DOES SIGOPT INTEGRATE? (diagram labels: Big Data, Predictive Models, Objective Metric, New parameters, Better Models)
- SIMPLIFIED MANAGEMENT: Before SigOpt
- DISTRIBUTED MODEL TRAINING
  - SigOpt serves as an AWS-ready distributed scheduler for training models across workers
  - Each worker accesses the SigOpt API for the latest parameters to try
  - Enables distributed training of non-distributed algorithms
- INTEGRATIONS: REST API
- Questions? scott@sigopt.com @DrScottClark https://sigopt.com @SigOpt
- SHORT EXAMPLES
- EXAMPLE: LOAN DATA
  - Input: loan applications (Income, Credit Score, Loan Amount)
  - Model: default prediction with tunable ML parameters
  - Objective: prediction accuracy
  - Diagram labels: New parameters, Better Accuracy
- COMPARATIVE PERFORMANCE
  - [Plot: accuracy (AUC, axis ticks .675 to .698) against cost ($1,000 / 100 hrs, $10,000 / 1,000 hrs, $100,000 / 10,000 hrs), with Grid Search and Random Search marked]
  - Better: 22% fewer bad loans vs baseline
  - Faster/Cheaper: 100x less time and AWS cost than standard tuning methods
- EXAMPLE: ALGORITHMIC TRADING
  - Input: market data (Closing Prices, Day of Week, Market Volatility)
  - Model: trading strategy with tunable weights and thresholds
  - Objective: expected revenue
  - Diagram labels: New parameters, Higher Returns
- COMPARATIVE PERFORMANCE
  - [Plot labels: Standard Method, Expert]
  - Better: 200% higher model returns than expert
  - Faster/Cheaper: 10x faster than standard methods
- ADDITIONAL TOPICS
  1. SigOpt Live Demo
  2. More Examples
     a. Text Classification
     b. Unsupervised + Supervised
     c. Neural Nets with TensorFlow
- AUTOMATICALLY TUNING TEXT SENTIMENT CLASSIFIER
- PROBLEM
  - Automatically tune a text sentiment classifier
  - Amazon product review dataset (35K labels), e.g. “This DVD is great. It brings back all the memories of the holidays as a young child.”
  - Logistic regression is generally a good place to start
- OBJECTIVE FUNCTION
  - Maximize the mean of k-fold cross-validation accuracies
  - k = 5 folds, train and validation sets randomly split 70% / 30%
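
A sketch of such an objective, reading the slide as five independent random 70/30 splits whose validation accuracies are averaged (an interpretation, since a strict 5-fold split would hold out 20% per fold). `train_and_score` is a hypothetical callback that trains on the first argument and returns accuracy on the second; the majority-class baseline stands in for the real classifier.

```python
import random

def cv_objective(examples, train_and_score, k=5, train_fraction=0.7, seed=0):
    """Mean validation accuracy over k random train/validation splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(k):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / k

def majority_baseline(train, valid):
    """Stand-in model: always predict the most common training label."""
    labels = [label for _, label in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, label in valid if label == guess) / len(valid)

# Hypothetical labelled data: (feature, label) pairs with labels in {-1, 1}.
data = [(i, 1 if i % 4 else -1) for i in range(100)]
score = cv_objective(data, majority_baseline)
print(f"mean validation accuracy: {score:.3f}")
```

An optimizer would call `cv_objective` once per suggested parameter configuration and report the returned value as the metric to maximize.
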
- TEXT FEATURE PARAMETERS
  - n-gram vocabulary selection parameters
  - (min_n_gram, ngram_offset) determine which n-grams are used
  - (log_min_df, df_offset) filter for n-grams within a document-frequency (df) range
  - Original text: “SigOpt optimizes any complicated system”
    - 1-grams: { “SigOpt”, “optimizes”, “any”, “complicated”, “system” }
    - 2-grams: { “SigOpt_optimizes”, “optimizes_any”, “any_complicated” … }
    - 3-grams: { “SigOpt_optimizes_any”, “optimizes_any_complicated” … }
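
The n-gram example on this slide can be reproduced with a few lines of Python. The sketch assumes that n ranges from `min_n_gram` to `min_n_gram + ngram_offset` inclusive, which is a guess at how the two parameters combine; the document-frequency filter is not shown.

```python
def ngrams(text, min_n_gram, ngram_offset):
    """Return {n: list of n-grams} for n in [min_n_gram, min_n_gram + ngram_offset]."""
    tokens = text.split()
    grams = {}
    for n in range(min_n_gram, min_n_gram + ngram_offset + 1):
        grams[n] = ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

result = ngrams("SigOpt optimizes any complicated system", min_n_gram=1, ngram_offset=2)
for n, grams in result.items():
    print(n, grams)
```
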
- ERROR COST PARAMETERS
  - Logistic regression error cost parameters:
    - M: number of training examples
    - θ: vector of weights the algorithm will learn for each n-gram in the vocabulary
    - y_i: training data label, {-1, 1} for our two-class problem
    - x_i: training data input vector (the BOW vectors described in the previous section)
    - α: weight of the regularization term (log_reg_coef in our experiment)
    - ρ: weight of the l1 norm term (l1_coef in our experiment)
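
The cost formula itself did not survive on this slide. One common elastic-net form that is consistent with the symbols listed above is shown below; it is an assumption, not a reconstruction of the original, and the exact scaling of the two norm terms (for example a factor of 1/2 on the squared l2 norm) varies between libraries.

```latex
J(\theta) = \frac{1}{M} \sum_{i=1}^{M} \log\left(1 + e^{-y_i\, \theta^{\top} x_i}\right)
          + \alpha \left( \rho\, \lVert \theta \rVert_1 + (1 - \rho)\, \lVert \theta \rVert_2^2 \right)
```

Under this form, α sets the overall regularization strength and ρ moves the penalty between pure l2 (ρ = 0) and pure l1 (ρ = 1).
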
- PYTHON CODE
  - 50-line Python snippet to train and tune the classifier with SigOpt
  - 20 lines to define the 6-parameter experiment and run the optimization loop using SigOpt
- PERFORMANCE
  - E[f(λ)] after 20 runs, each run consisting of 60 function evaluations
  - For grid search: 64 evenly spaced parameter configurations (order shuffled randomly)
  - SigOpt statistical significance over grid and random search (p = 0.0001, Mann-Whitney U test)

  | | SigOpt | Rnd. Search | Grid Search | No Tuning (Baseline) |
  |---|---|---|---|---|
  | Best Found | 0.8760 (+5.72%) | 0.8673 (+4.67%) | 0.8680 (+4.76%) | 0.8286 |
- EXPLOITING UNLABELLED DATA
- PROBLEM
  - Classify house number digits with little labelled data
  - Challenging digit variations, image clutter with neighboring digits
- OBJECTIVE
  - In general we’ll search for an optimized ML pipeline
- UNSUPERVISED MODEL PARAMS
  - Transform image patches into vectors of centroid distances, then pool to form the final representation
  - SigOpt optimizes the selection of w, pool_r, K
- UNSUPERVISED MODEL PARAMS
  - A whitening transform is often useful as an image pre-processing step; expose ε_ZCA to SigOpt
- UNSUPERVISED MODEL PARAMS
  - Tune the sparsity of the centroid distance transform
  - SigOpt optimizes the threshold (active_p) selection
- SUPERVISED MODEL PARAMS
  - Learning rate, number of trees, and tree parameters (max_depth, sub_sample_sz) exposed to SigOpt
- METRIC OPTIMIZATION
- PERFORMANCE
  - 20 optimization runs, each run consisting of 90 / 40 function evaluations for Unsup / Raw feature settings
  - Optimized single CV fold on training set, ACC reported on test set as hold out

  | Method | Hold Out ACC |
  |---|---|
  | SigOpt (Unsup Feats) | 0.8708 (+51.4%) |
  | Rnd Search (Unsup Feats) | 0.8583 |
  | SigOpt (Raw Feats) | 0.6844 |
  | Rnd Search (Raw Feats) | 0.6739 |
  | No Tuning RF (Raw Feats) | 0.5751 |
- EFFICIENTLY BUILDING CONVNETS
- PROBLEM
  - Classify house numbers with more training data and a more sophisticated model
- CONVNET STRUCTURE
  - TensorFlow makes it easier to design DNN architectures, but what structure works best on a given dataset?
- STOCHASTIC GRADIENT DESCENT
  - Per-parameter adaptive SGD variants like RMSProp and Adagrad seem to work best
  - They still require careful selection of the learning rate (α), momentum (β), and decay (γ) terms
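
To show where the three terms enter, here is a scalar sketch of one common RMSProp-with-momentum update. The exact update rule differs between frameworks, so this form, the default values, and the toy quadratic are assumptions for illustration only.

```python
import math

def rmsprop_minimize(grad, theta, learning_rate=0.01, momentum=0.9,
                     decay=0.9, steps=500, epsilon=1e-10):
    """Minimize a scalar function given its gradient, using RMSProp with momentum."""
    mean_square = 0.0   # running average of squared gradients (decay, gamma)
    velocity = 0.0      # accumulated update (momentum, beta)
    for _ in range(steps):
        g = grad(theta)
        mean_square = decay * mean_square + (1 - decay) * g * g
        # Scale the step by the root of the running mean square (learning rate, alpha).
        velocity = momentum * velocity + learning_rate * g / math.sqrt(mean_square + epsilon)
        theta -= velocity
    return theta

# Toy problem: minimize (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = rmsprop_minimize(lambda t: 2 * (t - 3), theta=0.0)
print(f"theta after optimization: {theta:.3f}")
```

Changing any of the three values alters both how fast the iterate approaches the minimum and how much it oscillates around it, which is why these terms are natural candidates for automated tuning.
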
- STOCHASTIC GRADIENT DESCENT
  - Comparison of several RMSProp SGD parametrizations
  - Not obvious which configurations will work best on a given dataset without experimentation
- METRIC OPTIMIZATION
- PERFORMANCE
  - Average hold out accuracy after 5 optimization runs consisting of 80 objective evaluations
  - Optimized single 80/20 CV fold on training set, ACC reported on test set as hold out

  | Method | Hold Out ACC |
  |---|---|
  | SigOpt (TensorFlow CNN) | 0.8130 (+315.2%) |
  | Rnd Search (TensorFlow CNN) | 0.5690 |
  | No Tuning (sklearn RF) | 0.5278 |
  | No Tuning (TensorFlow CNN) | 0.1958 |
- COST ANALYSIS

  | Model Performance (CV Acc. threshold) | Random Search Cost | SigOpt Cost | SigOpt Cost Savings | Potential Savings In Production (50 GPUs) |
  |---|---|---|---|---|
  | 87% | $275 | $42 | 84% | $12,530 |
  | 85% | $195 | $23 | 88% | $8,750 |
  | 80% | $46 | $21 | 55% | $1,340 |
  | 70% | $29 | $21 | 27% | $400 |
- Try it yourself! https://sigopt.com/getstarted
- MORE EXAMPLES
  - Automatically Tuning Text Classifiers (with code): a short example using SigOpt and scikit-learn to build and tune a text sentiment classifier.
  - Tuning Machine Learning Models (with code): a comparison of different hyperparameter optimization methods.
  - Using Model Tuning to Beat Vegas (with code): using SigOpt to tune a model for predicting basketball scores.
  - Learn more about the technology behind SigOpt at https://sigopt.com/research
