Valencian Summer School in Machine Learning
4th edition
September 13-14, 2018
OptiML
Hands-Free Parameter Tuning
Charles Parker
VP Algorithms, BigML, Inc
Parameter Optimization
• There are lots of algorithms and lots of parameters
• We don’t have time to try even close to everything
• If only we had a way to make a prediction . . .
Did I hear someone say
Machine Learning?
The Allure of ML
“Why don’t we just use
machine learning to predict
the quality of a set of
modeling parameters before
we train a model on them?”
— Every first year ML grad student ever
In This Talk
• Technology Review
• Metric Selection
• The Dangers of Naive Cross-validation
• Selecting the “Best” Model
• Caveat Emptor!
Bayesian Parameter Optimization
• The performance of an ML algorithm (with associated parameters) is data dependent
• So: learn from your previous attempts
• Train a model, then evaluate it
• After you’ve done a number of evaluations, learn a regression model to predict the performance of future, as-yet-untrained models
• Use this model to choose a promising set of “next models”
• Sound familiar?
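To make the loop concrete, here is a minimal Python sketch of the idea, using scikit-learn stand-ins rather than BigML’s actual implementation. A real Bayesian optimizer would also use the surrogate’s uncertainty (via an acquisition function) to balance exploration against exploitation; this sketch greedily exploits the surrogate’s point predictions.

```python
# Sketch of Bayesian-style parameter optimization; all names are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

def evaluate(params):
    """Train and evaluate one candidate: parameters -> performance."""
    model = RandomForestClassifier(n_estimators=int(params[0]),
                                   max_depth=int(params[1]), random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

rng = np.random.default_rng(0)
sample = lambda: [rng.integers(10, 200), rng.integers(2, 20)]

# Seed with a few random candidates (metalearning could pick these instead).
tried = [sample() for _ in range(5)]
scores = [evaluate(p) for p in tried]

for _ in range(10):
    # Learn a regression model: parameters -> predicted performance.
    surrogate = RandomForestRegressor(random_state=0).fit(tried, scores)
    # Score a pool of as-yet-untrained candidates; actually train the most promising.
    pool = [sample() for _ in range(100)]
    best = pool[int(np.argmax(surrogate.predict(pool)))]
    tried.append(best)
    scores.append(evaluate(best))

print("best parameters found:", tried[int(np.argmax(scores))])
```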
Bayesian Parameter Optimization
[Diagram: candidate parameter sets (Parameters 1–6) are each modeled and evaluated, yielding scores such as 0.75, 0.56, and 0.92; machine learning then fits the mapping parameters ⟶ performance]
Some Other Tricks
• Use metalearning to select a good set of initial candidates
• Cross-validation is expensive, and there’s no reason to do it for models with terrible performance; stop early in these cases (see the sketch below)
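A minimal sketch of that early-stopping trick, assuming scikit-learn-style models; the slack threshold and the give-up rule are illustrative, not BigML’s actual stopping criterion.

```python
# Abandon cross-validation early for candidates that are clearly hopeless.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def lazy_cv_score(model, X, y, best_so_far, n_splits=5, slack=0.1):
    """Cross-validate, but stop as soon as partial results look hopeless."""
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(fitted.score(X[test_idx], y[test_idx]))
        if np.mean(scores) < best_so_far - slack:
            return None  # terrible partial performance rarely recovers
    return np.mean(scores)

X, y = make_classification(n_samples=400, random_state=0)
print(lazy_cv_score(DecisionTreeClassifier(max_depth=1), X, y, best_so_far=0.95))
```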
Metric Selection
A Metric Selection Flowchart
[Flowchart: four yes/no questions route you to a metric. Is yours a “ranking” problem? If not: will you bother about threshold setting? If yes, use Max. Phi or the KS-statistic, which consider all thresholds; if no: is your dataset imbalanced? If yes, use the phi coefficient or the f-measure; if no, accuracy. If it is a ranking problem: do you care more about the top-ranked instances? If yes, use the area under the ROC / PR curve; if no, Kendall’s Tau or Spearman’s Rho.]
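For reference, here is how the flowchart’s metrics map onto common library calls; y_true, y_pred, and y_score are tiny placeholder arrays, and Max. Phi is obtained by sweeping matthews_corrcoef over candidate thresholds rather than by a single call.

```python
import numpy as np
from scipy.stats import kendalltau, ks_2samp, spearmanr
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])           # thresholded class labels
y_score = np.array([.1, .6, .8, .7, .2, .4])    # ranking scores

print(accuracy_score(y_true, y_pred))            # balanced data, fixed threshold
print(f1_score(y_true, y_pred))                  # f-measure, for imbalanced data
print(matthews_corrcoef(y_true, y_pred))         # the phi coefficient
print(roc_auc_score(y_true, y_score))            # area under the ROC curve
print(average_precision_score(y_true, y_score))  # area under the PR curve

ks, _ = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])  # KS-statistic
tau, _ = kendalltau(y_true, y_score)                          # Kendall's Tau
rho, _ = spearmanr(y_true, y_score)                           # Spearman's Rho
print(ks, tau, rho)
```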
Ranking Problems
Medical Diagnosis (no) vs. Stock Picking (yes)
Top-heavy Importance
Draft-Style Selections (no) vs. Customer Churn (yes)
The Dangers of Naive Cross-validation
Is Cross-Validation Right for You?
• Cross-validation is a good tool some of the time
• Many other times it is disastrously bad:
• Overly optimistic
• False confidence in results
• This is why we offer the option of a specific holdout set
Case #1: Market Direction
• Suppose you want to predict the direction of the stock market, or of any particular stock (disclaimer: this is hard)
• You have information for that market for each minute of each day
• But adjacent minutes are strongly correlated in both the input and the objective field
• So if you have the answer for one minute, you can trivially predict the rest!
• Cross-validation will tell you your classifier is near-perfect! (A time-ordered split, sketched below, avoids this.)
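A small illustration of the safer alternative, using made-up minute-level data: scikit-learn’s TimeSeriesSplit keeps every test point strictly after its training points, so the answer for one minute can never leak in from a neighboring minute in the training fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                # placeholder per-minute features
y = (rng.normal(size=1000) > 0).astype(int)   # placeholder up/down objective

# Each fold trains on an earlier window and tests on a strictly later one.
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=TimeSeriesSplit(n_splits=5))
print(scores)   # usually far humbler than shuffled cross-validation
```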
Case #2: Photo Age Prediction
• Suppose you want to predict the age of a printed photograph (based on dye fade, paper watermarks, the presence and type of border, etc.)
• Your training set: a few thousand photos from a few dozen people
• But the ages of one person’s photos are correlated in both the input and output spaces! (same age, camera, storage conditions, etc.)
• So you can trivially do well predicting the ages of some of one person’s photos if you know the ages of the rest
• Cross-validation will tell you your classifier is near-perfect!
Take Care!
• These situations are very common whenever data comes in batches (days, users, etc.)
• The solution is to hold out whole batches of data (e.g., a specific test set) rather than just random points from each one (as in cross-validation); see the sketch below
• It’s possible that it isn’t a problem in your dataset, but when in doubt, try both!
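A minimal sketch of that batch-wise holdout on synthetic data: scikit-learn’s GroupKFold guarantees that no person’s photos are split between training and test, which is exactly the whole-batch holdout described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))             # placeholder photo features
y = rng.uniform(1950, 2000, size=300)     # placeholder photo ages
person = np.repeat(np.arange(30), 10)     # 30 people, 10 photos each

# Every fold holds out whole people, never individual photos.
scores = cross_val_score(RandomForestRegressor(n_estimators=50), X, y,
                         groups=person, cv=GroupKFold(n_splits=5))
print(scores)
```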
Selecting the “Best” Model
Which Model is Best?
• Performance isn’t the only issue!
• Retraining: will the amount of data you have be different in the future?
• Fit stability: how confident must you be that the model’s behavior is invariant to small data changes?
• Prediction speed: the difference can be orders of magnitude
Modeling Tradeoffs
Moving from simple models (logistic regression) to complex ones (deepnets), you trade:
• Interpretability vs. Representability
• Weak vs. Slow
• Confidence vs. Performance
• Biased vs. Data-hungry
Caveat Emptor!
Mo’ Problems
• Model selection tends to take a lot of data, and the more accurate you want the search to be, the more data you need
• We had to define a search space that would suit “most” datasets; it’s possible that the right model for your data isn’t in there!
Fusions
Just Slam A Bunch of Stuff Together
Charles Parker
VP Algorithms, BigML, Inc
Much Ado About Fusions
• Diving into Fusions
• Some Pros and Cons
• Aside: Prediction Explanations
• Creating a Diverse Ensemble
Mixture of Experts
[Diagram, built up over three slides: several expert models each produce a prediction for the same input, raising the question of how those predictions should be combined]
Ensemble?
[Diagram: the constituent models’ predictions are aggregated into a single prediction]
Creating a Fusion
[Diagram: the same aggregation, built as a Fusion]
Fusion = Diverse Ensemble
[Diagram: a fusion aggregates the predictions of a diverse set of models into one prediction]
Other Techniques?
Stacking
[Diagram: a second-level model learns to combine the constituent models’ predictions]
Boosting
[Diagram: models are trained in sequence, each correcting the errors of the ones before it]
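To make the aggregation styles concrete, here is a scikit-learn sketch of both flavors (stand-ins, not BigML’s fusion implementation): plain probability averaging for a fusion-style diverse ensemble, and a trained second-level combiner for stacking. Predicting on the training data here is purely illustrative; a real comparison needs a holdout.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
members = [LogisticRegression(max_iter=1000),
           DecisionTreeClassifier(max_depth=5, random_state=0),
           RandomForestClassifier(n_estimators=50, random_state=0)]

# Fusion-style: aggregate by averaging the members' predicted probabilities.
probs = np.mean([m.fit(X, y).predict_proba(X) for m in members], axis=0)
fused = probs.argmax(axis=1)

# Stacking: a second-level model learns how to combine the members.
stack = StackingClassifier(
    estimators=[("lr", members[0]), ("dt", members[1]), ("rf", members[2])],
    final_estimator=LogisticRegression(max_iter=1000)).fit(X, y)
```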
Some Pros and Cons
Fusions vs. Single Models
Single models:
• A bit wobbly
• Regions of the input space might have under-performing predictions
• Probably pretty fast
• With OptiML, it’s the best thing we could find
Fusions:
• More stable
• Errors tend to be “smoothed out” across the entire input space
• Maybe somewhat slow
• You’ll have to do some additional validation to check performance
What About Performance?
• This is not typically a step that will result in huge performance gains, unless you’ve got significant feature diversity
• You’re usually better off doing feature engineering or acquiring more data
• Do it for stability
• . . . or to improve the importance profile
Importance Tuning
Feature Importance
• Importance is measured in different ways depending on the model type
• This is the global importance reported alongside each model
• Global importance is different from local importance!
• Local importance is given by prediction explanations
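As an illustration of the global side only, here is a scikit-learn sketch contrasting a model’s built-in importances with model-agnostic permutation importance. Local, per-prediction attributions (what BigML surfaces as prediction explanations) require a separate mechanism, such as SHAP or LIME, and are not shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

print(model.feature_importances_)   # model-type-specific global importance
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)      # model-agnostic global importance
```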
Global Importance
• What’s really important? Does it make sense?
Local Importance
Creating a Diverse Ensemble
Fusions Love Diversification
• Fusions work better if the predictions of the constituent models are all good but not correlated with one another
• One way to increase the chances of this is to use different feature sets that are not well-correlated
• Text provides a good opportunity to do this, because so many possible features can be generated from text data
Text Feature Makeover
• Stem / don’t stem
• Change aggressiveness of stop word removal
• Longer n-grams / ignore unigrams
(One such makeover is sketched below.)
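A sketch of such a makeover in scikit-learn terms: several weakly-correlated text views of the same documents, each feeding its own model, with the predictions averaged fusion-style. The crude_stem helper and the toy documents are purely illustrative (scikit-learn has no built-in stemmer).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the model ran quickly", "running models is fun",
        "slow models are frustrating", "fast runs make happy users"]
labels = np.array([1, 1, 0, 1])

def crude_stem(text):
    # Toy stand-in for a real stemmer: chop one common suffix.
    return " ".join(w[:-3] if w.endswith("ing") else w for w in text.split())

views = [TfidfVectorizer(),                         # plain unigrams
         TfidfVectorizer(preprocessor=crude_stem),  # "stemmed"
         TfidfVectorizer(stop_words="english"),     # aggressive stop words
         TfidfVectorizer(ngram_range=(2, 3))]       # longer n-grams, no unigrams

# One model per feature view; average their probabilities fusion-style.
probs = [LogisticRegression().fit(v.fit_transform(docs), labels)
             .predict_proba(v.transform(docs)) for v in views]
fused = np.mean(probs, axis=0)
print(fused.argmax(axis=1))
```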