Learning to Optimize

- Pramit Choudhary
Introduction
• What is a Machine Learning Model?
• What is Hyper-parameter Optimization?
• HPO pipeline
• Questions
Machine Learning Model
• A mathematical function representing the
relationship between aspects of the data
• Simplest model: Linear Regression
– WᵀX = Y, where
• X: vector representing the features of the data
• Y: scalar variable representing the target
• W: weight vector that specifies the slope of the equation;
this is a model parameter learned during model training
• Loss Function: captures the quality of the model
based on predictions and ground truth
• A Typical Linear Regression:
• Loss Function Representation:
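The two bullets above pointed to equation images in the original deck; a minimal reconstruction, assuming ordinary least-squares linear regression:
Model: Ŷ = WᵀX
Loss (mean squared error): L(W) = (1/n) * Σ_i (Yᵢ - WᵀXᵢ)²
i.e. the loss averages the squared difference between the ground truth and the prediction.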
Ways to Optimize Loss
• Gradient Descent
Performing gradient descent while minimizing loss
function
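A minimal sketch of gradient descent for the squared-error loss above (the function and its defaults are my own, for illustration):

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """Minimize the mean squared error of a linear model y ≈ X·w by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(iterations):
        predictions = X @ w
        # Gradient of (1/n) * sum((y - Xw)^2) with respect to w
        gradient = -(2.0 / n_samples) * X.T @ (y - predictions)
        w -= learning_rate * gradient
    return w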
Hyper-Parameter Optimization
• Parameter tuning
Parameter tuning at different Levels
Credit: Dato/Turi
• What is a hyper-parameter?
– Also known as nuisance parameters
– Values specified outside of the training process
E.g.
• Linear Regression
– Ridge Regression/LASSO: Weight for regularized term
• Logistic Regression
– Regularization parameters
• Decision Tree
– Desired Depth, split criteria
• SVM
– Penalty factor
– Kernel parameters, e.g. width for the Radial Basis Function kernel,
degree for the polynomial kernel
• Optimizing Loss for Linear learners
– Learning rate, Decay Rate
Why is it important?
• Determines how rigid the derived model is
(i.e. how feature-dependent the model is)
• Proper tuning helps reduce over-fitting
(i.e. it generalizes the model better)
• Helps improve the accuracy of the
trained model
Why is it a difficult task?
• Selecting an algorithm for data discovery
– Selecting the right algorithm is very important
• After deciding on an algorithm, the process
of parameter selection is time consuming
• Parameter values cannot be written as a
closed-form formula; they are the output of a black box
• There is only so much one can do at any
point in time, restricted by hardware
Algorithms for Parameter Tuning
• Grid Search
– Exhaustive parameter sweep over the hyper-parameter space
– E.g. SVM with some kernel:
train on the Cartesian product of the penalty constant (C) and the
kernel parameter (γ)
– Suffers from the curse of dimensionality, i.e. the parameter search
space grows exponentially, becoming sparse and difficult to search
• Random Search (implemented)
– Similar to Grid Search
– Randomized selection of the parameter values
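Both strategies above can be sketched with scikit-learn (a library choice of mine for illustration; the SVM parameter ranges are assumptions, not the talk's):

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid Search: exhaustive sweep over the Cartesian product of C and gamma
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
                    cv=3)
grid.fit(X, y)

# Random Search: sample a fixed budget of points from the same space
rand = RandomizedSearchCV(SVC(kernel='rbf'),
                          param_distributions={'C': loguniform(1e-2, 1e2),
                                               'gamma': loguniform(1e-3, 1e1)},
                          n_iter=20, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)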
• Gradient-Based Optimization
– Specialized algorithms for minimizing the
generalization error, e.g. in SVMs
• Bayesian Optimization (future)
– Sequential design for finding the global optimum of a
black-box function
– Based on building a statistical model mapping
hyper-parameter values to the evaluated objective
– Initially places a prior over the objective function and
then updates it to form a posterior distribution, which
determines the next query point
– Reference: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf
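A hedged sketch of the Bayesian approach using scikit-optimize's gp_minimize (a library choice of mine, not the talk's; the toy objective and ranges are illustrative):

from skopt import gp_minimize
from skopt.space import Real

# Toy objective: stand-in for "train a model with these hyper-parameters and return its loss"
def objective(params):
    learning_rate, l2 = params
    return (learning_rate - 0.05) ** 2 + (l2 - 0.001) ** 2

space = [Real(1e-4, 1e-1, prior='log-uniform', name='learning_rate'),
         Real(1e-6, 1e-2, prior='log-uniform', name='l2')]

# A Gaussian-process prior is placed over the objective; each evaluation updates
# the posterior, which an acquisition function uses to pick the next query point.
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print(result.x, result.fun)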
Apache Spark HPO
spark-submit --driver-memory 8g --executor-memory 8g --num-executors 2 --executor-cores 4 \
  --class ../modellearner.ExecutionManager --master local[*] \
  model-learner-0.0.1-jar-with-dependencies.jar --help
Experimental Model Learner.
For usage see below:
-a, --aloha-spec-file <arg> Specify the aloha spec file name with path
-c, --config-file <arg> Specify the config file name with path
-n, --negative-downSampling <arg> Specify negative down-sampling needed as a
percentage (default = 0.0)
-q, --query_data <arg> Specify false if there is no need to query for
data (useful if data has been queried earlier)
(default = true)
-s, --seed-value <arg> Specify seed value (default = 0)
-t, --top-n-values <arg> Specify the number for top n loss values (default = 10)
--help Show help message
*Currently supports only vw, can be extended to include others.
Config Format
properties:
  framework: 'vw'
  cmdPath: 'vw'
  vwBasic: 'vw --noop -k --cache_file'
  bitPrecision: '22'
  updatePolicies: 'loss_function:logistic;learning_rate:#0.01,0.02#;l1:#0.001,0.9#;l2:#0.0001,0.00#'
  manipulationPolicies: '-q JY -q JW -q IY -q IW -q YW -q JI'
  trainSplitPercentage: '0.8'
  iteration: '1'
  errorMetric: 'average loss'
  generateCache: 'true'
  sharedDir: '<xxxx>'
  dateRange: '2015-01-03 2015-01-04'
  region: 'yyy'
  matchType: '<match_type>'
  hdfsBaseDir: '<hdfs_path>'
  hdfsOutputDir: '<output_path>'
  learnerOutputHdfs: '<learner_output_path>'
Fun with Math
Whether using Random Search, Grid Search, or any Bayesian method,
how can one size the infrastructure for optimizing the learner,
i.e. how many iterations are needed? Let's see:
• Assume the chance of a single trial hitting the optimum is 0.05 and the desired guarantee is 0.95
• Probability of missing the optimum in n iterations: (1 - 0.05)^n
• Then the probability of success = 1 - (1 - 0.05)^n >= 0.95
• Guess?
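Working out the guess (my arithmetic; this step is not on the slide):
1 - (1 - 0.05)^n >= 0.95  =>  0.95^n <= 0.05  =>  n >= ln(0.05) / ln(0.95) ≈ 58.4
So roughly 59 random iterations give a 95% chance of landing in the best 5% of the search space.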
Future
• Extend it to support other frameworks such as
– R
– Lib-SVM
– Generalized vw implementation
• Support feature exploration
• Other forms of Adaptive Search instead of
Random Search
• Gradient-based hyper-parameter optimization through Reversible
Learning
– Reference: http://arxiv.org/pdf/1502.03492v3.pdf
Other Frameworks
• vw-hypersearch: https://github.com/JohnLangford/vowpal_wabbit/wiki/Using-vw-hypersearch
• Python-based Hyperopt
– No interaction or learning from previous runs
– Can't pause and resume
– Parallelization depends on MongoDB (opens up
another bag of worms)
• Auto-WEKA: built on top of Weka (difficult to customize
and support other frameworks)
• Spearmint: only supports Bayesian Optimization
– Difficult to customize to support other machine
learning frameworks
Example using HyperOpt
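The slide here was an image in the original deck; a hedged sketch of what a Hyperopt run looks like (the search space and objective are illustrative, not from the talk):

from hyperopt import fmin, tpe, hp, Trials

# Toy objective: stand-in for training a model and returning its validation loss
def objective(params):
    return (params['learning_rate'] - 0.05) ** 2 + params['l2']

space = {
    'learning_rate': hp.loguniform('learning_rate', -9, -1),  # roughly 1e-4 .. 0.37
    'l2': hp.loguniform('l2', -12, -4),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)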
Introducing VW
• An efficient, scalable, online machine
learning framework
• Also supports importance weighting,
multiple loss functions and other
optimization algorithms
• Supports dynamic generation of feature
interactions
• Test-set hold-out and early termination over
multiple passes
VW Scalability
• Out-of-core learning: no need to load all
the data into memory at once
• Applies feature hashing to convert feature
identities to a weight index using
MurmurHash3
For the curious mind: https://gist.github.com/cartazio/2903178
• Effective use of multiple cores
• Written in C++, avoiding the nuances of the JVM
Demo of vw
• vw file format
[Label] [Importance] [Tag]|Namespace Features |Namespace Features ... |Namespace Features
• Label: the real number we are trying to predict for this example
• Importance: (importance weight) a non-negative real number indicating
the relative importance of this example over the others
• Tag: a string that serves as an identifier for the example
• Namespace: an identifier of a source of information for the example
• Features: a sequence of whitespace-separated strings, each of which is
optionally followed by a float
• Note: vertical bar, colon, space, and newline are special characters in this format
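For concreteness, a hypothetical example line in this format (the label, tag, namespaces, and features are invented for illustration):

1 2.0 example_42|title machine learning optimization |tags vw:1.0 spark:0.5

Here the label is 1, the importance weight is 2.0, example_42 is the tag, and the two namespaces (title and tags) each carry whitespace-separated features with optional :value weights.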
Loss Functions
• Stack Overflow multiclass prediction
problem
Link to the problem: https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow
• Steps followed:
– pre-process CSV data
– convert it into VW format
– run learning and prediction
– Evaluate the model
• The data has many features, but for this demo
the following are used:
– title, body, tags
• Algorithm: OAA (One Against All); sample vw commands are sketched below
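A hedged sketch of the corresponding vw invocations (file names and the class count are assumptions; the flags are standard vw options):

# Train a One-Against-All model (assuming 5 Stack Overflow classes) on the converted data
vw -d train.vw --oaa 5 --loss_function logistic -f so_model.vw

# Score held-out data with the trained model
vw -d test.vw -t -i so_model.vw -p predictions.txt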
References
• Random Search for Hyper-parameter Optimization:
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
• Practical Bayesian Optimization of Machine Learning Algorithms:
https://dash.harvard.edu/handle/1/11708816
