Learning to Optimize

- Pramit Choudhary
Introduction
• What is a Machine Learning Model?
• What is Hyper-parameter Optimization?
• HPO pipeline
• Questions
Machine Learning Model
• A mathematical function representing the
relationship between aspects of the data
• Simplest model: Linear Regression
– WᵀX = Y, where
• X: vector representing the features of the data
• Y: scalar variable representing the target
• W: weight vector that specifies the slope of the equation;
this is a model parameter learned during model training
• Loss Function: captures the quality of the model
based on predictions and ground truth
• A Typical Linear Regression:
• Loss Function Representation:
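The two bullets above pointed to equation images in the original deck; a minimal reconstruction, assuming ordinary least-squares linear regression:
Model: Ŷ = WᵀX
Loss (mean squared error): L(W) = (1/n) * Σ_i (Yᵢ - WᵀXᵢ)²
i.e. the loss averages the squared difference between the ground truth and the prediction.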
Ways to Optimize Loss
• Gradient Descent
Performing gradient descent while minimizing loss
function
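A minimal sketch of gradient descent for the squared-error loss above (the function and its defaults are my own, for illustration):

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, iterations=1000):
    """Minimize the mean squared error of a linear model y ≈ X·w by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(iterations):
        predictions = X @ w
        # Gradient of (1/n) * sum((y - Xw)^2) with respect to w
        gradient = -(2.0 / n_samples) * X.T @ (y - predictions)
        w -= learning_rate * gradient
    return w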
Hyper-Parameter Optimization
• Parameter tuning
Parameter tuning at different Levels
Credit: Dato/Turi
• What is a hyper-parameter?
– Also known as nuisance parameters
– Values specified outside of the training process
E.g.
• Linear Regression
– Ridge Regression/LASSO: Weight for regularized term
• Logistic Regression
– Regularization parameters
• Decision Tree
– Desired Depth, split criteria
• SVM
– Penalty factor
– Kernel parameters, e.g. width for the Radial Basis Function kernel,
degree for the polynomial kernel
• Optimizing Loss for Linear learners
– Learning rate, Decay Rate
Why is it important?
• Determines how rigid the derived model is
(i.e. how feature-dependent the model is)
• Proper tuning helps reduce over-fitting
(i.e. it generalizes the model better)
• Helps improve the accuracy of the
trained model
Why is it a difficult task?
• Selecting an algorithm for data discovery
– Selecting the right algorithm is very important
• After deciding on an algorithm, the process
of parameter selection is time consuming
• Parameter values cannot be written as a
closed-form formula; they are the output of a black box
• There is only so much one can do at any
point in time, restricted by hardware
Algorithms for Parameter Tuning
• Grid Search
– Exhaustive parameter sweep over the hyper-parameter space
– E.g. SVM with some kernel:
train on the Cartesian product of the penalty constant (C) and the
kernel parameter (γ)
– Suffers from the curse of dimensionality, i.e. the parameter search
space grows exponentially, becoming sparse and difficult to search
• Random Search (implemented)
– Similar to Grid Search
– Randomized selection of the parameter values
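Both strategies above can be sketched with scikit-learn (a library choice of mine for illustration; the SVM parameter ranges are assumptions, not the talk's):

from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid Search: exhaustive sweep over the Cartesian product of C and gamma
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
                    cv=3)
grid.fit(X, y)

# Random Search: sample a fixed budget of points from the same space
rand = RandomizedSearchCV(SVC(kernel='rbf'),
                          param_distributions={'C': loguniform(1e-2, 1e2),
                                               'gamma': loguniform(1e-3, 1e1)},
                          n_iter=20, cv=3, random_state=0)
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)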
• Gradient-Based Optimization
– Specialized algorithms for minimizing the
generalization error, e.g. in SVMs
• Bayesian Optimization (future)
– Sequential design for finding the global optimum of a
black-box function
– Based on building a statistical model mapping
hyper-parameter values to the evaluated objective
– Initially places a prior over the objective function and
then updates it to form a posterior distribution, which
determines the next query point
– Reference: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf
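A hedged sketch of the Bayesian approach using scikit-optimize's gp_minimize (a library choice of mine, not the talk's; the toy objective and ranges are illustrative):

from skopt import gp_minimize
from skopt.space import Real

# Toy objective: stand-in for "train a model with these hyper-parameters and return its loss"
def objective(params):
    learning_rate, l2 = params
    return (learning_rate - 0.05) ** 2 + (l2 - 0.001) ** 2

space = [Real(1e-4, 1e-1, prior='log-uniform', name='learning_rate'),
         Real(1e-6, 1e-2, prior='log-uniform', name='l2')]

# A Gaussian-process prior is placed over the objective; each evaluation updates
# the posterior, which an acquisition function uses to pick the next query point.
result = gp_minimize(objective, space, n_calls=25, random_state=0)
print(result.x, result.fun)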
Apache Spark HPO
spark-submit --driver-memory 8g --executor-memory 8g --num-executors 2 --executor-cores 4 \
  --class ../modellearner.ExecutionManager --master local[*] \
  model-learner-0.0.1-jar-with-dependencies.jar --help
Experimental Model Learner.
For usage see below:
-a, --aloha-spec-file <arg> Specify the aloha spec file name with path
-c, --config-file <arg> Specify the config file name with path
-n, --negative-downSampling <arg> Specify negative down-sampling needed as a
percentage (default = 0.0)
-q, --query_data <arg> Specify false if there is no need to query for
data (useful if data has been queried earlier)
(default = true)
-s, --seed-value <arg> Specify seed value (default = 0)
-t, --top-n-values <arg> Specify the number for top n loss values (default = 10)
--help Show help message
*Currently supports only vw, can be extended to include others.
Config Format
properties:
  framework: 'vw'
  cmdPath: 'vw'
  vwBasic: 'vw --noop -k --cache_file'
  bitPrecision: '22'
  updatePolicies: 'loss_function:logistic;learning_rate:#0.01,0.02#;l1:#0.001,0.9#;l2:#0.0001,0.00#'
  manipulationPolicies: '-q JY -q JW -q IY -q IW -q YW -q JI'
  trainSplitPercentage: '0.8'
  iteration: '1'
  errorMetric: 'average loss'
  generateCache: 'true'
  sharedDir: '<xxxx>'
  dateRange: '2015-01-03 2015-01-04'
  region: 'yyy'
  matchType: '<match_type>'
  hdfsBaseDir: '<hdfs_path>'
  hdfsOutputDir: '<output_path>'
  learnerOutputHdfs: '<learner_output_path>'
Fun with Math
Whether using Random Search, Grid Search, or any Bayesian method,
how can one size the infrastructure for optimizing the learner,
i.e. how many iterations are needed? Let's see:
• Assume the chance of a single trial hitting the optimum is 0.05 and the desired guarantee is 0.95
• Probability of missing the optimum in n iterations: (1 - 0.05)^n
• Then the probability of success = 1 - (1 - 0.05)^n >= 0.95
• Guess?
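Working out the guess (my arithmetic; this step is not on the slide):
1 - (1 - 0.05)^n >= 0.95  =>  0.95^n <= 0.05  =>  n >= ln(0.05) / ln(0.95) ≈ 58.4
So roughly 59 random iterations give a 95% chance of landing in the best 5% of the search space.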
Future
• Extend it to support other frameworks such as
– R
– Lib-SVM
– Generalized vw implementation
• Support feature exploration
• Other forms of Adaptive Search instead of
Random Search
• Gradient-based hyper-parameter optimization through Reversible
Learning
– Reference: http://arxiv.org/pdf/1502.03492v3.pdf
Other Frameworks
• vw-hypersearch: https://github.com/JohnLangford/vowpal_wabbit/wiki/Using-vw-hypersearch
• Python-based Hyperopt
– No interaction or learning from previous runs
– Can't pause and resume
– Parallelization depends on MongoDB (opens up
another bag of worms)
• Auto-WEKA: built on top of Weka (difficult to customize
and support other frameworks)
• Spearmint: only supports Bayesian Optimization
– Difficult to customize to support other machine
learning frameworks
Example using HyperOpt
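The slide here was an image in the original deck; a hedged sketch of what a Hyperopt run looks like (the search space and objective are illustrative, not from the talk):

from hyperopt import fmin, tpe, hp, Trials

# Toy objective: stand-in for training a model and returning its validation loss
def objective(params):
    return (params['learning_rate'] - 0.05) ** 2 + params['l2']

space = {
    'learning_rate': hp.loguniform('learning_rate', -9, -1),  # roughly 1e-4 .. 0.37
    'l2': hp.loguniform('l2', -12, -4),
}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)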
Introducing VW
• An efficient, scalable, online machine
learning framework
• Also supports importance weighting,
multiple loss functions and other
optimization algorithms
• Supports dynamic generation of feature
interactions
• Test-set hold-out and early termination over
multiple passes
VW Scalability
• Out-of-core learning: no need to load all
the data into memory at once
• Applies feature hashing to convert feature
identities to a weight index using
MurmurHash3
For the curious mind: https://gist.github.com/cartazio/2903178
• Effective use of multiple cores
• Written in C++, avoiding the nuances of the JVM
Demo of vw
• vw file format
[Label] [Importance] [Tag]|Namespace Features |Namespace Features ... |Namespace Features
• Label: the real number we are trying to predict for this example
• Importance: (importance weight) a non-negative real number indicating
the relative importance of this example over the others
• Tag: a string that serves as an identifier for the example
• Namespace: an identifier of a source of information for the example
• Features: a sequence of whitespace-separated strings, each of which is
optionally followed by a float
• Note: vertical bar, colon, space, and newline are special characters in this format
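For concreteness, a hypothetical example line in this format (the label, tag, namespaces, and features are invented for illustration):

1 2.0 example_42|title machine learning optimization |tags vw:1.0 spark:0.5

Here the label is 1, the importance weight is 2.0, example_42 is the tag, and the two namespaces (title and tags) each carry whitespace-separated features with optional :value weights.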
Loss Functions
• Stack Overflow multiclass prediction
problem
Link to the problem: https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow
• Steps followed:
– pre-process CSV data
– convert it into VW format
– run learning and prediction
– Evaluate the model
• The data has many features, but for this demo
the following are used:
– title, body, tags
• Algorithm: OAA (One Against All); sample vw commands are sketched below
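A hedged sketch of the corresponding vw invocations (file names and the class count are assumptions; the flags are standard vw options):

# Train a One-Against-All model (assuming 5 Stack Overflow classes) on the converted data
vw -d train.vw --oaa 5 --loss_function logistic -f so_model.vw

# Score held-out data with the trained model
vw -d test.vw -t -i so_model.vw -p predictions.txt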
References
• Random Search for Hyper-parameter Optimization:
http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
• Practical Bayesian Optimization of Machine Learning Algorithms:
https://dash.harvard.edu/handle/1/11708816
