Model Selection Techniques

Module 5
SHIWANI GUPTA
Model selection/diagnosis techniques
Cross Validation
Learning Curve
Hyperparameter Optimization/Tuning
Grid and Randomized Search
Validation Curve

What?
A variety of models of different complexity, how should we pick the right one?
Select a proper level of flexibility for the model
Not best but good enough model
Model selection different from Model assessment
Model development Pipeline

Split
Fit candidate models on the training set
Evaluate and select them on the validation set
Report performance of the final model on the test set
[Figure: the data are split into Train, Validation and Test sets. Train and Validation are used for model selection; Test is used for model assessment.]
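The three-way split can be sketched in scikit-learn as follows. This is a minimal illustration, not the deck's own code: the synthetic dataset, the 60/20/20 proportions and the two candidate models are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Roughly 60% train, 20% validation, 20% test (proportions are an assumption).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Hypothetical candidate models of different flexibility.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Model selection: fit on the training set, compare on the validation set.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
best_name = max(val_scores, key=val_scores.get)

# Model assessment: the test set is used once, for the selected model only.
test_score = candidates[best_name].score(X_test, y_test)
print(best_name, val_scores, test_score)
```

Keeping the test set out of the comparison is what separates model selection from model assessment.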
3

Types
In Sample Error
Probabilistic measures, used with linear regression (LR) and logistic regression (LoR)
Akaike Information Criterion
Bayesian Information Criterion
Minimum Description Length
Structural Risk Minimization
Extra Sample Error
Resampling
Random train/test split
Cross Validation
Bootstrap
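As a small illustration of the probabilistic criteria, AIC and BIC can be computed from a fitted model's maximized log-likelihood using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters and n the number of samples. The log-likelihoods and parameter counts below are hypothetical numbers, not results from any real fit.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2 * log_likelihood

# Hypothetical fits of a simple and a more flexible model on n = 100 samples.
n = 100
simple_fit = {"log_likelihood": -120.0, "n_params": 3}
flexible_fit = {"log_likelihood": -118.5, "n_params": 8}

aic_simple = aic(**simple_fit)
aic_flexible = aic(**flexible_fit)
bic_simple = bic(**simple_fit, n_samples=n)
bic_flexible = bic(**flexible_fit, n_samples=n)
print(aic_simple, aic_flexible, bic_simple, bic_flexible)
```

In this made-up case the flexible model fits slightly better, but both criteria prefer the simple model because the gain in likelihood does not pay for the five extra parameters.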

CV Types
• Train/Test Split: uses random sampling https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
• k-Fold CV: resampling; the folds are drawn by stochastic sampling
• Shuffle Split CV: randomly samples the entire training data during each iteration
• LOOCV: Taken to the extreme, k may be set to the total number of observations in the dataset, so that each observation is given a chance to be held out of the dataset. This is called leave-one-out cross-validation. The sampling is deterministic.
• Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given class outcome value.
• Repeated: The k-fold cross-validation procedure is repeated n times, where, importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
• Nested: k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation.
[Figure: spectrum from k-fold to leave-one-out. As k increases, computation and variance increase; as k decreases, variance decreases.]
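Most of the variants listed above have a corresponding splitter class in scikit-learn. The sketch below only counts how many train/test splits each one produces on a toy dataset of 20 rows; the data and the specific settings (5 folds, 3 repeats, 25% test size) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import (KFold, LeaveOneOut, RepeatedKFold,
                                     ShuffleSplit, StratifiedKFold)

# Toy data: 20 observations, 15 of class 0 and 5 of class 1 (assumption).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "ShuffleSplit": ShuffleSplit(n_splits=5, test_size=0.25, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=0),
}

# Number of model fits each scheme would require on this dataset.
n_splits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}
for name, count in n_splits.items():
    print(f"{name}: {count} splits")
```

Leave-one-out needs one fit per observation (20 here), and repeated k-fold needs k times the number of repeats, which is where the extra computation comes from.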

K Fold Procedure
This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The
first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as a hold-out or test data set.
   b. Take the remaining groups as a training data set.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the average of the model evaluation scores.
The results of a k-fold cross-validation run are often summarized with the mean of the model skill scores. It is also good
practice to include a measure of the variance of the skill scores, such as the standard deviation or standard error.
[Figure: K-fold. In each iteration one fold is the test set and the remaining folds form the training set.]
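The numbered procedure can be written out as an explicit loop. This is a sketch under assumptions: the synthetic dataset, the logistic regression model and k = 5 are illustrative choices, and accuracy is used as the evaluation score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Steps 1 and 2: shuffle, then split into k = 5 groups.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):          # step 3: each group in turn
    model = LogisticRegression(max_iter=1000)       # a fresh model per fold
    model.fit(X[train_idx], y[train_idx])           # fit on the other k - 1 groups
    scores.append(model.score(X[test_idx], y[test_idx]))  # keep only the score

# Step 4: summarize with the mean, plus a measure of spread.
scores = np.array(scores)
mean = scores.mean()
std = scores.std(ddof=1)               # sample standard deviation
sem = std / np.sqrt(len(scores))       # standard error of the mean
print(f"accuracy: {mean:.3f} +/- {std:.3f} (standard error {sem:.3f})")
```

scikit-learn's `cross_val_score` wraps the same loop in a single call; the explicit version is shown only to mirror the steps.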

LOOCV
• LOOCV should not be used in some cases, such as when you have a very large dataset or a computationally expensive model to evaluate.
• It has the maximum computational cost: one model must be created and evaluated for each example in the training dataset.
• Appropriate for small datasets (fewer than about 1000 samples)
• k = N, i.e. the number of folds equals the number of observations
• Leave-p-out CV is its generalization (LOOCV is the case p = 1)
https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
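A short sketch of LOOCV on a small dataset, where its cost is still manageable. The Iris dataset and the k-nearest-neighbor model are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, LeavePOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # 150 samples, small enough for LOOCV

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)        # k = N: one model per observation

# Each held-out "fold" is a single sample, so every score is either 0.0 or 1.0.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=loo)
loo_accuracy = scores.mean()
print(f"{n_fits} fits, LOOCV accuracy = {loo_accuracy:.3f}")

# Leave-p-out generalizes this; the number of splits grows combinatorially.
n_fits_p2 = LeavePOut(p=2).get_n_splits(X)
print(f"leave-2-out would need {n_fits_p2} fits")
```

The jump from 150 fits to several thousand for p = 2 shows why leave-p-out is rarely run exhaustively.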

Stratified kFold CV
• Each fold contains roughly the same proportions of the two types of class labels.
• Stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation.
• A variant is RepeatedStratifiedKFold.
https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
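The effect of stratification is easiest to see on an imbalanced label vector. The sketch below compares the share of the minority class in each test fold under plain and stratified 5-fold splitting; the 90/10 class ratio is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 90 negatives and 10 positives (assumption).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

def positive_rate_per_fold(cv):
    """Fraction of positive labels in each test fold produced by `cv`."""
    return [float(y[test_idx].mean()) for _, test_idx in cv.split(X, y)]

plain = positive_rate_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positive_rate_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("plain k-fold:     ", plain)
print("stratified k-fold:", stratified)
```

With stratification every test fold holds 2 of the 10 positives, so each fold mirrors the overall 10% rate, whereas plain k-fold leaves the per-fold rate to chance.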

K?
• The choice of k is usually 5 or 10, but there is no formal rule.
• There is a bias-variance trade-off associated with the choice of k in k-fold cross-validation.
• Large k means less bias towards overestimating the true expected error (as training folds will be closer to the total dataset) but higher variance and higher running time (as you are getting closer to the limit case: Leave-One-Out CV).
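The arithmetic behind this trade-off is simple: each fit trains on n - n/k observations, and k fits are needed. A tiny sketch, assuming a hypothetical dataset of 1000 rows and values of k that divide it evenly:

```python
def kfold_summary(n_samples: int, k: int) -> dict:
    """Training-set size per fit and number of fits for k-fold CV.

    Assumes k divides n_samples evenly, so every fold has n_samples // k rows.
    """
    fold_size = n_samples // k
    return {"k": k, "train_size": n_samples - fold_size, "model_fits": k}

n_samples = 1000  # hypothetical dataset size
summaries = [kfold_summary(n_samples, k) for k in (2, 5, 10, n_samples)]
for row in summaries:
    print(row)
```

Going from k = 5 to k = 10 moves the training size from 800 to 900 rows at the price of twice as many fits, while k = n (leave-one-out) trains on 999 rows but needs 1000 fits.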

Learning Curve
• It is a plot of model learning performance over experience or time.
• It can be used to diagnose problems with learning, such as an underfit or overfit model.
• It can be used to diagnose whether the training and validation datasets are suitably representative.
• The metric used to evaluate learning could be maximizing, e.g. classification accuracy, or minimizing, e.g. mean squared error.
• It is more common to use a score that is minimizing, such as loss or error, whereby better scores (smaller numbers) indicate more learning and a value of 0.0 indicates that the training set was learned perfectly and no mistakes were made.
• It can be evaluated on the training set to give an idea of how well the model is “learning.” It can also be evaluated on a hold-out validation set that is not part of the training dataset. Evaluation on the validation set gives an idea of how well the model is “generalizing.”
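To make the idea concrete, the sketch below records training and validation loss after every epoch of a hand-written logistic regression trained by gradient descent. Everything here is an assumption for illustration: the synthetic data, the 300/100 split, the learning rate and the number of epochs. Plotting the two lists against the epoch number gives the learning curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (assumption).
n, d = 400, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Hold out the last 100 rows as a validation set.
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

def log_loss(w, X, y):
    """Mean binary cross-entropy of a logistic model with weights w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

w = np.zeros(d)
learning_rate = 0.1
train_curve, val_curve = [], []
for epoch in range(100):
    # One full-batch gradient descent step on the training loss.
    p = 1.0 / (1.0 + np.exp(-(X_train @ w)))
    w -= learning_rate * X_train.T @ (p - y_train) / len(y_train)
    # Record both losses: "learning" (train) and "generalizing" (validation).
    train_curve.append(log_loss(w, X_train, y_train))
    val_curve.append(log_loss(w, X_val, y_val))

print(f"final train loss {train_curve[-1]:.3f}, final val loss {val_curve[-1]:.3f}")
```

Note that scikit-learn's `learning_curve` function plots score against training-set size rather than epochs; the per-epoch version is used here because it matches the "experience or time" axis described above.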

Diagnosing Model Behaviour
Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
An underfit model can be identified from the learning curve of the training loss only.
An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot.
This indicates that the model is capable of further learning and possible further improvements and that the training process was halted
prematurely.

Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.
The problem with overfitting is that the more specialized the model becomes to the training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.
This often occurs if the model is trained for too long.
A plot of learning curve shows overfitting if:
• The plot of training loss continues to decrease with experience.
• The plot of validation loss decreases to a point and begins increasing again.
The turning point (minimum) in validation loss may be the point at which training could be halted, as experience after that point shows the dynamics of overfitting.
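Locating the point at which validation loss stops improving is the basis of early stopping. The helper below is a hypothetical sketch (the function name, the `patience` parameter and the loss values are all made up for illustration): it scans a recorded validation curve and returns the epoch with the lowest loss, giving up once no improvement has been seen for `patience` epochs.

```python
def best_epoch(val_losses, patience=3):
    """Index of the lowest validation loss seen before patience runs out."""
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss      # new best: remember it
        elif epoch - best_idx >= patience:
            break                                  # no improvement for too long
    return best_idx

# Hypothetical curves showing the overfitting pattern:
# training loss keeps falling while validation loss turns upward.
train_losses = [0.90, 0.70, 0.55, 0.45, 0.38, 0.32, 0.27, 0.23, 0.20]
val_losses = [0.95, 0.78, 0.66, 0.60, 0.58, 0.61, 0.65, 0.70, 0.76]

stop_at = best_epoch(val_losses)
print(f"halt training at epoch {stop_at} (val loss {val_losses[stop_at]})")
```

In practice the model weights from that epoch would be kept and later epochs discarded.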

A good fit is identified by training and validation losses that decrease to a point of stability with a minimal gap between the two final loss values.
The loss of the model will almost always be lower on the training dataset than the validation dataset. This means that we should expect some gap
between the train and validation loss learning curves. This gap is referred to as the “generalization gap”.
A plot of learning curve shows a good fit if:
• The plot of training loss decreases to a point of stability.
• The plot of validation loss decreases to a point of stability and has a small gap with the
training loss.
Continued training of a good fit will likely lead to an overfit.

HyperParameter
Parameters are learned automatically while hyperparameters are set manually to help guide the
learning process.
E.g. parameters: support vectors in SVM, coefficients in linear regression (LR) and logistic regression (LoR)
Hyperparameter optimization, hyperparameter tuning or hyperparameter search: searching for a set of hyperparameters that results in the best performance of a model on a dataset.
To speed up optimization: set the “n_jobs” argument to the number of cores on your machine (in scikit-learn, n_jobs=-1 uses all available cores).
It is desirable to select a minimum subset of model hyperparameters to search or tune.
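The distinction can be seen directly on a scikit-learn estimator: hyperparameters are passed to the constructor before training, while parameters only exist after `fit`. The dataset and the value C=0.1 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us, set before training.
model = LogisticRegression(C=0.1, max_iter=1000)
hyperparameters = model.get_params()
print("C =", hyperparameters["C"])

# Parameters: learned from the data during fit.
model.fit(X, y)
coef, intercept = model.coef_, model.intercept_
print("coefficients:", coef.round(3), "intercept:", intercept.round(3))
```

Tuning means searching over values such as `C`; the coefficients are then re-learned for every candidate value.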

HyperParameters of Models
Logistic Regression
    solver in ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
    penalty in ['none', 'l1', 'l2', 'elasticnet']
    C in [100, 10, 1.0, 0.1, 0.01]
    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
K Nearest Neighbor
    n_neighbors in [1 to 21]
    metric in ['euclidean', 'manhattan', 'minkowski']
    https://medium.com/@mohtedibf/in-depth-parameter-tuning-for-knn-4c0de485baf6
Support Vector Machine
    kernel in ['linear', 'poly', 'rbf', 'sigmoid']
    C in [100, 10, 1.0, 0.1, 0.001]
    gamma in [1, 0.1, 0.01, 0.001, 0.0001]
    https://yunhaocsblog.wordpress.com/2014/07/27/the-effects-of-hyperparameters-in-svm/
Decision Tree
    criterion in ['gini', 'entropy']
    max_depth in [1, 2, 3, 4, 5, 6, 7, 8]
    min_samples_split in [2, 3]
    https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Random Forest
    max_features in ['sqrt', 'log2']
    n_estimators in [10, 100, 1000]
    https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
GBM/XGB
    learning_rate in [0.001, 0.01, 0.1]
    n_estimators in [10, 100, 1000]
    subsample in [0.5, 0.7, 1.0]
    max_depth in [3, 7, 9]
    https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
    https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

Grid Search
Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
In scikit-learn this is implemented by GridSearchCV, which evaluates every grid position with cross-validation.
https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e
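A minimal grid search sketch with scikit-learn's `GridSearchCV`. The Iris dataset, the SVM model and the reduced grid (12 combinations) are assumptions chosen to keep the example fast; they are not the deck's recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The grid: every combination of these values is evaluated (2 * 3 * 2 = 12).
param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1.0, 10],
    "gamma": [0.1, 0.01],
}

# Each grid position is scored with stratified 5-fold cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(f"{n_candidates} candidates evaluated")
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

The cost grows multiplicatively with each added hyperparameter, which is why a small subset of hyperparameters is usually searched.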

Randomized Search
Define a search space as a bounded domain of hyperparameter values and randomly sample points in
that domain.
In scikit-learn this is implemented by RandomizedSearchCV, which evaluates each sampled point with cross-validation.
https://www.section.io/engineering-education/random-search-hyperparameters/
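A matching sketch for randomized search with `RandomizedSearchCV`. Instead of a fixed grid, continuous hyperparameters are drawn from distributions; the log-uniform ranges, the budget of 10 samples and the Iris/SVM setup are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A bounded domain: lists are sampled uniformly, distributions via .rvs().
param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
}

# n_iter fixes the budget: only 10 random points are tried, each with 5-fold CV.
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=10,
                            cv=5, random_state=0)
search.fit(X, y)

n_sampled = len(search.cv_results_["params"])
print(f"{n_sampled} random candidates evaluated")
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

Unlike grid search, the number of fits is set by `n_iter` and does not grow as more hyperparameters are added to the search space.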

Model Selection
Validation Curve – Diagnosing Model Behaviour
A validation curve is typically drawn between some parameter of the model and the model’s score.
Two curves are present in a validation curve – one for the training set score and one for the cross-validation score.
By default, the validation_curve function in the scikit-learn library performs 5-fold cross-validation (3-fold in versions before 0.22).
• Ideally, we would want both the validation curve and the training curve to look as similar as possible.
• If both scores are low, the model is likely to be underfitting. This means either the model is too simple or it is informed by too few features. It could also be the case that the model is regularized too much.
• If the training curve reaches a high score relatively quickly and the validation curve is lagging behind, the model is overfitting. This means the model is very complex, or it could simply mean there is too little data.
• We would want the value of the parameter where the validation score is high and the training and validation curves are closest to each other.
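A short sketch of computing a validation curve with scikit-learn's `validation_curve`. The choice of a decision tree, its `max_depth` hyperparameter and the Iris dataset are assumptions for illustration; `cv=5` is passed explicitly so the result does not depend on the library default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
depths = np.arange(1, 9)  # candidate values of the hyperparameter

# One row per depth, one column per cross-validation fold.
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
best_depth = int(depths[val_mean.argmax()])

for depth, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={depth}: train={tr:.3f}  validation={va:.3f}")
print("depth with the highest validation score:", best_depth)
```

Plotting `train_mean` and `val_mean` against `depths` gives the two curves described above: both low at small depths (underfitting), and a training score near 1.0 with a lagging validation score at large depths (overfitting).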

Interpreting Validation Curve
https://towardsdatascience.com/validation-curve-explained-plot-the-influence-of-a-single-hyperparameter-1ac4864deaf8

SA5
Explain Cross Validation and its variants with appropriate diagram.
Compare Grid and Randomized Search along with CV variant.
State hyperparameter description of any 5 ML models.
Define Learning Curve and explain interpretation with example.
Define Validation Curve and explain inference with example.
