Regularization
Ch. 4: Training Models
Dr. Mostafa A. Elhosseini
Agenda
≡ Regularization
▪ Intro…
≡ Linear regression is not a great model here
▪ This is underfitting
▪ Known as high bias
▪ Bias: we have a strong preconception that there should be a linear fit
≡ Quadratic function
▪ Works well
▪ Just right
≡ High-order polynomial
▪ Performs perfectly on the training data
▪ High variance
▪ Not able to generalize to unseen data
▪ If we have too many features
$y = \theta_0 + \theta_1 x$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
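As an illustration of these three regimes, the following is a minimal sketch that fits the three hypotheses above (degree 1, 2, and 4) to synthetic quadratic data; the data-generating function and noise level are illustrative assumptions, not from the original slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data generated by a quadratic function plus noise (illustrative)
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, 100)

for degree in (1, 2, 4):   # underfit / just right / extra capacity that can chase noise
    model = make_pipeline(PolynomialFeatures(degree, include_bias=False),
                          LinearRegression())
    model.fit(X, y)
    mse = mean_squared_error(y, model.predict(X))
    print(f"degree {degree}: training MSE = {mse:.3f}")
```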
Overfitting
≡ If you perform high-degree Polynomial
Regression, you will likely fit the training data
much better than with plain Linear Regression
≡ The high-degree Polynomial Regression model severely overfits the training data, while the linear model underfits it
≡ Overfitting is a phenomenon where a machine learning model performs well on the training data but fails to perform well on test data
≡ Overfitting happens when a model learns the details and noise in the training data to the extent that it negatively impacts the model's performance on unseen data
▪ i.e., random fluctuations are picked up and learned as concepts
Overfitting
≡ The model that will generalize best in this case is the quadratic
model.
▪ It makes sense since the data was generated using a quadratic model, but…
▪ In general you won’t know what function generated the data, so…
ꙮ How can you tell that your model is overfitting or underfitting the
data?
ꙮ How can you decide how complex your model should be?
▪ How can you tell that your model is overfitting or
underfitting the data?
❶ Plotting the hypothesis is one way to decide – you may look for a "curvy, wiggly" fit
▪ But this does not always work
▪ You often have lots of features – it becomes harder to plot the data and to visualize which features to keep and which to throw out
❷ Cross-validation metrics: If a model performs well on the training data
but generalizes poorly according to the cross-validation metrics, then
your model is overfitting.
▪ If it performs poorly on both, then it is underfitting.
❸ Learning curves: Another way is to look at the learning curves:
▪ these are plots of the model’s performance on the training set and the validation set
as a function of the training set size
▪ To generate the plots, simply train the model several times on different-sized subsets of the training set (a minimal sketch follows below)
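A minimal sketch of such a learning-curve plot, written against Scikit-Learn and Matplotlib; the specific model, the 80/20 split, and the synthetic data are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    # Hold out a fixed validation set, then train on subsets of growing size
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train) + 1):
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="validation")
    plt.xlabel("Training set size"); plt.ylabel("RMSE"); plt.legend(); plt.show()

# Example usage on synthetic quadratic data (illustrative)
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, 100)
plot_learning_curves(LinearRegression(), X, y)
```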
▪ Addressing overfitting
≡ If you have lots of features and little data – overfitting can be a problem
≡ A simple way to regularize a polynomial model is to reduce the polynomial degree
≡ Reduce the number of features:
▪ Manually select which features to keep
▪ Feature engineering – reduce the number of features
▪ But we lose information
▪ Why don't we just stop adding features once we have an acceptable model?
▪ You do not know which features to drop, and even worse, it may turn out that every feature is fairly informative, so dropping some features will likely hurt the algorithm's performance – the answer is regularization
≡ Regularization: keep all features, but reduce the magnitude of the parameters $\theta$
Regularization
≡ Regularization is the process of constraining, or shrinking, the coefficient estimates towards zero
≡ Large coefficients on polynomial terms lead to overfitting
▪ Having large coefficients can be seen as evidence of memorizing the data rather than learning from it
▪ For example, if there is noise in the training dataset, that noise will cause the model to put more weight on the higher-degree coefficients, and this leads to overfitting
≡ With an increasing number of features, a machine learning model can fit your data well – but if you add too many features, you will be subject to overfitting
(Do not be overly meticulous in listing specifications and details.)
Recall: cost function
$J = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
≡ Your parameters can be updated in any way, as long as it lowers the MSE value – but take care…
▪ The larger your parameters become, the higher the chance your model overfits the data
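A minimal NumPy sketch of this cost, assuming the design matrix X already contains a leading column of ones so that the linear hypothesis is simply X @ theta:

```python
import numpy as np

def mse_cost(theta, X, y):
    """J(theta) = 1/(2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(y)
    residuals = X @ theta - y          # h_theta(x_i) - y_i for a linear hypothesis
    return (residuals @ residuals) / (2 * m)
```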
How to regularize?
• Penalize and make some of the $\theta$ parameters really small
• For example, $\theta_3$ and $\theta_4$
• $J = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$
• The only way to minimize this function is to make $\theta_3$ and $\theta_4$ very small
• So we end up with $\theta_3$ and $\theta_4$ being close to zero
• Now we approximately have a quadratic model
• $y = \theta_0 + \theta_1 x + \theta_2 x^2$ instead of the full $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
How to regularize?
• Smaller parameter values correspond to a simpler hypothesis
• You effectively get rid of some of the terms
• A simpler hypothesis is less prone to overfitting
• But we do not know in advance which terms are the high-order ones
• How do we choose which ones to shrink?
• It is better to shrink all parameters
▪ Main Goal
$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
≡ Minimize the cost function while restricting the parameters from becoming too large – thanks to the regularization term
▪ $\lambda$ is a constant that controls the weight and importance of the regularization term
▪ It controls the trade-off between the two goals:
▪ Fit the training data very well
▪ Keep the parameters small
▪ $n$ is the number of features
▪ By convention you do not penalize $\theta_0$ – the regularized sum runs from $\theta_1$ onwards
▪ Minimizing the cost function consists of reducing both the MSE term and the regularization term
≡ When a parameter is updated to minimize the MSE and it starts becoming large, it increases the value of the cost function through the regularization term; as a result it is penalized and pulled back towards a small value
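A minimal NumPy sketch of this regularized cost, assuming X already includes the bias column so that theta[0] is the intercept (which is not penalized):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = MSE term + (lambda / 2m) * sum_{j>=1} theta_j^2."""
    m = len(y)
    residuals = X @ theta - y
    mse_term = (residuals @ residuals) / (2 * m)
    reg_term = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 is excluded
    return mse_term + reg_term
```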
Regularization
• If $\lambda$ is very large, we end up penalizing all the parameters, so all the parameters end up being close to zero (except $\theta_0$)
• It is as if we got rid of all the terms – underfitting
• Too biased: $h_\theta = \theta_0$
• $\lambda$ should be chosen carefully – not too big
Regularization with Gradient Descent
• $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
• 𝜃0 is not regularized
• $\theta_0 := \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$
• $\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \quad j = 1, 2, \dots, n$
• $\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, \quad j = 1, 2, \dots, n$
• The factor $\left(1 - \alpha\frac{\lambda}{m}\right)$ is going to be slightly less than 1
• Usually the learning rate is small and $m$ is large
• This factor is often around 0.95 to 0.99
• i.e., we shrink $\theta_j$ a little on every update – the term on the right is the same as in ordinary gradient descent
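A minimal NumPy sketch of these update rules (batch gradient descent with the shrinkage applied to every parameter except $\theta_0$); the learning rate, lambda, and iteration count are illustrative assumptions:

```python
import numpy as np

def regularized_gradient_descent(X, y, alpha=0.1, lam=1.0, n_iters=1000):
    """Batch GD for regularized linear regression; X must include a bias column of ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        gradient = (X.T @ (X @ theta - y)) / m      # (1/m) * sum_i (h - y) * x_j
        reg = (lam / m) * theta
        reg[0] = 0.0                                # theta_0 is not regularized
        theta = theta - alpha * (gradient + reg)
    return theta
```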
Regularization
≡ For a linear model, regularization is typically achieved by
constraining the weights of the model.
≡ We will now look at Ridge Regression, Lasso Regression, and Elastic
Net, which implement three different ways to constrain the weights.
Ridge Regression
≡ Also called Tikhonov regularization
≡ A regularization term equal to $\lambda \sum_{j=1}^{n} \theta_j^2$ is added to the cost function.
≡ Note that the regularization term should only be added to the cost function during
training
≡ Once the model is trained, you want to evaluate the model’s performance using the
unregularized performance measure
▪ It is quite common for the cost function used during training to be different from the performance
measure used for testing
▪ A good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective
▪ A good example of this is a classifier trained using a cost function such as the log loss but
evaluated using precision/recall.
≡ It is important to scale the data (e.g., using a StandardScaler) before performing Ridge
Regression, as it is sensitive to the scale of the input features. This is true of most
regularized models.
▪ Ridge Regression
≡ As with Linear Regression, we can
perform Ridge Regression either by
computing a closed-form equation or by
performing Gradient Descent.
≡ The pros and cons are the same
▪ How to perform Ridge Regression with Scikit-Learn using a closed-form solution (see the sketch below)
▪ Using Stochastic Gradient Descent
▪ The penalty hyperparameter sets the type of regularization term to use.
▪ Specifying "l2" indicates that you want SGD to add a regularization term to
the cost function equal to half the square of the ℓ2 norm of the weight
vector:
▪ this is simply Ridge Regression.
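The original code screenshots are not reproduced here; the following is a minimal sketch of both options, where the synthetic data, the alpha values, and the solver choice are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Illustrative training data
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 1, 100)

# Closed-form solution
ridge_reg = Ridge(alpha=1.0, solver="cholesky")
ridge_reg.fit(X, y)
print(ridge_reg.predict([[1.5]]))

# Stochastic Gradient Descent with an l2 penalty: this is simply Ridge Regression
sgd_reg = SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000)
sgd_reg.fit(X, y)
print(sgd_reg.predict([[1.5]]))
```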
Polynomial Regression with Ridge
regularization
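The corresponding figure is not reproduced here; below is a minimal sketch of how such a regularized polynomial model could be assembled with a Pipeline, where the polynomial degree, alpha value, and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative quadratic data with noise
rng = np.random.default_rng(42)
X = 6 * rng.random((50, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, 50)

# Expand to high-degree polynomial features, scale them, then apply Ridge
poly_ridge = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
poly_ridge.fit(X, y)
print(poly_ridge.predict([[1.5]]))
```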
▪ Lasso Regression
≡ Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso
Regression)
≡ like Ridge Regression, it adds a regularization term to the cost function, but it uses the
ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm
≡ $J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \lvert \theta_i \rvert$
Lasso Regression
≡ An important characteristic of Lasso Regression is that it tends to
completely eliminate the weights of the least important features
≡ Lasso Regression automatically performs feature selection and
outputs a sparse model (i.e., with few nonzero feature weights)
≡ Note that you could instead use an SGDRegressor(penalty="l1").
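A minimal Scikit-Learn sketch of this sparsity effect; the synthetic data (where only the first feature matters) and the alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data: only the first of five features actually drives y
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(0, 0.5, 100)

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)   # weights of the unimportant features tend to be exactly 0
```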
Elastic Net
• Elastic Net is a middle ground between Ridge Regression and Lasso
Regression.
• The regularization term is a simple mix of both Ridge and Lasso’s
regularization terms, and you can control the mix ratio 𝑟.
• When 𝑟 = 0, Elastic Net is equivalent to Ridge Regression, and when
𝑟 = 1, it is equivalent to Lasso Regression
• $J(\theta) = \mathrm{MSE}(\theta) + r\alpha \sum_{i=1}^{n} \lvert \theta_i \rvert + \frac{1-r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2$
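A minimal Scikit-Learn sketch; note that the l1_ratio parameter corresponds to the mix ratio r above, and the synthetic data, alpha, and l1_ratio values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative training data
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 1, 100)

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio plays the role of r
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))
```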
▪ when should you use Linear Regression, Ridge, Lasso, or
Elastic Net?
≡ It is almost always preferable to have at least a little bit of regularization,
so generally you should avoid plain Linear Regression.
≡ Ridge is a good default, but if you suspect that only a few features are
actually useful, you should prefer Lasso or Elastic Net since they tend to
reduce the useless features’ weights down to zero as we have discussed.
≡ In general, Elastic Net is preferred over Lasso since Lasso may behave
erratically when the number of features is greater than the number of
training instances or when several features are strongly correlated.
▪ Early stopping
≡ A very different way to regularize iterative learning algorithms such
as Gradient Descent is to stop training as soon as the validation error
reaches a minimum – early stopping
≡ The figure shows a complex model (in this case a high-degree Polynomial Regression model) being trained using Batch Gradient Descent.
≡ As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set.
≡ However, after a while the validation error stops decreasing and
actually starts to go back up.
≡ This indicates that the model has started to overfit the training data.
≡ With early stopping you just stop training as soon as the validation error
reaches the minimum. It is such a simple and efficient regularization
technique that Geoffrey Hinton called it a “beautiful free lunch.”
Early stopping
▪ With Stochastic and Mini-batch
Gradient Descent, the curves
are not so smooth, and it may
be hard to know whether you
have reached the minimum or
not.
▪ One solution is to stop only
after the validation error has
been above the minimum for
some time (when you are
confident that the model will
not do any better), then roll
back the model parameters to
the point where the validation
error was at a minimum.
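A minimal sketch of this rollback strategy with Scikit-Learn, using one common pattern (warm_start so that each fit() call runs one more epoch, and deepcopy to keep the best model seen so far); the model, data, and hyperparameters are illustrative assumptions:

```python
from copy import deepcopy
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative quadratic data with noise
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, 200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# warm_start=True: each call to fit() continues from the current weights (one extra epoch)
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       learning_rate="constant", eta0=0.0005)

best_val_rmse = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)                        # one more epoch of training
    val_rmse = np.sqrt(mean_squared_error(y_val, sgd_reg.predict(X_val)))
    if val_rmse < best_val_rmse:                         # remember the best model so far
        best_val_rmse = val_rmse
        best_model = deepcopy(sgd_reg)
```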