Example: Linear regression (housing prices)
Overfitting: If we have too many features, the learned hypothesis may fit the training set very well ($J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 \approx 0$), but fail to generalize to new examples (predict prices on new examples).
[Figure: three Price vs. Size plots for polynomial fits of degree d = 1 (underfits), d = 2 (fits well), and d = 4 (overfits).]
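To reproduce the three panels, here is a minimal Octave/MATLAB sketch (the data points are invented for illustration). With five training points, the degree-4 polynomial passes through every point exactly, which is precisely the overfit case:

% Invented (size, price) training points for illustration only.
sizes  = [1.0; 1.5; 2.0; 2.5; 3.0];
prices = [2.0; 2.6; 2.9; 3.2; 3.3];
xs = linspace(min(sizes), max(sizes), 100);
degrees = [1, 2, 4];
for k = 1:3
  p = polyfit(sizes, prices, degrees(k));   % least-squares fit of degree d
  subplot(1, 3, k);
  plot(sizes, prices, 'rx', xs, polyval(p, xs), '-');
  title(sprintf('d = %d', degrees(k)));
end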
Addressing overfitting:
[Figure: Price vs. Size with a highly wiggly, overfit curve.]
$x_1$ = size of house
$x_2$ = no. of bedrooms
$x_3$ = no. of floors
$x_4$ = age of house
$x_5$ = average income in neighborhood
$x_6$ = kitchen size
Overfitting is most likely to occur whenever we have a large no. of features and a
small no. of training examples.
Addressing overfitting:
Options:
1. Reduce number of features.
― Manually select which features to keep.
― Model selection algorithm (later in course).
2. Regularization.
― Keep all the features, but reduce magnitude/values of parameters $\theta_j$.
― Works well when we have a lot of features, each of which contributes a bit to predicting $y$.
Regularization.
Small values for parameters $\theta_0, \theta_1, \ldots, \theta_n$
― “Simpler” hypothesis
― Less prone to overfitting
Housing:
― Features: $x_1, x_2, \ldots, x_{100}$
― Parameters: $\theta_0, \theta_1, \ldots, \theta_{100}$
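To make “small parameter values give a simpler hypothesis” concrete, consider a worked example in the spirit of the lecture (the penalty weights of 1000 are illustrative). Heavily penalizing two parameters of a quartic,

$$\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2,$$

forces $\theta_3 \approx 0$ and $\theta_4 \approx 0$ at the minimum, so $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$ behaves essentially like a quadratic: a simpler, smoother hypothesis.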
Regularization.
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

[Figure: Price vs. Size of house; the regularized fit is a smooth curve rather than a wiggly, overfit one.]

$\lambda$ is the regularization parameter. It controls the tradeoff between minimizing the squared error term (fitting the training data well) and minimizing the parameters $\theta_j$ (keeping the hypothesis simple).
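As a concrete reading of this cost, here is a minimal Octave/MATLAB sketch (my own illustration, not from the slides). It assumes X is the m×(n+1) design matrix with a leading column of ones, theta is the vector [θ_0; …; θ_n], and, by the usual convention, θ_0 (theta(1) in 1-based indexing) is not regularized:

function J = regularizedCost(theta, X, y, lambda)
  m = length(y);
  err = X * theta - y;   % h_theta(x^(i)) - y^(i) for all examples at once
  J = (1 / (2 * m)) * (sum(err .^ 2) + lambda * sum(theta(2:end) .^ 2));
end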
In regularized linear regression, we choose $\theta$ to minimize

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

What if $\lambda$ is set to an extremely large value (perhaps too large for our problem, say $\lambda = 10^{10}$)?
- Algorithm works fine; setting $\lambda$ to be very large can’t hurt it.
- Algorithm fails to eliminate overfitting.
- Algorithm results in underfitting (fails to fit even the training data well).
- Gradient descent will fail to converge.
With $\lambda$ that large, the penalty term dominates the cost, driving $\theta_1, \theta_2, \ldots, \theta_n \approx 0$ and leaving $h_\theta(x) \approx \theta_0$. So the correct answer above is that the algorithm results in underfitting.

[Figure: Price vs. Size of house; the fit is a flat horizontal line at $\theta_0$, which underfits the data.]
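To see the underfitting numerically, here is a small sketch with invented data; it uses the regularized normal equation, one standard closed form for regularized linear regression (not taken from these slides):

% Invented (size, price) data with quadratic features.
sizes  = [1.0; 1.5; 2.0; 2.5; 3.0];
prices = [2.0; 2.6; 2.9; 3.2; 3.3];
X = [ones(5, 1), sizes, sizes .^ 2];   % design matrix: [1, x, x^2]
L = eye(3);  L(1, 1) = 0;              % do not penalize theta_0
for lambda = [0, 1e10]
  theta = (X' * X + lambda * L) \ (X' * prices);
  fprintf('lambda = %g: theta = [%g %g %g]\n', lambda, theta);
end
% With lambda = 1e10, theta(2) and theta(3) are ~0 and theta(1) ~ mean(prices),
% so the fit collapses to the flat line h(x) = theta_0.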
Advanced optimization

function [jVal, gradient] = costFunction(theta)
  jVal = [code to compute J(θ)];
  gradient(1) = [code to compute ∂J(θ)/∂θ_0];
  gradient(2) = [code to compute ∂J(θ)/∂θ_1];
  gradient(3) = [code to compute ∂J(θ)/∂θ_2];
  ...
  gradient(n+1) = [code to compute ∂J(θ)/∂θ_n];
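For context, this costFunction is then handed to one of Octave/MATLAB's built-in advanced optimizers such as fminunc. A minimal sketch of the call (assuming n, the number of features, is already defined):

options = optimset('GradObj', 'on', 'MaxIter', 100);  % we supply the gradient; cap iterations
initialTheta = zeros(n + 1, 1);                       % theta_0 through theta_n
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);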
Editor's Notes
High variance: a high-order polynomial can represent a large variety of possible hypotheses, and the training dataset is not large enough to constrain it to a good one.
Reducing the no. of features results in a loss of information.
Regularization results in a smoother, less complicated hypothesis.
Model selection includes an automatic choice of the regularization parameter.