Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validation - Professor Daniel Martin Katz + Professor Michael J Bommarito
1. Class 6
Overfitting, Underfitting, & Cross-validation
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
7. Overfitting occurs when a
statistical model or algorithm
captures the noise of the data
(as opposed to the signal)
access more at legalanalyticscourse.com
12. Why is generalization hard?
Learning, machine or otherwise, looks something like this:
! We are presented with a view of objects in the world.
! We encode aspects of these objects, e.g., colors, into “features.”
! We generalize from patterns in these features to statements about objects.
Example:
! We spend a summer on Michigan lakes and see many animals. All swans that we
see are white. We generalize from this sample to the statement that all swans are
white.
What went wrong? Mathematically speaking, we did not observe enough
variance in our observed sample; in fact, our observed variance for the color
feature was zero!
13. Underfitting
Zero variance in our observed sample led to a model with a constant
predicted value; this model underfits the true variance of swans.
Underfitting is, in essence, model simplification or ignorance of signal.
Underfit models may perform well on modal data, but they typically struggle
with lower-frequency or more complex cases.
Underfitting can occur for a number of reasons:
! The model is too simple for the actual system. Technically speaking, either the
model does not contain enough parameters or the functional forms are not capable of
spanning the true functions.
! The number of records or variance of the records does not provide the learning
process with enough information.
14. Underfitting
Let’s look at a simple example – fitting a quadratic equation with a linear
function.
Quadratic functions look like this:
y = a x^2 + b x + c
A function is therefore defined by supplying three parameters: a, b, and c.
To make this realistic, let’s add some simple N(0,1) random errors, giving us
the form:
y = a x^2 + b x + c + e
where e is distributed N(0,1).
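This data-generating process can be sketched in Python with NumPy. The specific parameter values (a=1, b=2, c=3), the sampling range, and the random seed are my assumptions for illustration; the slides do not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# Hypothetical true parameters of the quadratic
a, b, c = 1.0, 2.0, 3.0

# Sample x over the (-4, 4) range used later in the slides
x = np.linspace(-4, 4, 100)

# y = a x^2 + b x + c + e, with e ~ N(0, 1)
e = rng.normal(0.0, 1.0, size=x.shape)
y = a * x**2 + b * x + c + e
```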
16. Underfitting
What happens if we try to fit a model to this data? First, let’s start with a
simple linear function, i.e., linear regression.
Our linear form looks like this:
y = a x + b + e
A model is therefore defined by supplying two parameters: a and b.
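Fitting this two-parameter linear form can be sketched with `np.polyfit`; the data-generation step below repeats the earlier assumed quadratic (a=1, b=2, c=3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Least-squares fit of y = a x + b
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are dominated by the unmodeled x^2 term
mse_linear = np.mean((y - (slope * x + intercept)) ** 2)
```

Because x is symmetric around zero, the fitted slope lands near the true b, but the intercept absorbs the average of the x^2 term, and the residual error stays large no matter which a and b we choose.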
18. Underfitting
This linear model clearly does not capture the non-linear relationship
between x and y.
However, no combination of a and b will match the data across all x, since
the linear model is simply too simple to represent the quadratic relationship.
Linear models have too few parameters to capture non-linear structure; thus,
they will typically underfit non-linear models.
(figure: quadratic model fit to the same data)
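Refitting with the correct three-parameter quadratic form recovers estimates close to the assumed truth (a=1, b=2, c=3). A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Degree-2 least squares; np.polyfit returns coefficients
# highest power first: [a, b, c]
a_hat, b_hat, c_hat = np.polyfit(x, y, deg=2)

# Residual error now reflects only the N(0, 1) noise
mse_quad = np.mean((y - (a_hat * x**2 + b_hat * x + c_hat)) ** 2)
```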
19. Overfitting
Overfitting is the opposite of underfitting, and it occurs when a model
codifies noise into the structure.
Overfitting may occur for a number of reasons:
! Models that are much more complex than the underlying data, either in terms of
functional form or number of parameters.
! Learning that is too focused on minimizing the loss function for a single training
sample.
20. Overfitting
Let’s return to our quadratic example from before. As we discussed, our
quadratic data was generated by a model with three parameters: a, b, and c.
When we tried to explain the data with just two parameters, the resulting
model underfit the data and did a poor job.
When we tried to explain the data with three parameters, the resulting
model did an excellent job of fitting the data.
What happens if we try to explain the data with eight parameters, i.e., a
degree-7 polynomial?
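Fitting an eight-parameter (degree-7) polynomial to the same assumed data can be sketched as follows; the in-sample comparison against the quadratic fit illustrates why more parameters always look better on training data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Degree-2 (three parameters) vs. degree-7 (eight parameters)
fit2 = np.polyfit(x, y, deg=2)
fit7 = np.polyfit(x, y, deg=7)

mse2 = np.mean((y - np.polyval(fit2, x)) ** 2)
mse7 = np.mean((y - np.polyval(fit7, x)) ** 2)
# The extra parameters can only reduce in-sample error,
# even though they are being spent fitting noise
```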
21. Overfitting
First, let’s focus on the portion of data that we saw in our training set before
– the range where x lies between -4 and 4.
At first blush, it looks like we’ve done an excellent job. Compared to our
three-parameter quadratic fit, we have done an even better job of reducing the
sum of our squared residuals. Why not always use more parameters?
22. Overfitting
But what happens if we look outside of this (-4, 4) range? It turns out that
we’ve committed two common overfitting mistakes:
! Our model is much more complex than the underlying data. Quadratic relationships
are built on three parameters, whereas our model uses eight. When we minimized
our loss function, the extra five parameters were used to fit to noise, not signal!
! Our model was trained on a very narrow sample of the world. While we do an
excellent job of predicting values between -4 and 4, we do a very poor job outside of
this range.
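The extrapolation failure can be checked numerically by evaluating both fits outside the training range. A sketch, using the same assumed data as above and comparing against the noiseless truth:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

fit2 = np.polyfit(x, y, deg=2)
fit7 = np.polyfit(x, y, deg=7)

# Evaluate on points outside the (-4, 4) training range
x_out = np.linspace(5, 8, 50)
y_out = x_out**2 + 2 * x_out + 3  # noiseless truth for comparison

mse2_out = np.mean((y_out - np.polyval(fit2, x_out)) ** 2)
mse7_out = np.mean((y_out - np.polyval(fit7, x_out)) ** 2)
# The degree-7 fit's noise-driven high-order terms blow up
# outside the range it was trained on
```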
23. Generalizing safely
So what can we do to safely generalize? Two of the most common approaches
are regularization and cross-validation.
Regularization is …
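The slide leaves regularization undefined; as one common example (my choice of method and penalty strength, not specified in the deck), ridge regression adds a penalty on coefficient size to the least-squares objective, shrinking the eight polynomial coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

X = np.vander(x, 8)  # polynomial features: columns x^7, x^6, ..., x^0

def ridge(X, y, lam):
    """Solve min ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)     # plain least squares: no penalty
w_ridge = ridge(X, y, 10.0)  # penalized: coefficients shrink
```

Larger penalties pull the fit back toward simpler shapes, directly constraining the geometry of the solution; this is the sense in which regularization imposes constraints that cross-validation does not.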
24. Cross-validation
Cross-validation, like regularization, is meant to prevent the learning
process from codifying sample-specific noise as structure.
However, unlike regularization, cross-validation does not impose any
geometric constraints on the shape or “feel” of our learning solution, i.e.,
model.
Instead, it focuses on repeating the learning task on multiple samples of
training data, then evaluating the performance of these models on the
“held-out” or unseen data.
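The held-out evaluation idea can be sketched with a simple train/test split on the assumed quadratic data; the 70/30 split is my choice, not the deck's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Shuffle, then hold out the last 30 points as unseen data
idx = rng.permutation(x.size)
train, test = idx[:70], idx[70:]

fit = np.polyfit(x[train], y[train], deg=2)
mse_test = np.mean((y[test] - np.polyval(fit, x[test])) ** 2)
# mse_test estimates error on data the model never saw,
# so it is not flattered by noise-fitting
```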
25. Cross-validation: K-fold
The most common approach to cross-validation is to divide the training set of
data into K distinct partitions of equal size. K-1 of these partitions are then
used to learn a model, and the resulting model is used to predict the K-th
partition. This process is repeated K times, with each partition held out
exactly once; performance is then averaged across the K folds to estimate
out-of-sample error.
http://genome.tugraz.at/proclassify/help/pages/XV.html
http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english
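A minimal K-fold loop in plain NumPy (K=5 and the quadratic data parameters are my assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

K = 5
idx = rng.permutation(x.size)
folds = np.array_split(idx, K)  # K equal-size partitions

fold_mses = []
for k in range(K):
    test = folds[k]  # the k-th partition is held out
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    fit = np.polyfit(x[train], y[train], deg=2)
    fold_mses.append(np.mean((y[test] - np.polyval(fit, x[test])) ** 2))

# Average across folds to estimate out-of-sample error
cv_mse = np.mean(fold_mses)
```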
26. “Cross-validation is widely used to check model error by
testing on data not part of the training set. Multiple rounds
with randomly selected test sets are averaged together to
reduce variability of the cross-validation; high variability of
the model will produce high average errors on the test set.
One way of resolving the trade-off is to use mixture models
and ensemble learning. For example, boosting combines many
‘weak’ (high bias) models in an ensemble that has lower bias
than the individual models, while bagging combines ‘strong’
learners in a way that reduces their variance.”
http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
27. Legal Analytics
Class 6 - Overfitting, Underfitting, & Cross-Validation
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com

michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com

more content available at legalanalyticscourse.com