Legal Analytics Course - Class 6 - Overfitting, Underfitting, & Cross-Validation - Professor Daniel Martin Katz + Professor Michael J Bommarito
1. Class 6
Overfitting, Underfitting, & Cross-validation
Legal Analytics
Professor Daniel Martin Katz
Professor Michael J Bommarito II
legalanalyticscourse.com
7. Overfitting occurs when a
statistical model or algorithm
captures the noise of the data
(as opposed to the signal)
access more at legalanalyticscourse.com
12. Why is generalization hard?
Learning, machine or otherwise, looks something like this:
! We are presented with a view of objects in the world.
! We encode aspects of these objects, e.g., colors, into “features.”
! We generalize from patterns in these features to statements about objects.
Example:
! We spend a summer on Michigan lakes and see many animals. All swans that we
see are white. We generalize from this sample to the statement that all swans are
white.
What went wrong? Mathematically speaking, we did not observe enough
variance in our observed sample; in fact, our observed variance for the color
feature was zero!
13. Underfitting
Zero variance in our observed sample led to a model with a constant
predicted value; this model underfits the true variance of swans.
Underfitting is, in essence, model simplification or ignorance of signal.
Underfit models may perform well on modal data, but they typically struggle
with lower-frequency or more complex cases.
Underfitting can occur for a number of reasons:
! The model is too simple for the actual system. Technically speaking, either the
model does not contain enough parameters or the functional forms are not capable of
spanning the true functions.
! The number of records or variance of the records does not provide the learning
process with enough information.
14. Underfitting
Let’s look at a simple example – fitting a quadratic equation with a linear
function.
Quadratic functions look like this:
y = a x^2 + b x + c
A function is therefore defined by supplying three parameters: a, b, and c.
To make this realistic, let’s add some simple N(0,1) random errors, giving us
the form:
y = a x^2 + b x + c + e
where e is distributed N(0,1).
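This data-generating process can be sketched in Python with NumPy. The specific parameter values (a=1, b=2, c=3), the sampling range, and the random seed are my assumptions for illustration; the slides do not fix them:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# Hypothetical true parameters of the quadratic
a, b, c = 1.0, 2.0, 3.0

# Sample x over the (-4, 4) range used later in the slides
x = np.linspace(-4, 4, 100)

# y = a x^2 + b x + c + e, with e ~ N(0, 1)
e = rng.normal(0.0, 1.0, size=x.shape)
y = a * x**2 + b * x + c + e
```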
16. Underfitting
What happens if we try to fit a model to this data? First, let’s start with a
simple linear function, i.e., linear regression.
Our linear form looks like this:
y = a x + b + e
A model is therefore defined by supplying two parameters: a and b.
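Fitting this two-parameter linear form can be sketched with `np.polyfit`; the data-generation step below repeats the earlier assumed quadratic (a=1, b=2, c=3):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Least-squares fit of y = a x + b
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are dominated by the unmodeled x^2 term
mse_linear = np.mean((y - (slope * x + intercept)) ** 2)
```

Because x is symmetric around zero, the fitted slope lands near the true b, but the intercept absorbs the average of the x^2 term, and the residual error stays large no matter which a and b we choose.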
18. Underfitting
This linear model clearly does not capture the non-linear relationship
between x and y.
However, no combination of a and b will match the data across all x, since
the linear model is simply too simple to represent the quadratic relationship.
Linear models have too few parameters to capture non-linear structure; thus,
they will typically underfit non-linear models.
(figure: quadratic model fit to the same data)
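Refitting with the correct three-parameter quadratic form recovers estimates close to the assumed truth (a=1, b=2, c=3). A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Degree-2 least squares; np.polyfit returns coefficients
# highest power first: [a, b, c]
a_hat, b_hat, c_hat = np.polyfit(x, y, deg=2)

# Residual error now reflects only the N(0, 1) noise
mse_quad = np.mean((y - (a_hat * x**2 + b_hat * x + c_hat)) ** 2)
```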
19. Overfitting
Overfitting is the opposite of underfitting, and it occurs when a model
codifies noise into the structure.
Overfitting may occur for a number of reasons:
! Models that are much more complex than the underlying data, either in terms of
functional form or number of parameters.
! Learning that is too focused on minimizing the loss function for a single training
sample.
20. Overfitting
Let’s return to our quadratic example from before. As we discussed, our
quadratic data was generated by a model with three parameters: a, b, and c.
When we tried to explain the data with just two parameters, the resulting
model underfit the data and did a poor job.
When we tried to explain the data with three parameters, the resulting
model did an excellent job of fitting the data.
What happens if we try to explain the data with eight parameters, i.e., a
degree-7 polynomial?
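Fitting an eight-parameter (degree-7) polynomial to the same assumed data can be sketched as follows; the in-sample comparison against the quadratic fit illustrates why more parameters always look better on training data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Degree-2 (three parameters) vs. degree-7 (eight parameters)
fit2 = np.polyfit(x, y, deg=2)
fit7 = np.polyfit(x, y, deg=7)

mse2 = np.mean((y - np.polyval(fit2, x)) ** 2)
mse7 = np.mean((y - np.polyval(fit7, x)) ** 2)
# The extra parameters can only reduce in-sample error,
# even though they are being spent fitting noise
```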
21. Overfitting
First, let’s focus on the portion of data that we saw in our training set before
– the range where x lies between -4 and 4.
At first blush, it looks like we’ve done an excellent job. Compared to our
three-parameter quadratic fit, we have done an even better job of reducing the
sum of our squared residuals. Why not always use more parameters?
22. Overfitting
But what happens if we look outside of this (-4, 4) range? It turns out that
we’ve committed two common overfitting mistakes:
! Our model is much more complex than the underlying data. Quadratic relationships
are built on three parameters, whereas our model uses eight. When we minimized
our loss function, the extra five parameters were used to fit to noise, not signal!
! Our model was trained on a very narrow sample of the world. While we do an
excellent job of predicting values between -4 and 4, we do a very poor job outside of
this range.
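The extrapolation failure can be checked numerically by evaluating both fits outside the training range. A sketch, using the same assumed data as above and comparing against the noiseless truth:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

fit2 = np.polyfit(x, y, deg=2)
fit7 = np.polyfit(x, y, deg=7)

# Evaluate on points outside the (-4, 4) training range
x_out = np.linspace(5, 8, 50)
y_out = x_out**2 + 2 * x_out + 3  # noiseless truth for comparison

mse2_out = np.mean((y_out - np.polyval(fit2, x_out)) ** 2)
mse7_out = np.mean((y_out - np.polyval(fit7, x_out)) ** 2)
# The degree-7 fit's noise-driven high-order terms blow up
# outside the range it was trained on
```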
23. Generalizing safely
So what can we do to safely generalize? Two of the most common approaches
are regularization and cross-validation.
Regularization is …
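The slide leaves regularization undefined; as one common example (my choice of method and penalty strength, not specified in the deck), ridge regression adds a penalty on coefficient size to the least-squares objective, shrinking the eight polynomial coefficients toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

X = np.vander(x, 8)  # polynomial features: columns x^7, x^6, ..., x^0

def ridge(X, y, lam):
    """Solve min ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

w_ols = ridge(X, y, 0.0)     # plain least squares: no penalty
w_ridge = ridge(X, y, 10.0)  # penalized: coefficients shrink
```

Larger penalties pull the fit back toward simpler shapes, directly constraining the geometry of the solution; this is the sense in which regularization imposes constraints that cross-validation does not.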
24. Cross-validation
Cross-validation, like regularization, is meant to prevent the learning
process from codifying sample-specific noise as structure.
However, unlike regularization, cross-validation does not impose any
geometric constraints on the shape or “feel” of our learning solution, i.e.,
model.
Instead, it focuses on repeating the learning task on multiple samples of
training data, then evaluating the performance of these models on the
“held-out” or unseen data.
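The held-out evaluation idea can be sketched with a simple train/test split on the assumed quadratic data; the 70/30 split is my choice, not the deck's:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

# Shuffle, then hold out the last 30 points as unseen data
idx = rng.permutation(x.size)
train, test = idx[:70], idx[70:]

fit = np.polyfit(x[train], y[train], deg=2)
mse_test = np.mean((y[test] - np.polyval(fit, x[test])) ** 2)
# mse_test estimates error on data the model never saw,
# so it is not flattered by noise-fitting
```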
25. Cross-validation: K-fold
The most common approach to cross-validation is to divide the training set of
data into K distinct partitions of equal size. K-1 of these partitions are then
used to learn a model, and the resulting model is used to predict the K-th
partition. This process is repeated K times, with each partition held out
exactly once; performance is then averaged across the K folds to estimate
out-of-sample error.
http://genome.tugraz.at/proclassify/help/pages/XV.html
http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english
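A minimal K-fold loop in plain NumPy (K=5 and the quadratic data parameters are my assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)
y = x**2 + 2 * x + 3 + rng.normal(size=x.shape)  # assumed quadratic data

K = 5
idx = rng.permutation(x.size)
folds = np.array_split(idx, K)  # K equal-size partitions

fold_mses = []
for k in range(K):
    test = folds[k]  # the k-th partition is held out
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    fit = np.polyfit(x[train], y[train], deg=2)
    fold_mses.append(np.mean((y[test] - np.polyval(fit, x[test])) ** 2))

# Average across folds to estimate out-of-sample error
cv_mse = np.mean(fold_mses)
```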
26. “Cross-validation is widely used to check model error by
testing on data not part of the training set. Multiple rounds
with randomly selected test sets are averaged together to
reduce variability of the cross-validation; high variability of
the model will produce high average errors on the test set.
One way of resolving the trade-off is to use mixture models
and ensemble learning. For example, boosting combines many
‘weak’ (high bias) models in an ensemble that has lower bias
than the individual models, while bagging combines ‘strong’
learners in a way that reduces their variance.”
http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29
27. Legal Analytics
Class 6 - Overfitting, Underfitting, & Cross-Validation
daniel martin katz
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @computational
site | danielmartinkatz.com

michael j bommarito
blog | ComputationalLegalStudies
corp | LexPredict
twitter | @mjbommar
site | bommaritollc.com

more content available at legalanalyticscourse.com