Linear regression
Regularized regression and logistic regression
Supervised ML, basics
Outline:
● k-NN: intuitive non-parametric classifier
● Naive Bayes: relatively simple parametric classifier & intro to Bayes terminology
● Linear regression: the simplest regression model (after k-NN regression)
● Regularized linear regression: introduction of an extremely powerful hyperparameter
Purpose of regression
Succinctly, predicting numbers rather than labels, for example:
● Predicting the severity or strength, rather than presence or absence
○ e.g. how many inches of rain, rather than the presence or absence of rain
● Predicting the future for a quantity of interest
○ Resource allocation for future growth, stock valuation, etc
● Matching a data sample to a known quantity to help interpretation
○ e.g. predicting clinical scores on mobility based on obscure wearable device signals
Linear regression
Finding the best coefficients ak in
y = a1x1 + a2x2 + a3x3 + ...
One of the simplest regression models, but one with a lot of subtle variations we will discuss (a minimal fitting sketch follows below):
● Intercepts
● Extensible with complex features (e.g. polynomial powers)
● Variations in error metrics
● How to handle overfitting (regularization)
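Here is a minimal sketch (not from the slides) of fitting the coefficients ak with scikit-learn; the data and "true" coefficients are invented purely for illustration.

```python
# Minimal sketch: fitting y = a1*x1 + a2*x2 + a3*x3 (+ intercept) by least squares.
# The data and "true" coefficients below are synthetic, purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                            # 100 samples, features x1..x3
true_a = np.array([2.0, -1.0, 0.5])
y = X @ true_a + 0.3 + rng.normal(scale=0.1, size=100)   # intercept 0.3 plus noise

model = LinearRegression(fit_intercept=True)             # the "intercepts" variation
model.fit(X, y)
print(model.coef_, model.intercept_)                     # recovered a_k and intercept
```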
Linear regression, more powerful than it may appear
With more complex features, linear regression is arbitrarily powerful
● A high-frequency-trading colleague used regularized linear regression a great deal, but with complex features (see the polynomial-features sketch below)
Linear regression can be used to provide a graded classification
○ Binary classification is equivalent to a 0-1 regression with a threshold
○ More on this when discussing logistic regression
○ And multiclass classification can be performed with a one-hot encoding scheme
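A hedged sketch of what "complex features" can look like: polynomial expansion of the inputs. The cubic target and degree below are illustrative assumptions, not the colleague's actual setup.

```python
# Sketch: expanding the input with polynomial terms, then fitting a linear model.
# The cubic target and degree=3 choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=200)

model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # close to 0.5*8 - 2 = 2.0 despite the "linear" model
```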
Ordinary least squares
Minimizing the sum-of-squares error
Advantages: fast (O(np²), where p is the number of features)
Disadvantages: includes all features, even irrelevant ones
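A sketch of OLS via the normal equations on synthetic data (an assumption, not slide material): solving the p×p system is roughly where the O(np²) cost comes from, and every feature gets a coefficient, relevant or not.

```python
# Ordinary least squares via the normal equations: solve (X^T X) w = X^T y.
# Forming X^T X costs about n*p^2 operations, hence the O(np^2) claim above.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(w, 3))   # the irrelevant second feature still gets a small nonzero weight
```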
Ridge regression
Regularization can be used to simplify feature selection within a linear regression model
The cost is the squared error plus λ times the sum of the squared coefficient values:
cost = Σ(y − ŷ)² + λ Σ ak²
Note: λ = 0 is ordinary linear regression
As λ increases, more feature selection occurs
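A minimal Ridge sketch on synthetic data; in scikit-learn the slide's λ is the alpha parameter.

```python
# Ridge regression: squared error plus lambda * (sum of squared coefficients).
# In scikit-learn, lambda is called alpha. Data is synthetic and illustrative.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.2, size=100)

for lam in [0.0, 1.0, 100.0]:            # lambda = 0 recovers ordinary linear regression
    w = Ridge(alpha=lam).fit(X, y).coef_
    print(lam, np.round(w, 3))           # coefficients shrink toward zero as lambda grows
```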
Lasso regression
Lasso regression uses the absolute value of the coefficients as the penalty: cost = Σ(y − ŷ)² + λ Σ |ak|. This creates a sparser set of features (many coefficients become exactly zero as λ increases), making it more useful for feature selection than Ridge regression.
Disadvantage: it requires coordinate descent to fit (vs. Ridge, which has the same time complexity as OLS)
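A matching Lasso sketch on the same synthetic setup. Note that scikit-learn scales the squared-error term by 1/(2n), so its alpha is not numerically identical to the slide's λ, but the qualitative behaviour is the same.

```python
# Lasso: squared error plus lambda * (sum of absolute coefficient values),
# fit by coordinate descent. Same synthetic data as the Ridge sketch.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.2, size=100)

for lam in [0.01, 0.1, 1.0]:
    w = Lasso(alpha=lam).fit(X, y).coef_
    print(lam, np.round(w, 3))   # irrelevant coefficients are driven to exactly 0.0
```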
Why does Lasso lead to sparser models?
The cost surface for Lasso makes it more likely that the lowest-cost point will have one or more features at exactly zero.
It is similar to why, when someone throws a die at you, you're more likely to be hit first by a corner or an edge than by a face.
Generalized linear model (GLM) regression
Linear models are strictly of this form
Y = b0 + b1X1 + b2X2 + ... + bkXk
But one problem with this is an infinite range for Y, which sometimes makes little
sense. To fix this, GLMs wrap the linear output in a nonlinear function
Y = g (b0 + b1X1 + b2X2 + ... + bkXk)
Why is that useful? For one version of g, we can force Y to be between 0 and 1, which can make regression useful for classification.
Logistic regression
Now we have a function that can predict a 0 or a 1 in extreme cases, but a graded response when uncertain.
Why is this useful?
Y = g(b0 + b1X1 + b2X2 + ... + bkXk) + e
where g(x) = 1 / (1 + e^(-x))
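A tiny sketch of the logistic link itself; the intercept, slope, and inputs below are made up for illustration.

```python
# The logistic link g squashes the linear predictor into the range (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5                    # hypothetical intercept and slope
x = np.array([0.5, 2.0, 5.0])         # hypothetical inputs
print(sigmoid(b0 + b1 * x))           # ~0.04, ~0.27, ~0.97: graded, bounded responses
```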
Logistic regression example (Wikipedia)
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
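A sketch of fitting this kind of example with scikit-learn. The hours and pass/fail values below are invented to mimic the setup (20 students, 0 to 6 hours); they are not the actual Wikipedia data.

```python
# Fit pass/fail (0/1) against hours studied; the data below is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75,
                  3.0, 3.25, 3.5, 4.0, 4.25, 4.5, 4.75, 5.0, 5.5, 6.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)
print(model.predict_proba([[1.0], [3.0], [5.0]])[:, 1])   # P(pass) rises with hours studied
```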
Linear regression summary
Ridge and Lasso are types of regularized linear regression
Regularized regression provides a hyperparameter to tune the complexity of the model through automated feature selection
Recall, it is critical to reduce the number of features, because n >> (# features)² is needed for good learning
Lasso produces sparser models, but takes longer to converge (which in practice isn't much of an issue)
Generalized linear models extend linear regression in powerful ways: e.g. logistic regression
Regularized logistic regression is a very powerful and easy-to-understand classifier (a sketch follows below)
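A closing sketch (an assumption, not slide material): regularized logistic regression in scikit-learn, where C = 1/λ and the penalty selects ridge-like (l2) or lasso-like (l1) regularization.

```python
# Regularized logistic regression: l2 shrinks coefficients, l1 zeroes many of them.
# Synthetic classification data; C = 1/lambda, so smaller C means stronger regularization.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)
print("nonzero coefficients (l2):", int(np.sum(l2.coef_ != 0)))
print("nonzero coefficients (l1):", int(np.sum(l1.coef_ != 0)))
```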
An aside: Why should n > features²?
Why the rough rule of thumb that the number of samples should be significantly greater than the number of features squared?
Remember how classification is about grouping similar points in a high-dimensional space (drawing boundaries between those groups).
Most ways of defining groups in n-dimensional spaces require O(n²) parameters.
Let's try out a few...
Rectangles and hyperrectangles - O(n) parameters

Description | Dimensions | Parameters per boundary | Number of boundaries | Total parameters
Rectangle | 2 | 1 | 4 | 4
Right rectangular prism | 3 | 1 | 6 | 6
Hyperrectangle | n | 1 | 2n | 2n
Quadrilaterals, hexahedra, ... - O(n²) parameters

Description | Dimensions | Parameters per boundary | Number of boundaries | Total parameters
Quadrilateral | 2 | 2 (defines a line) | 4 | 8
Hexahedron | 3 | 3 (defines a plane) | 6 | 18
... | n | n | 2n | 2n²
Naive Bayes - O(n) parameters

Description | Dimensions | Parameters per dimension (μ and σ) | Total parameters
Ellipse | 2 | 2 | 4
Ellipsoid | 3 | 2 | 6
... | n | 2 | 2n
Limits of Naive Bayes graphically
Naive Bayes treats features independently. Often features have strong dependencies which should be modelled.
Naive Bayes Gaussians are forced to align with the axes.
But, for example, classifying male vs female by height and weight needs Gaussians that 'tilt' off-axis.
This requires a different functional form.
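A small illustration of this limit; the synthetic correlated data and the use of QDA as the "tilted Gaussian" model are assumptions, not from the slides.

```python
# Axis-aligned Gaussian Naive Bayes vs a full-covariance Gaussian classifier (QDA)
# on strongly correlated features, loosely like height and weight.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(4)
cov = [[1.0, 0.8], [0.8, 1.0]]                          # correlated ("tilted") features
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X1 = rng.multivariate_normal([1.0, 1.0], cov, size=300)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(300), np.ones(300)]

print("Naive Bayes accuracy:", GaussianNB().fit(X, y).score(X, y))
print("Full-covariance (QDA) accuracy:", QuadraticDiscriminantAnalysis().fit(X, y).score(X, y))
```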
What it takes to fit an arbitrary Gaussian shape
A mean for each feature, just like Naive Bayes.
And a covariance matrix (instead of just a list of variances).
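A minimal sketch of estimating those two quantities from data (synthetic here).

```python
# A mean per feature plus a full (symmetric) covariance matrix, estimated from data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)  # 500 samples, 3 features

mu = X.mean(axis=0)                # one mean per feature, as in Naive Bayes
Sigma = np.cov(X, rowvar=False)    # full 3x3 covariance, not just the diagonal variances
print(mu.shape, Sigma.shape)       # (3,) and (3, 3)
```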
With generalized Gaussians - O(n²) parameters

Description | Dimensions | Parameters for μ | Parameters for Σ (unique, because symmetric) | Total parameters
1D Gaussian | 1 | 1 | 1 | 2
2D Gaussian | 2 | 2 | 4 (3 unique) | 6
n-D Gaussian | n | n | n² (roughly half unique) | O(n²)
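Counting only the unique parameters (d for the mean plus d(d+1)/2 unique entries of the symmetric covariance), here is a quick check of the O(d²) growth; d is used for the number of features to avoid clashing with n as the sample count.

```python
# Unique parameters of a full-covariance Gaussian in d dimensions:
# d means + d*(d+1)/2 unique covariance entries, which grows like O(d^2).
def gaussian_param_count(d: int) -> int:
    return d + d * (d + 1) // 2

for d in (1, 2, 3, 10, 100):
    print(d, gaussian_param_count(d))   # 2, 5, 9, 65, 5150
```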
To recap on the n > features² intuition...
For arbitrary polyhedra, it takes O(n²) parameters.
For arbitrary Gaussians, it takes O(n²) parameters.
...
And there are serious overfitting problems if you have more parameters than samples (note: for linear equations, n samples can generally be fit perfectly with n features).
But Naive Bayes only needed O(n)? That's why Naive Bayes is known to work well on small data sets, but to fall behind more flexible models when more data is available.
