The document discusses linear and logistic regression models, including regularized forms that address overfitting. It describes linear regression for modeling continuous outcomes, logistic regression for classification with binary outcomes, and regularization techniques such as Lasso (L1) and Ridge (L2), which penalize model complexity. Examples illustrate overfitting and the effects of the different regularization approaches.
3-6. Linear Regression
Goal: find a linear function of X that best yields Y.
X = “covariate” = “feature” = “predictor” = “regressor” = “independent variable”
Y = “response variable” = “outcome” = “dependent variable”
Regression: estimate the regression function r(x) = E[Y | X = x], i.e. the expected value of Y given that the random variable X is equal to some specific value, x.
Linear Regression (univariate version): find 𝛽0, 𝛽1 such that r(x) ≈ 𝛽0 + 𝛽1·x.
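As an illustration not on the slides, here is a minimal sketch of estimating 𝛽0 and 𝛽1 by ordinary least squares with NumPy; the data values are made up for the example.

import numpy as np

# Hypothetical univariate data: X is the predictor, Y the response.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Design matrix with an intercept column, so the solution is [beta0, beta1].
A = np.column_stack([np.ones_like(X), X])
beta0, beta1 = np.linalg.lstsq(A, Y, rcond=None)[0]
print(beta0, beta1)  # estimated intercept and slope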
8. Review: Logistic Regression
What if Yi ∊ {0, 1}? (i.e. we want “classification”)
We model P(Yi = 0 | X = x); thus, 0 is class 0 and 1 is class 1.
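The logistic model formula itself did not survive extraction; a standard way to write it (an assumption about the slide's exact notation) is:

\[
P(Y_i = 1 \mid X = x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}},
\qquad
P(Y_i = 0 \mid X = x) = \frac{1}{1 + e^{\beta_0 + \beta_1 x}} .
\]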
9-11. Logistic Regression with Multiple Feats
Often we want to make a classification based on multiple features.
[Figure: the y-axis is Y (i.e. 1 or 0). To make room for multiple Xs, we drop the y-axis and instead show the decision point; with two features this becomes a decision line, and with more a decision hyperplane.]
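Not stated in this form on the slides, but for reference: the decision point / line / hyperplane is the set of inputs where the linear score is zero, i.e. where the predicted probability equals 0.5:

\[
\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k = 0
\quad\Longleftrightarrow\quad
P(Y = 1 \mid X = x) = 0.5 .
\]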
12. Review: Logistic Regression
What if Yi ∊ {0, 1}? (i.e. we want “classification”)
We’re learning a linear (i.e. flat) separating hyperplane, but fitting it to a logit outcome.
(https://www.linkedin.com/pulse/predicting-outcomes-probabilities-logistic-regression-konstantinidis/)
13. Logistic Regression: Faster Approach
What if Yi ∊ {0, 1}? (i.e. we want “classification”)
To estimate 𝛽, one can use (iteratively) reweighted least squares (Wasserman, 2005; Li, 2010).
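The update rule itself is not legible in the extracted text; a standard statement of the reweighted least squares step for logistic regression is:

\[
\beta^{(t+1)} = \left(X^{\top} W^{(t)} X\right)^{-1} X^{\top} W^{(t)} z^{(t)},
\qquad
z^{(t)} = X \beta^{(t)} + \left(W^{(t)}\right)^{-1}\left(y - p^{(t)}\right),
\]
where \(p^{(t)}_i = P(Y_i = 1 \mid x_i; \beta^{(t)})\) and \(W^{(t)}\) is diagonal with entries \(p^{(t)}_i\,(1 - p^{(t)}_i)\).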
22. N-Fold Cross Validation
Goal: a decent estimate of model accuracy.
[Figure: all data is divided into folds; on each iteration (Iter 1, Iter 2, Iter 3, …) a different fold is held out as the test (and dev) set while the remaining folds are used for training.]
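A minimal sketch of N-fold cross validation with scikit-learn; the dataset and the choice of 5 folds are illustration choices, not from the slides.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Fit on the training folds, score on the held-out fold.
    model = LogisticRegression(max_iter=5000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))  # average held-out accuracy across folds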
31. Overfitting (multidimensional example)
X (5 observations, 6 features) and Y:

  x1    x2    x3    x4   x5    x6    |  Y
  0.5   0     0.6   1    0     0.25  |  1
  0     0.5   0.3   0    0     0     |  1
  0     0     1     1    1     0.5   |  0
  0     0     0     0    1     1     |  0
  0.25  1     1.25  1    0.1   2     |  1

Fitted model: 1.2 + -63*x1 + 179*x2 + 71*x3 + 18*x4 + -59*x5 + 19*x6 = logit(Y)
“Overfitting”: generally due to trying to fit too many features given the number of observations.
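A sketch of this overfitting effect in scikit-learn, using the slide's tiny 5×6 dataset with regularization effectively turned off so the model can chase a perfect fit; the exact coefficients will differ from the slide's.

import numpy as np
from sklearn.linear_model import LogisticRegression

# The slide's toy data: 5 observations, 6 features.
X = np.array([
    [0.5,  0,   0.6,  1, 0,   0.25],
    [0,    0.5, 0.3,  0, 0,   0   ],
    [0,    0,   1,    1, 1,   0.5 ],
    [0,    0,   0,    0, 1,   1   ],
    [0.25, 1,   1.25, 1, 0.1, 2   ],
])
y = np.array([1, 1, 0, 0, 1])

# Nearly unregularized fit (huge C): with 6 features and only 5 rows the model
# can separate the training data, so coefficients grow large and unstable
# (this may emit a convergence warning).
full = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)
print(full.intercept_, full.coef_)   # large, unstable coefficients
print(full.predict_proba(X)[:, 1])   # probabilities pushed toward 0/1 on the training data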
32-33. Logistic Regression - Regularization: Overfitting (multidimensional example), continued
Q: What if only 2 predictors (x1, x2)?

  x1    x2   |  Y
  0.5   0    |  1
  0     0.5  |  1
  0     0    |  0
  0     0    |  0
  0.25  1    |  1

Fitted model: 0 + 2*x1 + 2*x2 = logit(Y)
A: a better fit.
34-37. Logistic Regression - Regularization
L1 Regularization - “The Lasso”
Zeros out features by adding a penalty that keeps the model from perfectly fitting the data.
Set the betas that minimize the L1 penalized loss (J); sometimes written as an equivalent constrained problem (both forms are sketched below).
Solved via gradient descent.
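The loss formulas on these slides did not survive extraction; a standard way to write the L1-penalized (lasso) logistic loss, and the constrained form the slide alludes to, is:

\[
\hat{\beta} = \arg\min_{\beta} \; J(\beta)
= \arg\min_{\beta}\left[\, -\sum_{i=1}^{N} \log P(Y_i = y_i \mid x_i; \beta) \;+\; \lambda \sum_{j=1}^{k} |\beta_j| \,\right]
\]
Sometimes written as the equivalent constrained problem:
\[
\min_{\beta} \; -\sum_{i=1}^{N} \log P(Y_i = y_i \mid x_i; \beta)
\quad \text{subject to} \quad \sum_{j=1}^{k} |\beta_j| \le s .
\]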
38-39. Logistic Regression - Regularization
L2 Regularization - “Ridge”
Shrinks features by adding a penalty that keeps the model from perfectly fitting the data.
Set the betas that minimize the L2 penalized loss (J); sometimes written as an equivalent constrained problem (sketched below).
Solved via gradient descent.
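Likewise, a standard way to write the L2-penalized (ridge) logistic loss, reconstructed rather than recovered verbatim from the slide:

\[
\hat{\beta} = \arg\min_{\beta} \; J(\beta)
= \arg\min_{\beta}\left[\, -\sum_{i=1}^{N} \log P(Y_i = y_i \mid x_i; \beta) \;+\; \lambda \sum_{j=1}^{k} \beta_j^2 \,\right]
\]
or, in constrained form, minimize the unpenalized loss subject to \(\sum_{j=1}^{k} \beta_j^2 \le s\).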
40-43. Linear Regression - Regularization
The same two penalties apply to linear regression:
L2 Regularization - “Ridge”: shrinks features by adding a penalty that keeps the model from perfectly fitting the data.
L1 Regularization - “The Lasso”: zeros out features by adding a penalty that keeps the model from perfectly fitting the data.
The loss here is mean squared error: the same as RSS, but it scales better relative to the penalty lambda. (A scikit-learn sketch contrasting the two follows.)
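A minimal sketch contrasting ridge (L2) and lasso (L1) linear regression in scikit-learn; the synthetic data and alpha values are arbitrary illustration choices, not from the slides.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 30 observations, 10 features, only the first 3 matter.
X = rng.normal(size=(30, 10))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_beta + rng.normal(scale=0.5, size=30)

# Ridge (L2): shrinks all coefficients toward zero, rarely exactly to zero.
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso (L1): tends to zero out the irrelevant coefficients entirely.
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))  # many entries will be exactly 0.0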
44-45. Machine Learning Goal: Generalize to new data
[Figure: the data (X, Y) is split into Training Data and Testing Data (e.g. 80% / 20%); the model is fit on the training data, and we ask whether it holds up on the testing data. Adding a Development Set gives, e.g., a 70% / 10% / 20% train / dev / test split; the penalty (regularization strength) is chosen on the development set. A sketch of this workflow follows.]
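A minimal sketch of that workflow; the dataset, split sizes, and candidate C values are illustration choices, not from the slides.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 70% train, 10% dev, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=2/3, random_state=0)

# Pick the regularization strength (penalty) on the dev set...
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    acc = model.score(X_dev, y_dev)
    if acc > best_acc:
        best_C, best_acc = C, acc

# ...then report accuracy once on the held-out test set.
final = LogisticRegression(C=best_C, max_iter=5000).fit(X_train, y_train)
print(best_C, final.score(X_test, y_test))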
46. Linear / Logistic Regression - Regularization
Logistic regression: for classification, when y is binary.
Solving for the L1 or L2 regularized loss: (a) gradient descent, (b) reweighted least squares, (c) coordinate descent.
Linear regression: for regression, when y is continuous.
Solving for the L1 regularized loss: gradient descent.
Solving for the L2 regularized loss: gradient descent OR the normal equation: 𝛽 = (XᵀX + λI)⁻¹ XᵀY.
47. When do L2 and L1 work well?
Depends on data, but generally we find… (k: number of features; N: number of observations)

        k << N         k < N       k == N   k > N       k >> N
  L2    Sometimes      Often       Often    Sometimes   Almost Never
  L1    Almost Never   Sometimes   Often    Sometimes   Almost Never
48-49. Linear and Logistic Regression Uses:
1. Testing the relationship between variables given other variables. 𝛽 is an “effect size” -- a score for the magnitude of the relationship; it can be tested for significance.
2. Building a predictive model that generalizes to new data. Ŷ is an estimated value of Y given X.
However, unless |X| << the number of observations, the model might “overfit”.
-> Regularized linear regression