Linear Regression
Machine Learning Seminar Series ’11

Nikita Zhiltsov
11 March 2011
Motivating example
Prices of houses in Portland

    Living area (ft²)    #bedrooms    Price ($1000s)
    2104                 3            400
    1600                 3            330
    2400                 3            369
    1416                 2            232
    3000                 4            540
    ...                  ...          ...
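
To make the later formulas easy to try out, the table above can be held in two NumPy arrays. This is only an illustrative sketch; the names X_raw and t are not from the slides:

    import numpy as np

    # Living area (ft^2) and number of bedrooms for the five listed houses
    X_raw = np.array([[2104, 3],
                      [1600, 3],
                      [2400, 3],
                      [1416, 2],
                      [3000, 4]], dtype=float)

    # Target: price in $1000s
    t = np.array([400, 330, 369, 232, 540], dtype=float)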
Motivating example
Plot

    [Scatter plot of house prices against living area]

    How can we predict the prices of other houses as a
    function of the size of their living areas?
Terminology and notation
      x ∈ X – input variables (“features”)
      t ∈ T – a target variable
      {x_n}, n = 1, . . . , N – given N observations of input variables
      (x_n, t_n) – a training example
      (x_1, t_1), . . . , (x_N, t_N) – a training set

  Goal
  Find a function y(x) : X → T (a “hypothesis”) to
  predict the value of t for a new value of x
Terminology and notation




     When the target variable t is continuous
     ⇒ a regression problem
     In the case of discrete values
     ⇒ a classification problem
Terminology and notation
Loss function
    L(t, y(x)) – loss function or cost function
    In the case of regression problems the expected loss is
    given by:

        E[L] = ∫_ℝ ∫_X L(t, y(x)) p(x, t) dx dt

    Example
    Squared loss:

        L(t, y(x)) = (1/2) (y(x) − t)²
Linear basis function models
Linear regression

        y(x, w) = w_0 + w_1 x_1 + · · · + w_D x_D,

    where x = (x_1, . . . , x_D)

    In our example,

        y(x, w) = w_0 + w_1 x_1 + w_2 x_2,

    where x_1 is the living area and x_2 is the number of bedrooms.
Linear basis function models
Basis functions

    Generally

        y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x),

    where φ_j(x) are known as basis functions.
    Typically, φ_0(x) = 1, so that w_0 acts as a bias.
    In the simplest case, we use linear basis functions:
    φ_d(x) = x_d.
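
To make the basis-function idea concrete, here is a hedged sketch of building a design matrix Φ for scalar inputs. The helper design_matrix is invented for illustration; it covers the polynomial and Gaussian bases shown on the next slides, with a constant φ_0(x) = 1 column acting as the bias:

    import numpy as np

    def design_matrix(x, kind="polynomial", M=4, mu=None, s=1.0):
        """Return the N x M matrix Phi with Phi[n, j] = phi_j(x_n) for scalar inputs x."""
        x = np.asarray(x, dtype=float)
        if kind == "polynomial":
            # phi_j(x) = x**j, j = 0..M-1 (the j = 0 column is the bias)
            return np.vander(x, M, increasing=True)
        if kind == "gaussian":
            # phi_0(x) = 1, phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for centres mu_j
            mu = np.linspace(x.min(), x.max(), M - 1) if mu is None else np.asarray(mu)
            Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))
            return np.column_stack([np.ones_like(x), Phi])
        raise ValueError("unknown basis")

For the running example, design_matrix(X_raw[:, 0], kind="polynomial", M=2) would give the [1, living area] matrix of the straight-line fit.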
Linear basis function models
Polynomial basis functions

    Polynomial basis functions:

        φ_j(x) = x^j.

    These are global; a small change in x affects all basis
    functions.

    [Plot of polynomial basis functions on x ∈ [−1, 1]]
Linear basis function models
Gaussian basis functions

    Gaussian basis functions:

        φ_j(x) = exp( −(x − µ_j)² / (2s²) )

    These are local; a small change in x only affects
    nearby basis functions. µ_j and s control location and
    scale (width).

    [Plot of Gaussian basis functions on x ∈ [−1, 1]]
Linear basis function models
Sigmoidal basis functions

    Sigmoidal basis functions:

        φ_j(x) = σ( (x − µ_j) / s ),

    where

        σ(a) = 1 / (1 + exp(−a)).

    Also these are local; a small change in x only affects
    nearby basis functions. µ_j and s control location and
    scale (slope).

    [Plot of sigmoidal basis functions on x ∈ [−1, 1]]
Probabilistic interpretation

  Assume observations from a deterministic function with added
  Gaussian noise:

      t = y(x, w) + ε,   where p(ε|β) = N(ε | 0, β⁻¹),

  which is the same as saying

      p(t|x, w, β) = N(t | y(x, w), β⁻¹).
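
A small hedged sketch of this noise model in NumPy (the weights, precision and sample size are arbitrary choices, not values from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 25.0                                  # noise precision; noise variance is 1/beta
    w_true = np.array([0.3, 1.7])                # assumed weights for this sketch only

    x = rng.uniform(-1.0, 1.0, size=50)
    Phi = np.column_stack([np.ones_like(x), x])  # phi_0(x) = 1, phi_1(x) = x
    t = Phi @ w_true + rng.normal(0.0, 1.0 / np.sqrt(beta), size=x.shape)  # t = y(x, w) + eps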
Probabilistic interpretation
Optimal prediction for a squared loss

    Expected loss:

        E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt,

    which is minimized by the conditional mean

        y(x) = E_t[t|x].

    In our case of a Gaussian conditional distribution, it is

        E[t|x] = ∫ t p(t|x) dt = y(x, w).
Probabilistic interpretation
Optimal prediction for a squared loss

    [Figure: the regression function y(x) passes through the mean of the
    conditional distribution p(t|x); at a particular input x_0 the optimal
    prediction is y(x_0), the mean of p(t|x_0)]
Maximum likelihood and least squares

  Given observed inputs X = {x_1, . . . , x_N } and targets
  t = [t_1, . . . , t_N]^T, we obtain the likelihood function

      p(t|X, w, β) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β⁻¹).

  Taking the logarithm, we get

      ln p(t|w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β⁻¹)
                   = (N/2) ln β − (N/2) ln(2π) − β E_D(w),

  where

      E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))²

  is the sum-of-squares error.
Maximum likelihood and least squares

  Computing the gradient and setting it to zero yields

      ∇_w ln p(t|w, β) = β Σ_{n=1}^{N} (t_n − w^T φ(x_n)) φ(x_n)^T = 0.

  Solving for w, we get

      w_ML = (Φ^T Φ)⁻¹ Φ^T t,

  where (Φ^T Φ)⁻¹ Φ^T is the Moore-Penrose pseudo-inverse of the
  N × M design matrix Φ, with Φ_{nj} = φ_j(x_n):

      ⎡ φ_0(x_1)  φ_1(x_1)  · · ·  φ_{M−1}(x_1) ⎤
      ⎢ φ_0(x_2)  φ_1(x_2)  · · ·  φ_{M−1}(x_2) ⎥
      ⎢    ⋮          ⋮       ⋱        ⋮        ⎥
      ⎣ φ_0(x_N)  φ_1(x_N)  · · ·  φ_{M−1}(x_N) ⎦
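
In code, the closed-form solution is a single least-squares solve. A minimal sketch, assuming a design matrix Phi and target vector t as in the earlier snippets; np.linalg.lstsq (or np.linalg.pinv) is preferable to forming (Φ^T Φ)⁻¹ explicitly:

    import numpy as np

    def fit_ml(Phi, t):
        # w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via a stable least-squares solve
        w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w_ml

np.linalg.pinv(Phi) @ t gives the same answer via the Moore-Penrose pseudo-inverse named above.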
Maximum likelihood and least squares
Bias parameter

    Rewritten error function:

        E_D(w) = (1/2) Σ_{n=1}^{N} (t_n − w_0 − Σ_{j=1}^{M−1} w_j φ_j(x_n))²

    Setting the derivative w.r.t. w_0 equal to zero, we obtain

        w_0 = t̄ − Σ_{j=1}^{M−1} w_j φ̄_j,

    where

        t̄ = (1/N) Σ_{n=1}^{N} t_n,    φ̄_j = (1/N) Σ_{n=1}^{N} φ_j(x_n).
Maximum likelihood and least squares

      ln p(t|w_ML, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w_ML)

  Maximizing the log likelihood function w.r.t. the
  noise precision parameter β, we obtain

      1/β_ML = (1/N) Σ_{n=1}^{N} (t_n − w_ML^T φ(x_n))²
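
Continuing the sketch, the noise precision estimate is the reciprocal of the mean squared residual (names invented here):

    import numpy as np

    def fit_noise_precision(Phi, t, w_ml):
        # 1/beta_ML = average squared residual of the ML fit
        residuals = t - Phi @ w_ml
        return 1.0 / np.mean(residuals ** 2)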
Geometry of least squares

  Consider

      y = Φ w_ML = [ϕ_1, . . . , ϕ_M] w_ML,    y ∈ S ⊆ T,  t ∈ T.

  S is spanned by the columns ϕ_1, . . . , ϕ_M.
  w_ML minimizes the distance between t and its orthogonal
  projection onto S, namely y.

  [Figure: the target vector t in T and its orthogonal projection y
  onto the subspace S spanned by ϕ_1 and ϕ_2]
Batch learning
Batch gradient descent

    Consider the gradient descent algorithm, which
    starts with some initial w^(0):

        w^(τ+1) = w^(τ) − η ∇E_D
                = w^(τ) + η Σ_{n=1}^{N} (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

    This is known as the least-mean-squares (LMS)
    algorithm.
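
A hedged sketch of the batch update (learning rate, iteration count and names are my own choices; with raw square-footage features the inputs usually need rescaling for the iteration to converge):

    import numpy as np

    def batch_gradient_descent(Phi, t, eta=1e-2, n_iters=1000, w_init=None):
        # w^(tau+1) = w^(tau) + eta * sum_n (t_n - w^(tau)^T phi(x_n)) phi(x_n)
        w = np.zeros(Phi.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
        for _ in range(n_iters):
            residuals = t - Phi @ w              # length-N vector of prediction errors
            w = w + eta * (Phi.T @ residuals)    # gradient step on the sum-of-squares error
        return w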
Batch gradient descent
Example calculation

    In the case of ordinary least squares with the living area as the only
    feature, we start from w_0^(0) = 48, w_1^(0) = 30 ...
Batch gradient descent
Results of the example calculation

    ... and obtain the result w_0 = 71.27, w_1 = 0.1345.
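
As a rough sanity check on those numbers (prices in $1000s, living area in ft²), the fitted line predicts about 71.27 + 0.1345 × 2104 ≈ 354, i.e. roughly $354k, for the first house in the motivating table, whose listed price is $400k.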
Sequential learning

  Data items considered one at a time (a.k.a. online
  learning); use stochastic (sequential) gradient
  descent:

      w^(τ+1) = w^(τ) − η ∇E_n
              = w^(τ) + η (t_n − (w^(τ))^T φ(x_n)) φ(x_n).

  Issue: how to choose η?
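
The corresponding stochastic update, in the same hypothetical style (one pass over shuffled examples; how to choose or decay η is left open, as the slide notes):

    import numpy as np

    def sgd_epoch(Phi, t, w, eta=1e-3, rng=None):
        # One sequential pass: update w after every single example
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w, dtype=float)
        for n in rng.permutation(len(t)):
            error = t[n] - Phi[n] @ w            # t_n - w^T phi(x_n)
            w = w + eta * error * Phi[n]
        return w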
Underfitting and overfitting

    [Figure illustrating underfitting and overfitting]
Regularization
Outlier

    [Figure: regression fit in the presence of an outlier]
Regularized least squares

  Consider the error function:

      E_D(w) + λ E_W(w)
      (data term + regularization term)

  λ is called the regularization coefficient.
  With the sum-of-squares error function and a quadratic
  regularizer, we get

      (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) w^T w,

  which is minimized by

      w = (λI + Φ^T Φ)⁻¹ Φ^T t.
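
A hedged sketch of the regularized closed form (lam stands for the regularization coefficient λ; the function name is invented here):

    import numpy as np

    def fit_ridge(Phi, t, lam=1.0):
        # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

Unlike the unregularized solution, λI + Φ^T Φ is invertible for any λ > 0, which is one practical benefit of the quadratic regularizer.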
Regularized least squares

  With a more general regularizer, we have

      (1/2) Σ_{n=1}^{N} (t_n − w^T φ(x_n))² + (λ/2) Σ_{j=1}^{M} |w_j|^q

  [Figure: contours of the regularization term for q = 0.5, q = 1 (lasso),
  q = 2 (quadratic) and q = 4]
Regularized least squares

  Lasso tends to generate sparser solutions than a
  quadratic regularizer.

  [Figure: contours of the unregularized error function together with the
  lasso (left) and quadratic (right) constraint regions in (w_1, w_2) space]
Multiple outputs

  Analogously to the single output case, we have:

      p(t|x, W, β) = N(t | y(x, W), β⁻¹ I)
                   = N(t | W^T φ(x), β⁻¹ I).

  Given observed inputs X = {x_1, . . . , x_N } and targets
  T = [t_1, . . . , t_N]^T, we obtain the log likelihood function

      ln p(T|X, W, β) = Σ_{n=1}^{N} ln N(t_n | W^T φ(x_n), β⁻¹ I)
                      = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ||t_n − W^T φ(x_n)||²,

  where K is the dimensionality of each target t_n.
Multiple outputs

  Maximizing w.r.t. W, we obtain

      W_ML = (Φ^T Φ)⁻¹ Φ^T T.

  If we consider a single target variable t_k, we see that

      w_k = (Φ^T Φ)⁻¹ Φ^T t_k,

  where t_k = [t_{1k}, . . . , t_{Nk}]^T, which is identical to
  the single output case.
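
Because every output shares the same pseudo-inverse, the multi-output fit is the same one-liner applied column-wise; a hedged sketch (T is the N × K target matrix, names invented here):

    import numpy as np

    def fit_ml_multi(Phi, T):
        # W_ML = (Phi^T Phi)^{-1} Phi^T T; each column of W_ML solves its own
        # single-output least-squares problem with the shared design matrix Phi
        W_ml, *_ = np.linalg.lstsq(Phi, T, rcond=None)
        return W_ml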
Resources

  • Stanford Engineering Everywhere CS229 – Machine Learning
    http://videolectures.net/stanfordcs229f07_machine_learning/
  • Bishop C.M. Pattern Recognition and Machine Learning. Springer, 2006.
    http://research.microsoft.com/en-us/um/people/cmbishop/prml/
