The first report of the Machine Learning Seminar organized by the Computational Linguistics Laboratory at Kazan Federal University. See http://cll.niimm.ksu.ru/cms/lang/en_US/main/seminars/mlseminar
2. Motivating example
Prices of houses in Portland
Living area (feet²)   #bedrooms   Price (1000$s)
2104                  3           400
1600                  3           330
2400                  3           369
1416                  2           232
3000                  4           540
...                   ...         ...
3. Motivating example
[Plot: house price vs. living area]
How can we predict the prices of other houses as a
function of the size of their living areas?
4. Terminology and notation
x ∈ X – input variables (“features”)
t ∈ T – a target variable
{xn }, n = 1, . . . , N – given N observations of
input variables
(xn , tn ) – a training example
(x1 , t1 ), . . . , (xN , tN ) – a training set
Goal
Find a function y(x) : X → T ("hypothesis") to
predict the value of t for a new value of x
5. Terminology and notation
When the target variable t is continuous
⇒ a regression problem
In the case of discrete values
⇒ a classification problem
6. Terminology and notation
Loss function
L(t, y(x)) – loss function or cost function
In the case of regression problems, the expected loss is
given by:
E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt
Example
Squared loss:
L(t, y(x)) = (1/2) (y(x) − t)²
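As an illustrative aside (not part of the slides), the expected loss can be approximated numerically by averaging the squared loss over samples drawn from p(x, t); the toy joint distribution and the candidate predictor below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def squared_loss(t, y):
    """L(t, y(x)) = (1/2) * (y - t)^2."""
    return 0.5 * (y - t) ** 2

# Toy joint distribution p(x, t): x ~ Uniform(0, 1), t = 2x + Gaussian noise.
x = rng.uniform(0.0, 1.0, size=100_000)
t = 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)

y_pred = 2.0 * x                          # a candidate predictor y(x) = 2x
print(squared_loss(t, y_pred).mean())     # Monte Carlo estimate of E[L], ~0.005 here
```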
7. Linear basis function models
Linear regression
y(x, w) = w0 + w1 x1 + · · · + wD xD ,
where x = (x1 , . . . , xD )
In our example,
y(x, w) = w0 + w1 x1 + w2 x2 ,
where x1 is the living area and x2 is the number of bedrooms.
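A minimal sketch of this two-feature model on the Portland data from slide 2; the weight values are arbitrary placeholders rather than fitted parameters.

```python
import numpy as np

# Portland training set from slide 2: x = (living area, #bedrooms), t = price in $1000s.
X = np.array([[2104, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]], dtype=float)
t = np.array([400, 330, 369, 232, 540], dtype=float)

w = np.array([10.0, 0.15, 5.0])   # [w0, w1, w2] -- arbitrary values, for illustration only

def predict(x, w):
    """y(x, w) = w0 + w1*x1 + w2*x2."""
    return w[0] + x @ w[1:]

print(predict(X, w))   # predicted prices for the five training houses
```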
8. Linear basis function models
Basis functions
Generally,
y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀφ(x),
where φj (x) are known as basis functions.
Typically, φ0 (x) = 1, so that w0 acts as a bias.
In the simplest case, we use linear basis functions:
φd (x) = xd .
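The same model written as a single inner product once φ(x) is assembled with φ0(x) = 1; a sketch using the simple linear (identity) basis mentioned above, with placeholder weights.

```python
import numpy as np

def phi(x):
    """Feature vector phi(x) = [1, x_1, ..., x_D] (linear basis with a bias term)."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def y(x, w):
    """y(x, w) = w^T phi(x)."""
    return w @ phi(x)

w = np.array([10.0, 0.15, 5.0])    # [w0, w1, w2], placeholder values
print(y([2104, 3], w))             # prediction for a 2104 ft^2, 3-bedroom house
```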
9. Linear basis function models
Polynomial basis functions
Polynomial basis functions:
φj(x) = x^j.
These are global; a small change in x affects all basis
functions.
[Plot: polynomial basis functions on −1 ≤ x ≤ 1]
10. Linear basis function models
Gaussian basis functions
Gaussian basis functions:
φj(x) = exp(−(x − µj)² / (2s²)).
These are local; a small change in x only affects nearby
basis functions. µj and s control location and scale (width).
[Plot: Gaussian basis functions on −1 ≤ x ≤ 1]
11. Linear basis function models
Sigmoidal basis functions
Sigmoidal basis functions:
φj(x) = σ((x − µj)/s),
where
σ(a) = 1 / (1 + exp(−a)).
These are also local; a small change in x only affects nearby
basis functions. µj and s control location and scale (slope).
[Plot: sigmoidal basis functions on −1 ≤ x ≤ 1]
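A small sketch of the three basis-function families from slides 9–11; the centres µj and the scale s are arbitrary choices for illustration.

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x**j for j = 0, ..., M-1 (global)."""
    return np.stack([x ** j for j in range(M)], axis=-1)

def gaussian_basis(x, mus, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) (local)."""
    return np.exp(-(x[:, None] - mus[None, :]) ** 2 / (2 * s ** 2))

def sigmoid_basis(x, mus, s):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mus[None, :]) / s))

x = np.linspace(-1, 1, 5)
mus = np.linspace(-1, 1, 4)                    # arbitrary centres
print(polynomial_basis(x, 4).shape)            # (5, 4)
print(gaussian_basis(x, mus, s=0.3).shape)     # (5, 4)
print(sigmoid_basis(x, mus, s=0.1).shape)      # (5, 4)
```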
12. Probabilistic interpretation
Assume observations from a deterministic function with added
Gaussian noise:
t = y(x, w) + ε, where p(ε|β) = N(ε | 0, β⁻¹),
which is the same as saying
p(t | x, w, β) = N(t | y(x, w), β⁻¹).
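A quick sketch of sampling from this noise model; the underlying linear function and the precision β are invented values.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 25.0                       # noise precision; variance = 1 / beta (assumed value)

def y(x, w):
    return w[0] + w[1] * x        # a simple 1-D linear model for the sketch

w_true = np.array([0.5, 2.0])     # invented "true" parameters
x = rng.uniform(0.0, 1.0, size=20)
eps = rng.normal(0.0, np.sqrt(1.0 / beta), size=x.shape)
t = y(x, w_true) + eps            # t = y(x, w) + eps,  p(eps | beta) = N(eps | 0, beta^-1)
```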
13. Probabilistic interpretation
Optimal prediction for a squared loss
Expected loss:
E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt,
which is minimized by the conditional mean
y(x) = E[t | x].
In our case of a Gaussian conditional distribution, it is
E[t | x] = ∫ t p(t|x) dt = y(x, w).
15. Maximum likelihood and least squares
Given observed inputs X = {x1, . . . , xN} and targets
t = [t1, . . . , tN]ᵀ, we obtain the likelihood function
p(t | X, w, β) = ∏_{n=1}^{N} N(tn | wᵀφ(xn), β⁻¹).
Taking the logarithm, we get
ln p(t | w, β) = Σ_{n=1}^{N} ln N(tn | wᵀφ(xn), β⁻¹)
               = (N/2) ln β − (N/2) ln(2π) − β ED(w),
where
ED(w) = (1/2) Σ_{n=1}^{N} (tn − wᵀφ(xn))²
is the sum-of-squares error.
16. Maximum likelihood and least squares
Computing the gradient and setting it to zero yields
∇w ln p(t | w, β) = β Σ_{n=1}^{N} (tn − wᵀφ(xn)) φ(xn)ᵀ = 0.
Solving for w, we get
wML = (ΦᵀΦ)⁻¹Φᵀ t,
where (ΦᵀΦ)⁻¹Φᵀ is the Moore–Penrose pseudo-inverse of the
matrix Φ:

Φ = [ φ0(x1)  φ1(x1)  ...  φM−1(x1)
      φ0(x2)  φ1(x2)  ...  φM−1(x2)
      ...
      φ0(xN)  φ1(xN)  ...  φM−1(xN) ]
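A sketch of the closed-form solution on toy data; np.linalg.pinv applies the Moore–Penrose pseudo-inverse directly, while np.linalg.lstsq is the numerically safer equivalent. The data and basis choice below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=50)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, size=x.shape)    # toy targets

M = 4
Phi = np.stack([x ** j for j in range(M)], axis=1)          # design matrix, Phi[n, j] = phi_j(x_n)

# w_ML = (Phi^T Phi)^{-1} Phi^T t  (pseudo-inverse applied to t)
w_ml = np.linalg.pinv(Phi) @ t
# Equivalent and numerically preferable:
w_ml_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml, w_ml_lstsq)
```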
17. Maximum likelihood and least squares
Bias parameter
Rewritten error function:
ED(w) = (1/2) Σ_{n=1}^{N} (tn − w0 − Σ_{j=1}^{M−1} wj φj(xn))²
Setting the derivative w.r.t. w0 equal to zero, we obtain
w0 = t̄ − Σ_{j=1}^{M−1} wj φ̄j,
where
t̄ = (1/N) Σ_{n=1}^{N} tn,   φ̄j = (1/N) Σ_{n=1}^{N} φj(xn).
18. Maximum likelihood and least squares
ln p(t | wML, β) = (N/2) ln β − (N/2) ln(2π) − β ED(wML)
Maximizing the log likelihood function w.r.t. the
noise precision parameter β, we obtain
1/βML = (1/N) Σ_{n=1}^{N} (tn − wMLᵀφ(xn))².
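Continuing the earlier pseudo-inverse sketch (with its design matrix Phi, targets t, and weights w_ml), the maximum likelihood noise precision follows directly; this helper is a sketch, not part of the slides.

```python
import numpy as np

def noise_precision_ml(Phi, t, w_ml):
    """beta_ML from 1 / beta_ML = (1/N) * sum_n (t_n - w_ML^T phi(x_n))^2."""
    residuals = t - Phi @ w_ml
    return 1.0 / np.mean(residuals ** 2)
```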
19. Geometry of least squares
Consider
y = Φ wML = [ϕ1, . . . , ϕM] wML ,
where y ∈ S ⊆ T, t ∈ T, and S is spanned by ϕ1, . . . , ϕM.
wML minimizes the distance between t and its orthogonal
projection onto S, i.e. y.
[Figure: t in T and its orthogonal projection y onto the subspace S spanned by ϕ1 and ϕ2]
20. Batch learning
Batch gradient descent
Consider the gradient descent algorithm, which
starts with some initial w(0):
w(τ+1) = w(τ) − η ∇ED
       = w(τ) + η Σ_{n=1}^{N} (tn − w(τ)ᵀφ(xn)) φ(xn).
This is known as the least-mean-squares (LMS)
algorithm.
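A minimal sketch of the batch update; the learning rate η, the iteration count, and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def batch_gradient_descent(Phi, t, eta=0.01, n_iters=1000):
    """Batch updates: w <- w + eta * sum_n (t_n - w^T phi(x_n)) phi(x_n)."""
    w = np.zeros(Phi.shape[1])                  # w(0) = 0; any initial guess would do
    for _ in range(n_iters):
        w = w + eta * Phi.T @ (t - Phi @ w)     # full-batch gradient step
    return w

# Toy usage: fit a line to a few noiseless points.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Phi = np.stack([np.ones_like(x), x], axis=1)    # phi(x) = [1, x]
t = 1.0 + 2.0 * x
print(batch_gradient_descent(Phi, t))           # should approach [1, 2]
```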
21. Batch gradient descent
Example calculation
In the case of ordinary least squares with only the living-area
feature, we start from the initial values w0 = 48, w1 = 30 ...
23. Sequential learning
Data items considered one at a time (a.k.a. online
learning); use stochastic (sequential) gradient
descent:
w(τ+1) = w(τ) − η ∇En
       = w(τ) + η (tn − w(τ)ᵀφ(xn)) φ(xn).
Issue: how to choose η?
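The same toy fit with the sequential update, one training example per step; the constant η used here is an assumption (in practice it is often decayed over time).

```python
import numpy as np

def sgd(Phi, t, eta=0.05, n_epochs=200, seed=0):
    """Sequential LMS: w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n), one n at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):        # visit training examples in random order
            w = w + eta * (t[n] - w @ Phi[n]) * Phi[n]
    return w

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
Phi = np.stack([np.ones_like(x), x], axis=1)
t = 1.0 + 2.0 * x
print(sgd(Phi, t))                               # again close to [1, 2]
```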
26. Regularized least squares
Consider the error function:
ED (w) + λEW (w)
Data term + Regularization term
λ is called the regularization coefficient
With the sum-of-squares error function and a quadratic
regularizer, we get
(1/2) Σ_{n=1}^{N} (tn − wᵀφ(xn))² + (λ/2) wᵀw,
which is minimized by
w = (λI + ΦᵀΦ)⁻¹Φᵀ t.
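A sketch of the closed-form regularized (quadratic-penalty) solution; λ and the toy data are placeholders chosen for the example.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=30)
t = np.sin(np.pi * x) + rng.normal(0, 0.1, size=x.shape)
Phi = np.stack([x ** j for j in range(9)], axis=1)    # degree-8 polynomial basis
w = ridge_fit(Phi, t, lam=1e-3)                       # small lambda tempers over-fitting
print(w)
```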
27. Regularized least squares
With a more general regularizer, we have
(1/2) Σ_{n=1}^{N} (tn − wᵀφ(xn))² + (λ/2) Σ_{j=1}^{M} |wj|^q
[Figure: contours of the regularization term for q = 0.5, q = 1 (lasso), q = 2 (quadratic), and q = 4]
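The general error function as a small helper (a sketch, not from the slides); note that for q = 1 (the lasso) the regularized error is no longer quadratic in w, so it is usually minimized iteratively (e.g. by coordinate descent) rather than in closed form.

```python
import numpy as np

def regularized_error(w, Phi, t, lam, q):
    """E(w) = (1/2) * sum_n (t_n - w^T phi(x_n))^2 + (lam/2) * sum_j |w_j|^q."""
    data_term = 0.5 * np.sum((t - Phi @ w) ** 2)
    reg_term = 0.5 * lam * np.sum(np.abs(w) ** q)
    return data_term + reg_term
```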
28. Regularized least squares
Lasso tends to generate sparser solutions than a
quadratic regularizer
[Figure: error contours and constraint regions in the (w1, w2) plane for the quadratic regularizer and the lasso]
29. Multiple outputs
Analogously to the single-output case we have:
p(t | x, W, β) = N(t | y(W, x), β⁻¹I)
              = N(t | Wᵀφ(x), β⁻¹I).
Given observed inputs X = {x1, . . . , xN} and targets
T = [t1, . . . , tN]ᵀ, we obtain the log likelihood function
ln p(T | X, W, β) = Σ_{n=1}^{N} ln N(tn | Wᵀφ(xn), β⁻¹I)
                  = (NK/2) ln(β/(2π)) − (β/2) Σ_{n=1}^{N} ||tn − Wᵀφ(xn)||².
30. Multiple outputs
Maximizing w.r.t. W, we obtain
WML = (ΦᵀΦ)⁻¹Φᵀ T.
If we consider a single target variable tk, we see that
wk = (ΦᵀΦ)⁻¹Φᵀ tk,
where tk = [t1k, . . . , tNk]ᵀ, which is identical with
the single-output case.
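A sketch of the multiple-output solution on invented data, checking the observation above that each column of WML is the single-output solution for the corresponding target column.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=40)
Phi = np.stack([x ** j for j in range(3)], axis=1)           # N x M design matrix
T = np.stack([np.sin(np.pi * x), np.cos(np.pi * x)], axis=1)
T = T + rng.normal(0, 0.1, size=T.shape)                     # N x K noisy targets (K = 2)

W_ml = np.linalg.pinv(Phi) @ T                               # M x K: one column w_k per target
# Each column coincides with the single-output solution for that target column:
assert np.allclose(W_ml[:, 0], np.linalg.pinv(Phi) @ T[:, 0])
```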