Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VI
Linear Regression 2

Bayesian Regression
Sequential Update of the Posterior
Predictive Distribution
Proof of the Predictive Distribution
Predictive Distribution with Simplified Prior
Limitations of Linear Basis Function Models

Bayesian Regression

[Figure: Training phase: training data x and training targets t are fed
into a model with adjustable parameter w, and the most appropriate w* is
fixed. Test phase: test data x is fed into the model with the fixed
parameter w* to produce the test target t.]

Bayesian Regression

Bayes' theorem: the posterior is the likelihood times the prior over the
normalisation,

    p(w | t) = \frac{p(t | w) \, p(w)}{p(t)}

Likelihood for i.i.d. data:

    p(t | w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})
             = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
             = \mathrm{const} \times \exp\{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)\}

where we have left out the conditioning on x (always assumed) and on
\beta, which is assumed to be constant.

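A minimal numpy sketch of this log likelihood (the function name and the
precomputed design matrix Phi are illustrative assumptions, not from the
slides):

    import numpy as np

    def log_likelihood(w, Phi, t, beta):
        # log p(t | w) = sum_n log N(t_n | w^T phi(x_n), beta^{-1})
        N = len(t)
        resid = t - Phi @ w
        return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid
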
How to choose a prior?

Can we find a prior for the given likelihood which
- makes sense for the problem at hand?
- allows us to find a posterior in a 'nice' form?

An answer to the second question:

Definition (Conjugate Prior). A class of prior probability distributions
p(w) is conjugate to a class of likelihood functions p(x | w) if the
resulting posterior distributions p(w | x) are in the same family as p(w).

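As a concrete illustration of conjugacy (a sketch, not from the slides):
a Bernoulli likelihood with a Beta(a, b) prior yields a Beta posterior,
with the observed counts simply added to the prior parameters.

    def beta_bernoulli_update(a, b, data):
        # Beta(a, b) prior with Bernoulli observations gives a Beta posterior:
        # each observed 1 adds to a, each observed 0 adds to b.
        k = sum(data)
        return a + k, b + len(data) - k

    # Uniform prior Beta(1, 1) and observations [1, 0, 1, 1] give Beta(4, 2).
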
Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood    Conjugate Prior
    Bernoulli     Beta
    Binomial      Beta
    Poisson       Gamma
    Multinomial   Dirichlet

Table: Continuous likelihood distributions

    Likelihood            Conjugate Prior
    Uniform               Pareto
    Exponential           Gamma
    Normal                Normal
    Multivariate normal   Multivariate normal

Conjugate Prior to a Gaussian Distribution

Example: The Gaussian family is conjugate to itself with respect to a
Gaussian likelihood function: if the likelihood function is Gaussian,
choosing a Gaussian prior will ensure that the posterior distribution is
also Gaussian.

Given a marginal distribution for x and a conditional Gaussian
distribution for y given x in the form

    p(x)     = \mathcal{N}(x \mid \mu, \Lambda^{-1})
    p(y | x) = \mathcal{N}(y \mid A x + b, L^{-1})

we get

    p(y)     = \mathcal{N}(y \mid A\mu + b, L^{-1} + A \Lambda^{-1} A^T)
    p(x | y) = \mathcal{N}(x \mid \Sigma \{A^T L (y - b) + \Lambda \mu\}, \Sigma)

where \Sigma = (\Lambda + A^T L A)^{-1}.

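These identities can be spot-checked by sampling; the following numpy
sketch compares the empirical mean and covariance of y with the closed
form (all dimensions and parameter values are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 2.0]); Lam = np.eye(2)        # p(x) = N(x | mu, Lam^{-1})
    A = np.array([[1.0, 0.5]]); b = np.array([0.3])   # p(y | x) = N(y | Ax + b, L^{-1})
    L = np.eye(1)

    x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=200_000)
    noise = rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=200_000)
    y = x @ A.T + b + noise

    print(y.mean(axis=0), A @ mu + b)                                    # means agree
    print(np.cov(y.T), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)  # covariances agree
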
Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0:

    p(w) = \mathcal{N}(w \mid m_0, S_0)

After having seen N training data pairs (x_n, t_n), the posterior for the
given likelihood is

    p(w | t) = \mathcal{N}(w \mid m_N, S_N)

where

    m_N      = S_N (S_0^{-1} m_0 + \beta \Phi^T t)
    S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

The posterior is Gaussian, therefore mode = mean: the maximum posterior
weight vector is w_{MAP} = m_N.

For an infinitely broad prior S_0 = \alpha^{-1} I with \alpha \to 0, the
mean reduces to the maximum likelihood solution w_{ML}.

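A direct numpy translation of these update equations (a sketch; the
function name is an assumption):

    import numpy as np

    def posterior(Phi, t, m0, S0, beta):
        # p(w | t) = N(w | mN, SN) for a Gaussian prior N(w | m0, S0)
        S0_inv = np.linalg.inv(S0)
        SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
        return mN, SN
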
Bayesian Regression

If we have not yet seen any data point (N = 0), the posterior is equal to
the prior.

Sequential arrival of data points: the posterior distribution calculated
after the arrival of each data point and target value acts as the prior
distribution for the subsequent data point. This fits nicely into a
sequential learning framework.

Bayesian Regression

In the remainder we use a special simplified prior with m_0 = 0 and
S_0 = \alpha^{-1} I:

    p(w | \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)        (1)

The parameters of the posterior distribution
p(w | t) = \mathcal{N}(w \mid m_N, S_N) are now

    m_N      = \beta S_N \Phi^T t
    S_N^{-1} = \alpha I + \beta \Phi^T \Phi

For \alpha \to 0 we get

    m_N \to w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t

The log of the posterior is the sum of the log likelihood and the log of
the prior:

    \ln p(w | t) = -\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)
                   - \tfrac{\alpha}{2} w^T w + \mathrm{const}

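A quick numerical check of the \alpha \to 0 limit (a sketch with made-up
data):

    import numpy as np

    rng = np.random.default_rng(1)
    Phi = rng.normal(size=(50, 3))     # made-up design matrix
    t = rng.normal(size=50)            # made-up targets
    beta = 1.0
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

    for alpha in (1.0, 1e-3, 1e-8):
        SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
        mN = beta * SN @ Phi.T @ t
        print(alpha, np.linalg.norm(mN - w_ml))   # distance shrinks as alpha -> 0
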
Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of
the prior:

    \ln p(w | t) = \underbrace{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)}_{\text{sum-of-squares error}}
                   \underbrace{-\tfrac{\alpha}{2} w^T w}_{\text{quadratic regulariser}}
                   + \mathrm{const}

Maximising the posterior distribution with respect to w corresponds to
minimising the sum-of-squares error function with the addition of a
quadratic regularisation term with \lambda = \alpha / \beta.

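This equivalence is easy to verify numerically (a sketch with made-up
data; lam plays the role of \lambda = \alpha / \beta):

    import numpy as np

    rng = np.random.default_rng(2)
    Phi = rng.normal(size=(30, 4)); t = rng.normal(size=30)
    alpha, beta = 2.0, 25.0

    mN = beta * np.linalg.solve(alpha * np.eye(4) + beta * Phi.T @ Phi, Phi.T @ t)
    lam = alpha / beta
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
    print(np.allclose(mN, w_ridge))    # True: MAP mean == regularised least squares
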
Sequential Update of the Posterior

Example of a linear basis function model:
- single input x, single output t
- linear model y(x, w) = w_0 + w_1 x

Data creation:
1. Choose x_n from the uniform distribution U(x | -1, 1).
2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = -0.3 and a_1 = 0.5.
3. Add Gaussian noise with standard deviation \sigma = 0.2:
   t_n \sim \mathcal{N}(f(x_n, a), 0.04)

Set the precision of the Gaussian prior over w to \alpha = 2.0.

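A sketch of this data-generation and sequential-update loop (plotting
omitted; the loop length of 20 points is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(3)
    a0, a1, sigma, alpha = -0.3, 0.5, 0.2, 2.0
    beta = 1.0 / sigma**2                        # noise precision (sigma = 0.2 -> beta = 25)

    m = np.zeros(2); S = np.eye(2) / alpha       # prior N(w | 0, alpha^{-1} I)
    for n in range(20):
        x = rng.uniform(-1.0, 1.0)
        t = a0 + a1 * x + rng.normal(0.0, sigma)
        phi = np.array([1.0, x])                 # basis for y(x, w) = w0 + w1*x
        # the current posterior acts as the prior for the next data point
        S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
        m = S_new @ (np.linalg.inv(S) @ m + beta * t * phi)
        S = S_new
    print(m)                                     # approaches (a0, a1) = (-0.3, 0.5)
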
Sequential Update of the Posterior

[Figures: sequential update of the posterior for this example, shown
before any data arrive and after successive data points; cf. the
corresponding figures in Bishop, "Pattern Recognition and Machine
Learning".]

Predictive Distribution

In the training phase, data \mathbf{x} and targets \mathbf{t} are
provided. In the test phase, a new data value x is given and the
corresponding target value t is asked for.

Bayesian approach: find the probability of the test target t given the
test data x, the training data \mathbf{x} and the training targets
\mathbf{t}:

    p(t \mid x, \mathbf{x}, \mathbf{t})

This is the predictive distribution.

How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule:

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, w \mid x, \mathbf{x}, \mathbf{t}) \, dw
                                        = \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw

The test target t depends only on the test data x and the model
parameter w, not on the training data and training targets:

    p(t \mid w, x, \mathbf{x}, \mathbf{t}) = p(t \mid w, x)

The model parameters w are learned from the training data \mathbf{x} and
the training targets \mathbf{t} only:

    p(w \mid x, \mathbf{x}, \mathbf{t}) = p(w \mid \mathbf{x}, \mathbf{t})

Predictive distribution:

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x) \, p(w \mid \mathbf{x}, \mathbf{t}) \, dw

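The integral can also be approximated by Monte Carlo: draw samples of w
from the posterior and then sample t from p(t | w, x). A sketch, assuming
the Gaussian posterior N(w | mN, SN) derived earlier:

    import numpy as np

    def predictive_samples(phi_x, mN, SN, beta, n_samples=5000, seed=0):
        # draw w ~ p(w | data), then t ~ p(t | w, x) = N(w^T phi(x), beta^{-1})
        rng = np.random.default_rng(seed)
        W = rng.multivariate_normal(mN, SN, size=n_samples)
        y = W @ phi_x                                    # y(x, w) for each sample
        return y + rng.normal(0.0, 1.0 / np.sqrt(beta), size=n_samples)
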
Proof of the Predictive Distribution

How do we prove the predictive distribution in the general form

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw ?

Convert each conditional probability on the right-hand side into a joint
probability:

    \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw
      = \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(w, x, \mathbf{x}, \mathbf{t})}
             \frac{p(w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})} \, dw
      = \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})} \, dw
      = \frac{p(t, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}
      = p(t \mid x, \mathbf{x}, \mathbf{t})

Predictive Distribution with Simplified Prior

Find the predictive distribution

    p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw

(Remember: the conditioning on the input variables x is often suppressed
to simplify the notation.)

We know the likelihood

    p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})

and the posterior

    p(w \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)

where

    m_N      = \beta S_N \Phi^T t
    S_N^{-1} = \alpha I + \beta \Phi^T \Phi

Predictive Distribution with Simplified Prior

If we do the convolution of the two Gaussians, we get for the predictive
distribution

    p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))

where the variance \sigma_N^2(x) is given by

    \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x).

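In closed form, the predictive mean and variance are a few lines of numpy
(a sketch, assuming mN and SN computed as above):

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        # mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)
        mean = mN @ phi_x
        var = 1.0 / beta + phi_x @ SN @ phi_x
        return mean, var
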
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2\pi x) (green) and
added noise, for N = 1, 2, 4 and 25 data points.

[Figures: for each N, the mean of the predictive distribution (red) and
the region of one standard deviation around the mean (red shaded), over
x \in [0, 1] with t \in [-1, 1]; the predictive uncertainty shrinks as N
grows.]

Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior
distribution over w, for N = 1, 2, 4 and 25 data points.

[Figures: for each N, curves y(x, w) for several samples of w drawn from
the posterior, plotted over x \in [0, 1] with t \in [-1, 1].]

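Curves like these can be generated by drawing weight vectors from the
posterior and plotting y(x, w) = w^T \phi(x); a sketch with Gaussian
basis functions (the basis choice and its parameters are assumptions,
the slides do not specify them):

    import numpy as np

    def gaussian_basis(x, centres, s=0.1):
        # phi(x): bias term plus Gaussian bumps at the given centres
        return np.concatenate(([1.0], np.exp(-(x - centres)**2 / (2.0 * s**2))))

    def sample_curves(xs, mN, SN, centres, n_curves=5, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.multivariate_normal(mN, SN, size=n_curves)   # w ~ p(w | t)
        Phi = np.array([gaussian_basis(x, centres) for x in xs])
        return Phi @ W.T                                     # column j is curve y(x, w_j)
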
Limitations of Linear Basis Function Models

The basis functions \phi_j(x) are fixed before the training data set is
observed.

Curse of dimensionality: the number of basis functions grows rapidly,
often exponentially, with the dimensionality D.

But typical data sets have two nice properties which can be exploited if
the basis functions are not fixed:
- Data lie close to a nonlinear manifold whose intrinsic dimension is
  much smaller than D. We need algorithms which place basis functions
  only where data are (e.g. radial basis function networks, support
  vector machines, relevance vector machines, neural networks).
- Target variables may depend only on a few significant directions
  within the data manifold. We need algorithms which can exploit this
  property (neural networks).

Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using
the intuition from our 3-dimensional world as a vector space. There are
no surprises as long as n is finite.

If we add more structure to a vector space (e.g. an inner product or a
metric), our intuition gained from the 3-dimensional world around us may
be wrong.

Example: a sphere of radius r = 1. What is the fraction of the volume of
the sphere in a D-dimensional space which lies between radius r = 1 and
r = 1 - \epsilon?

The volume scales like r^D, therefore the volume of a sphere is
V_D(r) = K_D r^D, and

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

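The shrinking interior is easy to tabulate (a minimal sketch):

    # fraction of the unit sphere's volume within eps of the surface: 1 - (1 - eps)^D
    for D in (1, 2, 5, 20):
        print(D, [round(1 - (1 - eps)**D, 3) for eps in (0.01, 0.1, 0.5)])
    # for D = 20, a shell of thickness eps = 0.1 already contains ~88% of the volume
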
Curse of Dimensionality

The fraction of the volume of the sphere in a D-dimensional space which
lies between radius r = 1 and r = 1 - \epsilon is

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

[Figure: this volume fraction as a function of \epsilon \in [0, 1] for
D = 1, 2, 5 and 20; the larger D, the steeper the curve rises.]

Curse of Dimensionality

[Figure: probability density p(r) with respect to the radius r of a
Gaussian distribution, for dimensionality D = 1, 2 and 20, plotted for
r from 0 to 4; the probability mass moves to larger radii as D grows.]

Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution. Example: D = 2; assume \mu = 0 and \Sigma = I:

    \mathcal{N}(x \mid 0, I) = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2} x^T x\right)
                             = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2} (x_1^2 + x_2^2)\right)

Coordinate transformation:

    x_1 = r \cos(\phi), \qquad x_2 = r \sin(\phi)

The probability density in the new coordinates is

    p(r, \phi \mid 0, I) = \mathcal{N}(x(r, \phi) \mid 0, I) \, |J|

where |J| = r is the determinant of the Jacobian of the coordinate
transformation. Hence

    p(r, \phi \mid 0, I) = \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right)

Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution for D = 2 (and \mu = 0, \Sigma = I):

    p(r, \phi \mid 0, I) = \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right)

Integrating over all angles \phi gives

    p(r \mid 0, I) = \int_0^{2\pi} \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right) d\phi
                   = r \exp\left(-\tfrac{1}{2} r^2\right)

[Figure: p(r) for D = 1, 2 and 20, for r from 0 to 4, as on the previous
slide.]

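The concentration effect can also be checked by sampling: the radius of a
standard D-dimensional Gaussian concentrates around \sqrt{D} (a sketch;
for D = 2 a histogram of r would match r \exp(-r^2/2)):

    import numpy as np

    rng = np.random.default_rng(4)
    for D in (1, 2, 20):
        r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
        print(D, r.mean(), np.sqrt(D))   # mean radius approaches sqrt(D) for large D
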