Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

Part VI
Linear Regression 2

Bayesian Regression
Sequential Update of the Posterior
Predictive Distribution
Proof of the Predictive Distribution
Predictive Distribution with Simplified Prior
Limitations of Linear Basis Function Models

Bayesian Regression

[Figure: Training phase: training data x and training targets t are fed
into a model with adjustable parameter w, and the most appropriate w* is
fixed. Test phase: test data x is fed into the model with the fixed
parameter w* to produce the test target t.]

Bayesian Regression

Bayes' theorem: the posterior is the likelihood times the prior over the
normalisation,

    p(w | t) = \frac{p(t | w) \, p(w)}{p(t)}

Likelihood for i.i.d. data:

    p(t | w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1})
             = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})
             = \mathrm{const} \times \exp\{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)\}

where we have left out the conditioning on x (always assumed) and on
\beta, which is assumed to be constant.

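A minimal numpy sketch of this log likelihood (the function name and the
precomputed design matrix Phi are illustrative assumptions, not from the
slides):

    import numpy as np

    def log_likelihood(w, Phi, t, beta):
        # log p(t | w) = sum_n log N(t_n | w^T phi(x_n), beta^{-1})
        N = len(t)
        resid = t - Phi @ w
        return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid
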
How to choose a prior?

Can we find a prior for the given likelihood which
- makes sense for the problem at hand?
- allows us to find a posterior in a 'nice' form?

An answer to the second question:

Definition (Conjugate Prior). A class of prior probability distributions
p(w) is conjugate to a class of likelihood functions p(x | w) if the
resulting posterior distributions p(w | x) are in the same family as p(w).

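As a concrete illustration of conjugacy (a sketch, not from the slides):
a Bernoulli likelihood with a Beta(a, b) prior yields a Beta posterior,
with the observed counts simply added to the prior parameters.

    def beta_bernoulli_update(a, b, data):
        # Beta(a, b) prior with Bernoulli observations gives a Beta posterior:
        # each observed 1 adds to a, each observed 0 adds to b.
        k = sum(data)
        return a + k, b + len(data) - k

    # Uniform prior Beta(1, 1) and observations [1, 0, 1, 1] give Beta(4, 2).
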
Examples of Conjugate Prior Distributions

Table: Discrete likelihood distributions

    Likelihood    Conjugate Prior
    Bernoulli     Beta
    Binomial      Beta
    Poisson       Gamma
    Multinomial   Dirichlet

Table: Continuous likelihood distributions

    Likelihood            Conjugate Prior
    Uniform               Pareto
    Exponential           Gamma
    Normal                Normal
    Multivariate normal   Multivariate normal

Conjugate Prior to a Gaussian Distribution

Example: The Gaussian family is conjugate to itself with respect to a
Gaussian likelihood function: if the likelihood function is Gaussian,
choosing a Gaussian prior will ensure that the posterior distribution is
also Gaussian.

Given a marginal distribution for x and a conditional Gaussian
distribution for y given x in the form

    p(x)     = \mathcal{N}(x \mid \mu, \Lambda^{-1})
    p(y | x) = \mathcal{N}(y \mid A x + b, L^{-1})

we get

    p(y)     = \mathcal{N}(y \mid A\mu + b, L^{-1} + A \Lambda^{-1} A^T)
    p(x | y) = \mathcal{N}(x \mid \Sigma \{A^T L (y - b) + \Lambda \mu\}, \Sigma)

where \Sigma = (\Lambda + A^T L A)^{-1}.

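These identities can be spot-checked by sampling; the following numpy
sketch compares the empirical mean and covariance of y with the closed
form (all dimensions and parameter values are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 2.0]); Lam = np.eye(2)        # p(x) = N(x | mu, Lam^{-1})
    A = np.array([[1.0, 0.5]]); b = np.array([0.3])   # p(y | x) = N(y | Ax + b, L^{-1})
    L = np.eye(1)

    x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=200_000)
    noise = rng.multivariate_normal(np.zeros(1), np.linalg.inv(L), size=200_000)
    y = x @ A.T + b + noise

    print(y.mean(axis=0), A @ mu + b)                                    # means agree
    print(np.cov(y.T), np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)  # covariances agree
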
Bayesian Regression

Choose a Gaussian prior with mean m_0 and covariance S_0:

    p(w) = \mathcal{N}(w \mid m_0, S_0)

After having seen N training data pairs (x_n, t_n), the posterior for the
given likelihood is

    p(w | t) = \mathcal{N}(w \mid m_N, S_N)

where

    m_N      = S_N (S_0^{-1} m_0 + \beta \Phi^T t)
    S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi

The posterior is Gaussian, therefore mode = mean: the maximum posterior
weight vector is w_{MAP} = m_N.

For an infinitely broad prior S_0 = \alpha^{-1} I with \alpha \to 0, the
mean reduces to the maximum likelihood solution w_{ML}.

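A direct numpy translation of these update equations (a sketch; the
function name is an assumption):

    import numpy as np

    def posterior(Phi, t, m0, S0, beta):
        # p(w | t) = N(w | mN, SN) for a Gaussian prior N(w | m0, S0)
        S0_inv = np.linalg.inv(S0)
        SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
        mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
        return mN, SN
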
Bayesian Regression

If we have not yet seen any data point (N = 0), the posterior is equal to
the prior.

Sequential arrival of data points: the posterior distribution calculated
after the arrival of each data point and target value acts as the prior
distribution for the subsequent data point. This fits nicely into a
sequential learning framework.

Bayesian Regression

In the remainder we use a special simplified prior with m_0 = 0 and
S_0 = \alpha^{-1} I:

    p(w | \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)        (1)

The parameters of the posterior distribution
p(w | t) = \mathcal{N}(w \mid m_N, S_N) are now

    m_N      = \beta S_N \Phi^T t
    S_N^{-1} = \alpha I + \beta \Phi^T \Phi

For \alpha \to 0 we get

    m_N \to w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t

The log of the posterior is the sum of the log likelihood and the log of
the prior:

    \ln p(w | t) = -\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)
                   - \tfrac{\alpha}{2} w^T w + \mathrm{const}

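A quick numerical check of the \alpha \to 0 limit (a sketch with made-up
data):

    import numpy as np

    rng = np.random.default_rng(1)
    Phi = rng.normal(size=(50, 3))     # made-up design matrix
    t = rng.normal(size=50)            # made-up targets
    beta = 1.0
    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

    for alpha in (1.0, 1e-3, 1e-8):
        SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
        mN = beta * SN @ Phi.T @ t
        print(alpha, np.linalg.norm(mN - w_ml))   # distance shrinks as alpha -> 0
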
Bayesian Regression

The log of the posterior is the sum of the log likelihood and the log of
the prior:

    \ln p(w | t) = \underbrace{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)}_{\text{sum-of-squares error}}
                   \underbrace{-\tfrac{\alpha}{2} w^T w}_{\text{quadratic regulariser}}
                   + \mathrm{const}

Maximising the posterior distribution with respect to w corresponds to
minimising the sum-of-squares error function with the addition of a
quadratic regularisation term with \lambda = \alpha / \beta.

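This equivalence is easy to verify numerically (a sketch with made-up
data; lam plays the role of \lambda = \alpha / \beta):

    import numpy as np

    rng = np.random.default_rng(2)
    Phi = rng.normal(size=(30, 4)); t = rng.normal(size=30)
    alpha, beta = 2.0, 25.0

    mN = beta * np.linalg.solve(alpha * np.eye(4) + beta * Phi.T @ Phi, Phi.T @ t)
    lam = alpha / beta
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ t)
    print(np.allclose(mN, w_ridge))    # True: MAP mean == regularised least squares
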
Sequential Update of the Posterior

Example of a linear basis function model:
- single input x, single output t
- linear model y(x, w) = w_0 + w_1 x

Data creation:
1. Choose x_n from the uniform distribution U(x | -1, 1).
2. Calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = -0.3 and a_1 = 0.5.
3. Add Gaussian noise with standard deviation \sigma = 0.2:
   t_n \sim \mathcal{N}(f(x_n, a), 0.04)

Set the precision of the Gaussian prior over w to \alpha = 2.0.

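A sketch of this data-generation and sequential-update loop (plotting
omitted; the loop length of 20 points is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(3)
    a0, a1, sigma, alpha = -0.3, 0.5, 0.2, 2.0
    beta = 1.0 / sigma**2                        # noise precision (sigma = 0.2 -> beta = 25)

    m = np.zeros(2); S = np.eye(2) / alpha       # prior N(w | 0, alpha^{-1} I)
    for n in range(20):
        x = rng.uniform(-1.0, 1.0)
        t = a0 + a1 * x + rng.normal(0.0, sigma)
        phi = np.array([1.0, x])                 # basis for y(x, w) = w0 + w1*x
        # the current posterior acts as the prior for the next data point
        S_new = np.linalg.inv(np.linalg.inv(S) + beta * np.outer(phi, phi))
        m = S_new @ (np.linalg.inv(S) @ m + beta * t * phi)
        S = S_new
    print(m)                                     # approaches (a0, a1) = (-0.3, 0.5)
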
Sequential Update of the Posterior

[Figures: sequential update of the posterior for this example, shown
before any data arrive and after successive data points; cf. the
corresponding figures in Bishop, "Pattern Recognition and Machine
Learning".]

Predictive Distribution

In the training phase, data \mathbf{x} and targets \mathbf{t} are
provided. In the test phase, a new data value x is given and the
corresponding target value t is asked for.

Bayesian approach: find the probability of the test target t given the
test data x, the training data \mathbf{x} and the training targets
\mathbf{t}:

    p(t \mid x, \mathbf{x}, \mathbf{t})

This is the predictive distribution.

How to calculate the Predictive Distribution?

Introduce the model parameter w via the sum rule:

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, w \mid x, \mathbf{x}, \mathbf{t}) \, dw
                                        = \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw

The test target t depends only on the test data x and the model
parameter w, not on the training data and training targets:

    p(t \mid w, x, \mathbf{x}, \mathbf{t}) = p(t \mid w, x)

The model parameters w are learned from the training data \mathbf{x} and
the training targets \mathbf{t} only:

    p(w \mid x, \mathbf{x}, \mathbf{t}) = p(w \mid \mathbf{x}, \mathbf{t})

Predictive distribution:

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x) \, p(w \mid \mathbf{x}, \mathbf{t}) \, dw

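The integral can also be approximated by Monte Carlo: draw samples of w
from the posterior and then sample t from p(t | w, x). A sketch, assuming
the Gaussian posterior N(w | mN, SN) derived earlier:

    import numpy as np

    def predictive_samples(phi_x, mN, SN, beta, n_samples=5000, seed=0):
        # draw w ~ p(w | data), then t ~ p(t | w, x) = N(w^T phi(x), beta^{-1})
        rng = np.random.default_rng(seed)
        W = rng.multivariate_normal(mN, SN, size=n_samples)
        y = W @ phi_x                                    # y(x, w) for each sample
        return y + rng.normal(0.0, 1.0 / np.sqrt(beta), size=n_samples)
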
Proof of the Predictive Distribution

How do we prove the predictive distribution in the general form

    p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw ?

Convert each conditional probability on the right-hand side into a joint
probability:

    \int p(t \mid w, x, \mathbf{x}, \mathbf{t}) \, p(w \mid x, \mathbf{x}, \mathbf{t}) \, dw
      = \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(w, x, \mathbf{x}, \mathbf{t})}
             \frac{p(w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})} \, dw
      = \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})} \, dw
      = \frac{p(t, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}
      = p(t \mid x, \mathbf{x}, \mathbf{t})

Predictive Distribution with Simplified Prior

Find the predictive distribution

    p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta) \, p(w \mid \mathbf{t}, \alpha, \beta) \, dw

(Remember: the conditioning on the input variables x is often suppressed
to simplify the notation.)

We know the likelihood

    p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})

and the posterior

    p(w \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)

where

    m_N      = \beta S_N \Phi^T t
    S_N^{-1} = \alpha I + \beta \Phi^T \Phi

Predictive Distribution with Simplified Prior

If we do the convolution of the two Gaussians, we get for the predictive
distribution

    p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x))

where the variance \sigma_N^2(x) is given by

    \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x).

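In closed form, the predictive mean and variance are a few lines of numpy
(a sketch, assuming mN and SN computed as above):

    import numpy as np

    def predictive(phi_x, mN, SN, beta):
        # mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)
        mean = mN @ phi_x
        var = 1.0 / beta + phi_x @ SN @ phi_x
        return mean, var
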
Predictive Distribution with Simplified Prior

Example with artificial sinusoidal data from sin(2\pi x) (green) and
added noise, for N = 1, 2, 4 and 25 data points.

[Figures: for each N, the mean of the predictive distribution (red) and
the region of one standard deviation around the mean (red shaded), over
x \in [0, 1] with t \in [-1, 1]; the predictive uncertainty shrinks as N
grows.]

Predictive Distribution with Simplified Prior

Plots of the function y(x, w) using samples from the posterior
distribution over w, for N = 1, 2, 4 and 25 data points.

[Figures: for each N, curves y(x, w) for several samples of w drawn from
the posterior, plotted over x \in [0, 1] with t \in [-1, 1].]

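Curves like these can be generated by drawing weight vectors from the
posterior and plotting y(x, w) = w^T \phi(x); a sketch with Gaussian
basis functions (the basis choice and its parameters are assumptions,
the slides do not specify them):

    import numpy as np

    def gaussian_basis(x, centres, s=0.1):
        # phi(x): bias term plus Gaussian bumps at the given centres
        return np.concatenate(([1.0], np.exp(-(x - centres)**2 / (2.0 * s**2))))

    def sample_curves(xs, mN, SN, centres, n_curves=5, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.multivariate_normal(mN, SN, size=n_curves)   # w ~ p(w | t)
        Phi = np.array([gaussian_basis(x, centres) for x in xs])
        return Phi @ W.T                                     # column j is curve y(x, w_j)
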
Limitations of Linear Basis Function Models

The basis functions \phi_j(x) are fixed before the training data set is
observed.

Curse of dimensionality: the number of basis functions grows rapidly,
often exponentially, with the dimensionality D.

But typical data sets have two nice properties which can be exploited if
the basis functions are not fixed:
- Data lie close to a nonlinear manifold whose intrinsic dimension is
  much smaller than D. We need algorithms which place basis functions
  only where data are (e.g. radial basis function networks, support
  vector machines, relevance vector machines, neural networks).
- Target variables may depend only on a few significant directions
  within the data manifold. We need algorithms which can exploit this
  property (neural networks).

Curse of Dimensionality

Linear algebra allows us to operate in n-dimensional vector spaces using
the intuition from our 3-dimensional world as a vector space. There are
no surprises as long as n is finite.

If we add more structure to a vector space (e.g. an inner product or a
metric), our intuition gained from the 3-dimensional world around us may
be wrong.

Example: a sphere of radius r = 1. What is the fraction of the volume of
the sphere in a D-dimensional space which lies between radius r = 1 and
r = 1 - \epsilon?

The volume scales like r^D, therefore the volume of a sphere is
V_D(r) = K_D r^D, and

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

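The shrinking interior is easy to tabulate (a minimal sketch):

    # fraction of the unit sphere's volume within eps of the surface: 1 - (1 - eps)^D
    for D in (1, 2, 5, 20):
        print(D, [round(1 - (1 - eps)**D, 3) for eps in (0.01, 0.1, 0.5)])
    # for D = 20, a shell of thickness eps = 0.1 already contains ~88% of the volume
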
Curse of Dimensionality

The fraction of the volume of the sphere in a D-dimensional space which
lies between radius r = 1 and r = 1 - \epsilon is

    \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D

[Figure: this volume fraction as a function of \epsilon \in [0, 1] for
D = 1, 2, 5 and 20; the larger D, the steeper the curve rises.]

Curse of Dimensionality

[Figure: probability density p(r) with respect to the radius r of a
Gaussian distribution, for dimensionality D = 1, 2 and 20, plotted for
r from 0 to 4; the probability mass moves to larger radii as D grows.]

Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution. Example: D = 2; assume \mu = 0 and \Sigma = I:

    \mathcal{N}(x \mid 0, I) = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2} x^T x\right)
                             = \frac{1}{2\pi} \exp\left(-\tfrac{1}{2} (x_1^2 + x_2^2)\right)

Coordinate transformation:

    x_1 = r \cos(\phi), \qquad x_2 = r \sin(\phi)

The probability density in the new coordinates is

    p(r, \phi \mid 0, I) = \mathcal{N}(x(r, \phi) \mid 0, I) \, |J|

where |J| = r is the determinant of the Jacobian of the coordinate
transformation. Hence

    p(r, \phi \mid 0, I) = \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right)

Curse of Dimensionality

Probability density with respect to the radius r of a Gaussian
distribution for D = 2 (and \mu = 0, \Sigma = I):

    p(r, \phi \mid 0, I) = \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right)

Integrating over all angles \phi gives

    p(r \mid 0, I) = \int_0^{2\pi} \frac{1}{2\pi} \, r \exp\left(-\tfrac{1}{2} r^2\right) d\phi
                   = r \exp\left(-\tfrac{1}{2} r^2\right)

[Figure: p(r) for D = 1, 2 and 20, for r from 0 to 4, as on the previous
slide.]

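The concentration effect can also be checked by sampling: the radius of a
standard D-dimensional Gaussian concentrates around \sqrt{D} (a sketch;
for D = 2 a histogram of r would match r \exp(-r^2/2)):

    import numpy as np

    rng = np.random.default_rng(4)
    for D in (1, 2, 20):
        r = np.linalg.norm(rng.normal(size=(100_000, D)), axis=1)
        print(D, r.mean(), np.sqrt(D))   # mean radius approaches sqrt(D) for large D
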