Reading: Pattern Recognition and Machine Learning
§3.3 (Bayesian Linear Regression)
Christopher M. Bishop
Introduced by: Yusuke Oda (NAIST)
@odashi_t
Agenda
 3.3 Bayesian Linear Regression
– 3.3.1 Parameter distribution
– 3.3.2 Predictive distribution
– 3.3.3 Equivalent kernel
Bayesian Linear Regression
 Maximum Likelihood (ML)
– The number of basis functions (≃ model complexity)
must be chosen according to the size of the data set.
– A regularization term is added to control model complexity.
– How should we determine
the coefficient of the regularization term?
Bayesian Linear Regression
 Maximum Likelihood (ML)
– Using ML to determine the coefficient of the regularization term
... a bad choice
• This always leads to excessively complex models (= over-fitting):
in the example on the previous slide, λ always becomes 0
when ML is used to determine λ.
– Using independent hold-out data to determine model complexity
(see §1.3)
... computationally expensive
... wasteful of valuable data
Bayesian Linear Regression
 Bayesian treatment of linear regression
– Avoids the over-fitting problem of ML.
– Leads to automatic methods of determining model complexity
using the training data alone.
 What do we do?
– Introduce a prior distribution $p(\mathbf{w})$ and a likelihood $p(\mathbf{t} \mid \mathbf{w})$.
• The model parameters $\mathbf{w}$ are treated as random variables.
– Calculate the posterior distribution
using Bayes' theorem:
$p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$
Note: Marginal / Conditional Gaussians
 Given:
– a marginal Gaussian distribution for $\mathbf{x}$:
$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$   (2.113)
– a conditional Gaussian distribution for $\mathbf{y}$ given $\mathbf{x}$:
$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$   (2.114)
 Then:
– the marginal distribution of $\mathbf{y}$ is
$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}})$   (2.115)
– the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ is
$p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma})$   (2.116)
where $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}$   (2.117)
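As a quick sanity check of (2.115), here is a minimal NumPy sketch (ours, not from the slides, with arbitrary illustrative parameters) comparing the closed-form marginal of $\mathbf{y}$ against Monte Carlo samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed for this check, not from the slides)
mu = np.array([1.0, -1.0])              # mean of p(x)
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])            # precision of p(x)
A = np.array([[1.0, 2.0]])              # p(y|x) = N(y | Ax + b, L^-1)
b = np.array([0.5])
L = np.array([[4.0]])                   # precision of p(y|x)

# Closed form (2.115): p(y) = N(y | A mu + b, L^-1 + A Lam^-1 A^T)
y_mean = (A @ mu + b)[0]
y_var = (np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)[0, 0]

# Monte Carlo comparison (L is 1x1 here, so its inverse is 1 / L[0, 0])
xs = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=200_000)
ys = xs @ A[0] + b[0] + rng.normal(0.0, np.sqrt(1.0 / L[0, 0]), size=200_000)
print(y_mean, ys.mean())    # should agree closely
print(y_var, ys.var())      # should agree closely
```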
Parameter Distribution
 Remember the likelihood function given in §3.1.1:
$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$   (3.10)
(the noise precision $\beta$ is treated as a known parameter)
– This is the exponential of a quadratic function of $\mathbf{w}$.
 The corresponding conjugate prior is therefore given by
a Gaussian distribution:
$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$   (3.48)
Parameter Distribution
 Now, given:
– the likelihood $p(\mathbf{t} \mid \mathbf{w})$ from (3.10)
– the prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
 the posterior distribution follows from (2.116):
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$   (3.49)
where
$\mathbf{m}_N = \mathbf{S}_N\,(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t})$   (3.50)
$\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$   (3.51)
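The update (3.50)-(3.51) is only a few lines of linear algebra. Below is a minimal NumPy sketch with an assumed toy basis $\boldsymbol{\phi}(x) = (1, x)^{\mathrm{T}}$; the helper name is ours, not from the slides:

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) over the weights, per (3.50)-(3.51)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)   # (3.51)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # (3.50)
    return mN, SN

# Tiny usage example: three data points, basis phi(x) = (1, x)^T
Phi = np.array([[1.0, 0.0], [1.0, 0.5], [1.0, 1.0]])
t = np.array([0.1, 0.6, 0.9])
mN, SN = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2), beta=25.0)
print(mN)
```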
Online Learning - Parameter Distribution
 If data points arrive sequentially, the posterior after each point
acts as the prior for the next, and the design matrix has only one row:
$\boldsymbol{\Phi} = \boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$
 Assuming $(\mathbf{x}_n, t_n)$ is the n-th data point, we
obtain the formulas for online learning:
$p(\mathbf{w} \mid t_n) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_n, \mathbf{S}_n)$
where
$\mathbf{m}_n = \mathbf{S}_n\,(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\,\boldsymbol{\phi}(\mathbf{x}_n)\,t_n)$
$\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\,\boldsymbol{\phi}(\mathbf{x}_n)\,\boldsymbol{\phi}(\mathbf{x}_n)^{\mathrm{T}}$
In addition, after processing all N points this recovers
the batch posterior (3.49).
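A sketch of one sequential step under the same toy assumptions as above; feeding the points one at a time reproduces the batch posterior:

```python
import numpy as np

def online_update(m_prev, S_prev, phi_n, t_n, beta):
    """One sequential Bayes step: the previous posterior acts as the prior."""
    S_prev_inv = np.linalg.inv(S_prev)
    S_new = np.linalg.inv(S_prev_inv + beta * np.outer(phi_n, phi_n))
    m_new = S_new @ (S_prev_inv @ m_prev + beta * phi_n * t_n)
    return m_new, S_new

# Feeding the same three points one at a time reproduces the batch posterior.
m, S = np.zeros(2), np.eye(2)     # prior m_0, S_0
data = [(np.array([1.0, 0.0]), 0.1),
        (np.array([1.0, 0.5]), 0.6),
        (np.array([1.0, 1.0]), 0.9)]
for phi_n, t_n in data:
    m, S = online_update(m, S, phi_n, t_n, beta=25.0)
print(m)   # equals mN from the batch computation above
```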
Simple Gaussian Prior - Parameter Distribution
 If the prior distribution is a zero-mean isotropic Gaussian
governed by a single precision parameter $\alpha$:
$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$   (3.52)
 the corresponding posterior distribution is $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
where
$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}$   (3.53)
$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\,\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}$   (3.54)
Relationship with MSSE - Parameter Distribution
 The log of the posterior distribution is given by:
$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w} + \mathrm{const}$   (3.55)
 If the prior distribution is given by (3.52), the following two are
therefore equivalent:
– maximization of (3.55) with respect to $\mathbf{w}$
– minimization of the sum-of-squares error (MSSE) function
with the addition of a quadratic regularization term,
corresponding to $\lambda = \alpha / \beta$
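As a concrete check of this equivalence, here is a small NumPy sketch (ours, not from the slides) confirming that the posterior mean (3.53) coincides with the ridge-regression solution for $\lambda = \alpha/\beta$:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta, M = 2.0, 25.0, 5

Phi = rng.normal(size=(30, M))    # arbitrary design matrix
t = rng.normal(size=30)           # arbitrary targets

SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # (3.54)
mN = beta * SN @ Phi.T @ t                                   # (3.53)

lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(np.allclose(mN, w_ridge))   # True
```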
Example - Parameter Distribution
 Straight-line fitting
– Model function: $y(x, \mathbf{w}) = w_0 + w_1 x$
– True function: $f(x, \mathbf{a}) = a_0 + a_1 x$ with $a_0 = -0.3$, $a_1 = 0.5$
– Error: Gaussian noise with standard deviation 0.2
– Goal: to recover the values of $a_0$ and $a_1$
from such data
– Prior distribution: (3.52) with $\alpha = 2.0$
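A runnable sketch of this experiment, following the setup of PRML Figure 3.7 ($a_0 = -0.3$, $a_1 = 0.5$, noise standard deviation 0.2, $\alpha = 2.0$, $\beta = 25$); the sample size N = 20 is our choice:

```python
import numpy as np

rng = np.random.default_rng(1)
a0, a1, noise_std = -0.3, 0.5, 0.2
alpha, beta = 2.0, 1.0 / noise_std**2     # beta = 25

x = rng.uniform(-1, 1, size=20)
t = a0 + a1 * x + rng.normal(0, noise_std, size=20)

Phi = np.column_stack([np.ones_like(x), x])                  # phi(x) = (1, x)^T
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # (3.54)
mN = beta * SN @ Phi.T @ t                                   # (3.53)
print(mN)   # posterior mean approaches (a0, a1) as N grows
```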
Generalized Gaussian Prior - Parameter Distribution
 We can generalize the Gaussian prior with respect to the exponent $q$:
$p(\mathbf{w} \mid \alpha) = \left[\frac{q}{2}\left(\frac{\alpha}{2}\right)^{1/q}\frac{1}{\Gamma(1/q)}\right]^{M} \exp\left(-\frac{\alpha}{2}\sum_{j}|w_j|^{q}\right)$   (3.56)
 The case $q = 2$ corresponds to the Gaussian,
and only in this case is the prior conjugate to the likelihood (3.10).
Predictive Distribution
 Let's consider making predictions of $t$ directly
for new values of $\mathbf{x}$.
 In order to do so, we need to evaluate the
predictive distribution, typically written:
$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, \mathrm{d}\mathbf{w}$   (3.57)
– This is a marginalization over $\mathbf{w}$
(summing out $\mathbf{w}$).
Predictive Distribution
 The conditional distribution of the target variable is given by (3.8):
$p(t \mid \mathbf{w}, \beta) = \mathcal{N}(t \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \beta^{-1})$
 And the posterior weight distribution is given by (3.49):
$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
 Accordingly, the result of (3.57) follows from (2.115):
$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$   (3.58)
where
$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\,\boldsymbol{\phi}(\mathbf{x})$   (3.59)
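Once $\mathbf{m}_N$ and $\mathbf{S}_N$ are available, evaluating (3.58)-(3.59) is one line each; a minimal sketch (the helper name is ours):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Predictive mean and variance of t at phi(x), per (3.58)-(3.59)."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise term + weight uncertainty
    return mean, var

# Usage with the straight-line posterior above, e.g. at x = 0.3:
# mean, var = predictive(np.array([1.0, 0.3]), mN, SN, beta=25.0)
```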
Predictive Distribution
 Now we discuss the variance (3.59) of the predictive distribution:
– The 1st term is the additive noise,
governed by the parameter $\beta$.
– The 2nd term reflects the uncertainty in $\mathbf{w}$ and depends on
the mapping vector $\boldsymbol{\phi}(\mathbf{x})$ of each data point.
– As additional data points are observed, the posterior distribution
becomes narrower: $\sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x})$
– The 2nd term of (3.59) goes to zero in the limit $N \to \infty$,
so the predictive variance approaches the noise level $1/\beta$.
Example - Predictive Distribution
 Gaussian regression on a sine curve
– Basis functions: 9 Gaussian curves
– The plots show the mean of the predictive distribution
and the standard deviation of the predictive distribution around it.
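A sketch of this experiment (cf. PRML Figure 3.8); the Gaussian basis centers and width below are illustrative assumptions, since the slide does not specify them:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
centers, s = np.linspace(0, 1, 9), 0.1    # 9 Gaussian bases (width assumed)

def phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers)**2 / (2 * s**2))

x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)

Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)   # (3.54)
mN = beta * SN @ Phi.T @ t                                   # (3.53)

x_new = np.linspace(0, 1, 100)
P = phi(x_new)
mean = P @ mN                                                # (3.58)
std = np.sqrt(1 / beta + np.einsum('ij,jk,ik->i', P, SN, P)) # (3.59)
```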
Problem of Localized Basis - Predictive Distribution
 Polynomial regression
 Gaussian regression
Which is better?
Problem of Localized Basis - Predictive Distribution
 If we use localized basis functions such as Gaussians,
then in regions away from the basis function centers
the contribution from the 2nd term in (3.59) goes to zero.
 Accordingly, the predictive variance reduces to just the noise
contribution $1/\beta$: the model is most confident precisely where
it has seen no data, which is not a good result.
Problem of Localized Basis - Predictive Distribution
 This problem (arising from choosing localized basis functions)
can be avoided by adopting an alternative Bayesian approach
to regression known as a Gaussian process.
– See §6.4.
Case of Unknown Precision - Predictive Distribution
 If both $\mathbf{w}$ and $\beta$ are treated as unknown, then
we can introduce a conjugate Gaussian-gamma prior, whose
corresponding posterior is also Gaussian-gamma:
$p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta \mid a_0, b_0)$
 The resulting predictive distribution is then a Student's
t-distribution.
Equivalent Kernel
 If we substitute the posterior mean solution (3.53) into the
expression (3.3), the predictive mean can be written:
$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}) = \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} = \sum_{n=1}^{N} \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\,t_n$   (3.60)
 This takes the form of a linear combination of the $t_n$:
$y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\,t_n$   (3.61)
Equivalent Kernel
 The coefficient of each $t_n$ is given by the function
$k(\mathbf{x}, \mathbf{x}') = \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\,\boldsymbol{\phi}(\mathbf{x}')$   (3.62)
 This function is called the smoother matrix or equivalent kernel.
 Regression functions that make predictions by taking linear
combinations of the training-set target values are known as
linear smoothers.
 We can also predict $t$ for a new input vector $\mathbf{x}$ using the
equivalent kernel directly, instead of calculating the parameters
of the basis functions.
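A self-contained sketch (ours) that builds the equivalent kernel (3.62) on the same illustrative sine-curve setup and checks that the kernel prediction (3.61) matches the posterior-mean prediction (3.60):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
centers, s = np.linspace(0, 1, 9), 0.1    # assumed Gaussian basis

def phi(x):
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - centers)**2 / (2 * s**2))

x = rng.uniform(0, 1, size=25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=25)
Phi = phi(x)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

def k(xa, xb):
    """Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x')   (3.62)"""
    return beta * phi(xa) @ SN @ phi(xb).T

x_new = 0.5
y_kernel = (k(x_new, x) @ t).item()   # sum_n k(x, x_n) t_n   (3.61)
y_mean = (phi(x_new) @ mN).item()     # m_N^T phi(x)          (3.60)
print(np.isclose(y_kernel, y_mean))   # True
print(k(x_new, x).sum())              # close to 1, cf. (3.64) below
```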
Example 1 - Equivalent Kernel
 Equivalent kernel for Gaussian regression
 The equivalent kernel depends on the set of basis functions
and on the data set.
Equivalent Kernel
 The equivalent kernel expresses the contribution of each data point
to the predictive mean: data points close to $\mathbf{x}$ make a large
contribution, distant ones a small contribution.
 The covariance between $y(\mathbf{x})$ and $y(\mathbf{x}')$ can be expressed
in terms of the equivalent kernel:
$\mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \beta^{-1}\,k(\mathbf{x}, \mathbf{x}')$   (3.63)
Properties of Equivalent Kernel - Equivalent Kernel
 The equivalent kernel has a localization property even when the
basis functions themselves are not localized
(e.g. polynomial and sigmoidal bases).
 The equivalent kernel sums to 1 for all $\mathbf{x}$:
$\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1$   (3.64)
Example 2 - Equivalent Kernel
 Equivalent kernel for polynomial regression
– Moving parameter:
Properties of Equivalent Kernel - Equivalent Kernel
 The equivalent kernel satisfies an important property shared by
kernel functions in general:
– A kernel function can be expressed in the form of an inner product with
respect to a vector $\boldsymbol{\psi}(\mathbf{x})$ of nonlinear functions:
$k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\psi}(\mathbf{z})$   (3.65)
– In the case of the equivalent kernel, $\boldsymbol{\psi}(\mathbf{x})$ is given by:
$\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\,\mathbf{S}_N^{1/2}\,\boldsymbol{\phi}(\mathbf{x})$
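A small numerical check (ours) of this factorization, using an assumed polynomial basis and an eigendecomposition for the matrix square root $\mathbf{S}_N^{1/2}$:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta, M = 2.0, 25.0, 4

def phi(x):
    return x ** np.arange(M)   # polynomial basis (illustrative assumption)

X = rng.uniform(-1, 1, size=10)
Phi = np.stack([phi(xi) for xi in X])
SN = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

# Matrix square root of S_N via eigendecomposition (S_N is symmetric PD)
w_eig, V = np.linalg.eigh(SN)
SN_half = V @ np.diag(np.sqrt(w_eig)) @ V.T

def psi(x):
    return np.sqrt(beta) * SN_half @ phi(x)

xa, xb = 0.3, -0.7
k_direct = beta * phi(xa) @ SN @ phi(xb)   # (3.62)
k_inner = psi(xa) @ psi(xb)                # (3.65)
print(np.isclose(k_direct, k_inner))       # True
```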
Thank you!
zzz...