# Pattern Recognition and Machine Learning: Section 3.3

Material used in a reading group on "Pattern Recognition and Machine Learning".

1. Reading Pattern Recognition and Machine Learning §3.3 (Bayesian Linear Regression), by Christopher M. Bishop. Introduced by: Yusuke Oda (NAIST), @odashi_t. 2013/6/5. © 2013 Yusuke Oda, AHC-Lab, IS, NAIST.
2. Agenda
   - 3.3 Bayesian Linear Regression
     - 3.3.1 Parameter distribution
     - 3.3.2 Predictive distribution
     - 3.3.3 Equivalent kernel
3. Agenda: 3.3 Bayesian Linear Regression (3.3.1 Parameter distribution, 3.3.2 Predictive distribution, 3.3.3 Equivalent kernel)
4. Bayesian Linear Regression
   - Maximum likelihood (ML)
     - The appropriate number of basis functions (≃ model complexity) depends on the size of the data set.
     - Adding a regularization term controls the model complexity.
     - How should we determine the coefficient $\lambda$ of the regularization term?
5. Bayesian Linear Regression
   - Maximum likelihood (ML)
     - Using ML to determine the coefficient $\lambda$ of the regularization term is a bad choice.
       - It always leads to excessively complex models (over-fitting): in the case of the previous slide, ML always selects $\lambda = 0$.
     - Using independent hold-out data to determine model complexity (see §1.3) is computationally expensive and wasteful of valuable data.
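The claim that ML always picks $\lambda = 0$ can be checked numerically. The following is a minimal sketch, not taken from the slides: the synthetic sine data, the cubic polynomial basis, and the grid of $\lambda$ values are all illustrative assumptions. It fits regularized least squares for each $\lambda$ and compares the training error, which is what choosing $\lambda$ on the training data alone would optimize.

```python
import numpy as np

# Hypothetical data: noisy samples of a sine curve (an assumption for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Cubic polynomial basis 1, x, x^2, x^3.
Phi = np.vander(x, 4, increasing=True)

def training_error(lam):
    """Sum-of-squares training error of the regularized least-squares fit."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    w = np.linalg.solve(A, Phi.T @ t)
    return 0.5 * np.sum((t - Phi @ w) ** 2)

lambdas = np.array([0.0, 1e-3, 1e-2, 1e-1, 1.0])
errors = np.array([training_error(lam) for lam in lambdas])

# The unregularized fit minimizes the training error by construction,
# so selecting lambda on the training data always returns lambda = 0.
best_lambda = lambdas[np.argmin(errors)]
print(best_lambda)
```

Because the unregularized solution is by definition the minimizer of the training error, no positive $\lambda$ can do better on the training set, which is why a separate criterion (hold-out data or the Bayesian treatment) is needed.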
6. Bayesian Linear Regression
   - Bayesian treatment of linear regression
     - Avoids the over-fitting problem of ML.
     - Leads to automatic methods of determining model complexity using the training data alone.
   - What do we do?
     - Introduce the prior distribution $p(\mathbf{w})$ and the likelihood $p(\mathbf{t} \mid \mathbf{w})$.
       - The model parameter $\mathbf{w}$ is treated as a random variable with a probability distribution.
     - Calculate the posterior distribution using Bayes' theorem: $p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})$
7. Agenda: 3.3 Bayesian Linear Regression (3.3.1 Parameter distribution, 3.3.2 Predictive distribution, 3.3.3 Equivalent kernel)
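8. Note: Marginal / Conditional Gaussians
   - Given:
     - Marginal Gaussian distribution for $\mathbf{x}$: $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$
     - Conditional Gaussian distribution for $\mathbf{y}$ given $\mathbf{x}$: $p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1})$
   - Then:
     - Marginal distribution of $\mathbf{y}$: $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^\top)$ (2.115)
     - Conditional distribution of $\mathbf{x}$ given $\mathbf{y}$: $p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^\top \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma})$ (2.116)
     - where $\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^\top \mathbf{L} \mathbf{A})^{-1}$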
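9. Parameter Distribution
   - Remember the likelihood function given in §3.1.1, where the noise precision $\beta$ is a known parameter: $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1})$ (3.10)
     - This is the exponential of a quadratic function of $\mathbf{w}$.
   - The corresponding conjugate prior is a Gaussian distribution: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$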
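10. Parameter Distribution
    - Now given:
      - Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$
      - Likelihood: $p(\mathbf{t} \mid \mathbf{w}) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})$
    - Then the posterior distribution follows from (2.116): $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
      - where $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^\top\mathbf{t})$ and $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$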
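11. Online Learning - Parameter Distribution
    - If data points arrive sequentially, the design matrix has only one row: $\boldsymbol{\Phi} = \boldsymbol{\phi}(\mathbf{x}_n)^\top$
    - Assuming that $(\mathbf{x}_n, t_n)$ is the $n$-th input data point, we obtain the formula for online learning: $p(\mathbf{w} \mid t_1, \dots, t_n) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_n, \mathbf{S}_n)$
      - where $\mathbf{m}_n = \mathbf{S}_n(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\,\boldsymbol{\phi}_n t_n)$, $\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top$, and $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$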
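A minimal sketch of the online update, assuming the standard recursion in which the posterior after $n-1$ points acts as the prior for the $n$-th point: $\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\boldsymbol{\phi}_n\boldsymbol{\phi}_n^\top$ and $\mathbf{m}_n = \mathbf{S}_n(\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\boldsymbol{\phi}_n t_n)$. The random design matrix, the targets, the zero-mean isotropic starting prior, and the function name `online_update` are illustrative assumptions. The sketch checks that processing the points one at a time reproduces the batch posterior.

```python
import numpy as np

def online_update(m_prev, S_prev, phi_n, t_n, beta):
    """One sequential step: the previous posterior N(m_prev, S_prev) acts as the prior."""
    S_prev_inv = np.linalg.inv(S_prev)
    S_new = np.linalg.inv(S_prev_inv + beta * np.outer(phi_n, phi_n))
    m_new = S_new @ (S_prev_inv @ m_prev + beta * t_n * phi_n)
    return m_new, S_new

rng = np.random.default_rng(1)
N, M = 30, 4
alpha, beta = 2.0, 25.0            # illustrative precisions
Phi = rng.normal(size=(N, M))      # hypothetical design matrix, one row per data point
t = rng.normal(size=N)             # hypothetical targets

# Start from the prior N(0, alpha^{-1} I) and absorb one data point at a time.
m, S = np.zeros(M), np.eye(M) / alpha
for phi_n, t_n in zip(Phi, t):
    m, S = online_update(m, S, phi_n, t_n, beta)

# Batch posterior computed from all N points at once.
S_batch = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t

print(np.allclose(m, m_batch), np.allclose(S, S_batch))
```

Inverting the covariance at every step keeps the sketch close to the formulas; a practical implementation would more likely track the precision matrix directly.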
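12. Easy Gaussian Prior - Parameter Distribution
    - If the prior distribution is a zero-mean isotropic Gaussian governed by a single precision parameter $\alpha$: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$ (3.52)
    - The corresponding posterior distribution is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
      - where $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t}$ (3.53) and $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ (3.54)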
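The posterior for a general Gaussian prior and its zero-mean isotropic special case can be sketched in a few lines. This is an illustration under assumptions: the function name `posterior`, the random design matrix, and the values of $\alpha$ and $\beta$ are not from the slides. It uses the standard results $\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ and $\mathbf{m}_N = \mathbf{S}_N(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^\top\mathbf{t})$, then checks that setting $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$ reduces them to the simpler form.

```python
import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Posterior N(w | m_N, S_N) for the Gaussian prior N(w | m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)
    return m_N, S_N

rng = np.random.default_rng(0)
N, M = 20, 3
alpha, beta = 2.0, 25.0            # illustrative precisions
Phi = rng.normal(size=(N, M))      # hypothetical design matrix
t = rng.normal(size=N)             # hypothetical targets

# General formula with the zero-mean isotropic prior plugged in.
m_N, S_N = posterior(Phi, t, beta, np.zeros(M), np.eye(M) / alpha)

# Special-case formulas: S_N^{-1} = alpha I + beta Phi^T Phi, m_N = beta S_N Phi^T t.
S_iso = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_iso = beta * S_iso @ Phi.T @ t

print(np.allclose(m_N, m_iso), np.allclose(S_N, S_iso))
```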
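13. Relationship with MSSE - Parameter Distribution
    - The log of the posterior distribution is $\ln p(\mathbf{w} \mid \mathbf{t}) = \ln p(\mathbf{t} \mid \mathbf{w}) + \ln p(\mathbf{w}) + \text{const}$
    - If the prior distribution is given by (3.52), the result is $\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^\top\mathbf{w} + \text{const}$ (3.55)
    - The following two are equivalent, with $\lambda = \alpha / \beta$:
      - Maximization of (3.55) with respect to $\mathbf{w}$
      - Minimization of the sum-of-squares error (MSSE) function with the addition of a quadratic regularization term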
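The equivalence can be verified numerically: the posterior mean under the zero-mean isotropic prior should coincide with the regularized least-squares solution when $\lambda = \alpha/\beta$. The sketch below uses a random design matrix and targets, which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 25, 5
alpha, beta = 2.0, 25.0            # illustrative precisions
Phi = rng.normal(size=(N, M))      # hypothetical design matrix
t = rng.normal(size=N)             # hypothetical targets

# Posterior mean (which is also the posterior mode) under the prior N(0, alpha^{-1} I).
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))
```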
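14. Example - Parameter Distribution
    - Straight-line fitting
      - Model function: $y(x, \mathbf{w}) = w_0 + w_1 x$
      - True function: $f(x, \mathbf{a}) = a_0 + a_1 x$
      - Error: Gaussian noise added to the target values
      - Goal: to recover the values of $a_0$ and $a_1$ from such data
      - Prior distribution: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$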
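A minimal sketch of the straight-line example. The numeric settings are assumptions, since the slide's own values did not survive in this transcript: they follow the corresponding example in PRML (true parameters $a_0 = -0.3$, $a_1 = 0.5$, noise standard deviation 0.2, $\alpha = 2.0$, inputs drawn uniformly from $[-1, 1]$). The number of data points is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed settings (taken from the PRML version of this example, not from the slide).
a0, a1 = -0.3, 0.5
noise_std = 0.2
alpha = 2.0
beta = 1.0 / noise_std**2          # noise precision

# Synthetic data from the true line plus Gaussian noise.
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
t = a0 + a1 * x + rng.normal(scale=noise_std, size=N)

# Design matrix for y(x, w) = w0 + w1 * x.
Phi = np.column_stack([np.ones(N), x])

# Posterior under the zero-mean isotropic prior.
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

print(m_N)                         # posterior mean, expected to be close to (a0, a1)
print(np.sqrt(np.diag(S_N)))       # posterior standard deviation of each weight
```

With more data points the posterior mean moves toward the true parameters and the posterior standard deviations shrink.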
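15. Generalized Gaussian Prior - Parameter Distribution
    - We can generalize the Gaussian prior by changing the exponent: $p(\mathbf{w} \mid \alpha) = \left[\frac{q}{2}\left(\frac{\alpha}{2}\right)^{1/q}\frac{1}{\Gamma(1/q)}\right]^{M} \exp\left(-\frac{\alpha}{2}\sum_{j=0}^{M-1}|w_j|^q\right)$
    - Here $q = 2$ corresponds to the Gaussian, and only in this case is the prior conjugate to the likelihood (3.10).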
16. Agenda: 3.3 Bayesian Linear Regression (3.3.1 Parameter distribution, 3.3.2 Predictive distribution, 3.3.3 Equivalent kernel)
17. Predictive Distribution
    - Let's consider making predictions of $t$ directly for new values of $\mathbf{x}$.
    - To do so, we need to evaluate the predictive distribution, obtained by marginalizing over (summing out) $\mathbf{w}$.
    - This formula is typically written: $p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w}$ (3.57)
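18. Predictive Distribution
    - The conditional distribution of the target variable is $p(t \mid \mathbf{w}, \beta) = \mathcal{N}(t \mid \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}), \beta^{-1})$
    - The posterior weight distribution is $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$
    - Accordingly, (3.57) evaluates, using (2.115), to $p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^\top\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x}))$
      - where $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$ (3.59)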
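The predictive mean $\mathbf{m}_N^\top\boldsymbol{\phi}(\mathbf{x})$ and variance $\beta^{-1} + \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$ can be sketched as follows. The helper names `gaussian_design` and `predictive`, the basis width, the noisy sine data, and the values of $\alpha$ and $\beta$ are assumptions for illustration.

```python
import numpy as np

def gaussian_design(x, centers, s):
    """Design matrix: a bias column followed by Gaussian basis functions."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    bumps = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)
    return np.hstack([np.ones((x.size, 1)), bumps])

def predictive(x_new, m_N, S_N, beta, centers, s):
    """Mean and variance of the Gaussian predictive distribution at x_new."""
    Phi_new = gaussian_design(x_new, centers, s)
    mean = Phi_new @ m_N
    # phi(x)^T S_N phi(x) for every row of Phi_new.
    var = 1.0 / beta + np.sum((Phi_new @ S_N) * Phi_new, axis=1)
    return mean, var

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0                           # illustrative precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1        # 9 Gaussian basis functions

# Hypothetical training data: noisy samples of a sine curve.
x_train = rng.uniform(0.0, 1.0, size=25)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)

Phi = gaussian_design(x_train, centers, s)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

mean, var = predictive(np.linspace(0.0, 1.0, 5), m_N, S_N, beta, centers, s)
print(mean)
print(np.sqrt(var))                               # predictive standard deviation
```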
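19. Predictive Distribution
    - Now we discuss the variance of the predictive distribution: $\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})$ (3.59)
      - The 1st term is the additive noise governed by the parameter $\beta$.
      - The 2nd term depends on the mapping vector $\boldsymbol{\phi}(\mathbf{x})$ of each data point $\mathbf{x}$.
    - As additional data points are observed, the posterior distribution becomes narrower: $\sigma_{N+1}^2(\mathbf{x}) \le \sigma_N^2(\mathbf{x})$
    - The 2nd term of (3.59) goes to zero in the limit $N \to \infty$.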
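The narrowing of the predictive variance can be observed directly, because the variance depends only on the inputs and not on the target values. The sketch below is illustrative: the Gaussian basis, the random training inputs, the query point, and the precisions are assumptions. It recomputes the variance at a fixed query point as training inputs are added one by one.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0                           # illustrative precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1        # 9 Gaussian basis functions

def phi(x):
    """Feature vector of Gaussian basis functions (no bias term) at a scalar x."""
    return np.exp(-0.5 * ((x - centers) / s) ** 2)

x_train = rng.uniform(0.0, 1.0, size=50)          # hypothetical training inputs
x_query = 0.5

variances = []
for n in range(1, x_train.size + 1):
    Phi = np.array([phi(x) for x in x_train[:n]])
    S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
    variances.append(1.0 / beta + phi(x_query) @ S_N @ phi(x_query))
variances = np.array(variances)

# The sequence never increases and stays above the noise floor 1 / beta.
print(variances[0], variances[-1], 1.0 / beta)
```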
20. Predictive Distribution
21. Example - Predictive Distribution
    - Gaussian regression with a sine curve
      - Basis functions: 9 Gaussian curves
    - Figure legend: mean of the predictive distribution; standard deviation of the predictive distribution
22. Example - Predictive Distribution
    - Gaussian regression with a sine curve
23. Example - Predictive Distribution
    - Gaussian regression with a sine curve
24. Problem of Localized Basis - Predictive Distribution
    - Polynomial regression
    - Gaussian regression
    - Which is better?
25. Problem of Localized Basis - Predictive Distribution
    - If we use localized basis functions such as Gaussians, then in regions away from the basis function centers the contribution from the 2nd term in (3.59) goes to zero.
    - Accordingly, the predictive variance reduces to the noise contribution $\beta^{-1}$ alone. This is not a good result: the model becomes very confident when extrapolating outside the region covered by the basis functions.
    - Figure labels: large contribution; small contribution
26. Problem of Localized Basis - Predictive Distribution
    - This problem (arising from choosing localized basis functions) can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process.
      - See §6.4.
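The localized-basis problem is easy to reproduce. In this minimal sketch (the basis width, training inputs, and precisions are illustrative assumptions, and no bias term is included in the basis), the predictive variance is evaluated once inside the region covered by the Gaussian basis functions and once far outside it.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 2.0, 25.0                           # illustrative precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1        # Gaussian basis functions on [0, 1]

def phi(x):
    """Feature vector of Gaussian basis functions (no bias term) at a scalar x."""
    return np.exp(-0.5 * ((x - centers) / s) ** 2)

x_train = rng.uniform(0.0, 1.0, size=25)          # hypothetical training inputs
Phi = np.array([phi(x) for x in x_train])
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)

def predictive_variance(x):
    """Noise term 1 / beta plus the weight-uncertainty term phi(x)^T S_N phi(x)."""
    return 1.0 / beta + phi(x) @ S_N @ phi(x)

var_inside = predictive_variance(0.5)   # within the region covered by the basis
var_far = predictive_variance(5.0)      # far from every basis function center

# Far away every basis function vanishes, so only the noise term 1 / beta remains.
print(var_inside, var_far, 1.0 / beta)
```

The model therefore reports less uncertainty far from the data than inside the data region, which is the undesirable behaviour the slides describe. If a constant bias function were added to the basis, the far-away variance would instead settle at $\beta^{-1}$ plus the posterior variance of the bias weight.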
27. Case of Unknown Precision - Predictive Distribution
    - If both $\mathbf{w}$ and $\beta$ are treated as unknown, we can introduce a conjugate prior distribution, and the corresponding posterior distribution, in the form of a Gaussian-gamma distribution: $p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta \mid a_0, b_0)$
    - The predictive distribution is then a Student's t-distribution.
28. Agenda: 3.3 Bayesian Linear Regression (3.3.1 Parameter distribution, 3.3.2 Predictive distribution, 3.3.3 Equivalent kernel)
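29. Equivalent Kernel
    - If we substitute the posterior mean solution (3.53) into the expression (3.3), the predictive mean can be written: $y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^\top\boldsymbol{\phi}(\mathbf{x}) = \beta\,\boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{t} = \sum_{n=1}^{N}\beta\,\boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\,t_n$
    - This can be viewed as a linear combination of the training targets $t_n$: $y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\,t_n$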
30. Equivalent Kernel
    - The coefficient of each $t_n$ is given by $k(\mathbf{x}, \mathbf{x}') = \beta\,\boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')$
    - This function is called the smoother matrix or equivalent kernel.
    - Regression functions which make predictions by taking linear combinations of the training set target values are known as linear smoothers.
    - We can also predict $t$ for a new input vector $\mathbf{x}$ using the equivalent kernel, instead of calculating the parameters of the basis functions.
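The two routes to the predictive mean, through the weights $\mathbf{m}_N$ or through the equivalent kernel $k(\mathbf{x}, \mathbf{x}') = \beta\,\boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')$, can be compared numerically. The sketch below assumes Gaussian basis functions, noisy sine data, and illustrative precisions; the helper names `design` and `equivalent_kernel` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta = 2.0, 25.0                           # illustrative precisions
centers, s = np.linspace(0.0, 1.0, 9), 0.1        # 9 Gaussian basis functions

def design(x):
    """Design matrix of Gaussian basis functions, one row per input."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)

# Hypothetical training data: noisy samples of a sine curve.
x_train = rng.uniform(0.0, 1.0, size=30)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=x_train.size)

Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(centers.size) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

def equivalent_kernel(x, x_prime):
    """k(x, x') = beta * phi(x)^T S_N phi(x'), evaluated for all pairs."""
    return beta * design(x) @ S_N @ design(x_prime).T

x_new = np.linspace(0.0, 1.0, 7)
K = equivalent_kernel(x_new, x_train)             # one row of weights per new input

y_kernel = K @ t_train                            # linear smoother: sum_n k(x, x_n) t_n
y_weights = design(x_new) @ m_N                   # usual route through the weights
print(np.allclose(y_kernel, y_weights))
```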
31. Example 1 - Equivalent Kernel
    - Equivalent kernel with Gaussian regression
    - The equivalent kernel depends on the set of basis functions and on the data set.
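32. Equivalent Kernel
    - The equivalent kernel expresses the contribution of each data point to the predictive mean.
    - The covariance between $y(\mathbf{x})$ and $y(\mathbf{x}')$ can be expressed through the equivalent kernel: $\mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \boldsymbol{\phi}(\mathbf{x})^\top\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}') = \beta^{-1}k(\mathbf{x}, \mathbf{x}')$
    - Figure labels: large contribution; small contribution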
33. Properties of Equivalent Kernel - Equivalent Kernel
    - The equivalent kernel has a localization property even when the basis functions themselves are not localized.
    - The equivalent kernel sums to 1 over the training points for all $\mathbf{x}$: $\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1$
    - Figure panels: Polynomial; Sigmoid
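The sum-to-one property can be checked numerically under stated assumptions. In this sketch the basis is a cubic polynomial that includes the constant function, and the prior precision $\alpha$ is chosen very small; with those choices the kernel weights sum to 1 up to a correction proportional to $\alpha$. With a larger $\alpha$, or a basis whose span does not contain the constant function, the sum is only approximately 1. All numeric settings are illustrative.

```python
import numpy as np

alpha, beta = 1e-6, 25.0                  # assumption: nearly flat prior
x_train = np.linspace(-1.0, 1.0, 50)      # hypothetical training inputs

def design(x):
    """Cubic polynomial basis 1, x, x^2, x^3 (includes the constant function)."""
    return np.vander(np.atleast_1d(x), 4, increasing=True)

Phi = design(x_train)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)

x_query = np.linspace(-1.0, 1.0, 11)
K = beta * design(x_query) @ S_N @ Phi.T  # k(x, x_n) for every query/training pair

row_sums = K.sum(axis=1)                  # sum over n of k(x, x_n), one value per query
print(row_sums)
```

Individual kernel weights can be negative, so the prediction is a weighted combination of targets but not necessarily a convex one.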
34. Example 2 - Equivalent Kernel
    - Equivalent kernel with polynomial regression
      - Moving parameter:
35. Example 2 - Equivalent Kernel
    - Equivalent kernel with polynomial regression
      - Moving parameter:
36. Example 2 - Equivalent Kernel
    - Equivalent kernel with polynomial regression
      - Moving parameter:
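37. Properties of Equivalent Kernel - Equivalent Kernel
    - The equivalent kernel satisfies an important property shared by kernel functions in general:
      - A kernel function can be expressed as an inner product with respect to a vector $\boldsymbol{\psi}(\mathbf{x})$ of nonlinear functions: $k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^\top\boldsymbol{\psi}(\mathbf{z})$
      - In the case of the equivalent kernel, $\boldsymbol{\psi}(\mathbf{x})$ is given by $\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\,\mathbf{S}_N^{1/2}\,\boldsymbol{\phi}(\mathbf{x})$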
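The inner-product form can be confirmed with a symmetric matrix square root of $\mathbf{S}_N$, using $\boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\mathbf{S}_N^{1/2}\boldsymbol{\phi}(\mathbf{x})$. The sketch below is illustrative: the random design matrix and feature vectors stand in for real basis function values, and the square root is computed by eigendecomposition, which is one possible choice among several.

```python
import numpy as np

rng = np.random.default_rng(9)
alpha, beta = 2.0, 25.0                   # illustrative precisions
N, M = 30, 5
Phi = rng.normal(size=(N, M))             # hypothetical design matrix
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

# Symmetric square root of S_N via its eigendecomposition (S_N is positive definite).
eigvals, eigvecs = np.linalg.eigh(S_N)
S_half = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

def psi(phi_x):
    """psi(x) = beta^{1/2} S_N^{1/2} phi(x)."""
    return np.sqrt(beta) * S_half @ phi_x

# Hypothetical feature vectors phi(x) and phi(z).
phi_x, phi_z = rng.normal(size=M), rng.normal(size=M)

k_direct = beta * phi_x @ S_N @ phi_z     # k(x, z) from its definition
k_inner = psi(phi_x) @ psi(phi_z)         # k(x, z) as an inner product
print(np.isclose(k_direct, k_inner))
```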
38. Thank you!