Pattern Recognition and Machine Learning: Section 3.3

Slides used for a reading-group presentation on "Pattern Recognition and Machine Learning" (Bishop).


1. Reading "Pattern Recognition and Machine Learning" §3.3 (Bayesian Linear Regression)
   Christopher M. Bishop
   Introduced by: Yusuke Oda (NAIST) @odashi_t
   2013/6/5 © Yusuke Oda, AHC-Lab, IS, NAIST
2. Agenda
   3.3 Bayesian Linear Regression
   - 3.3.1 Parameter distribution
   - 3.3.2 Predictive distribution
   - 3.3.3 Equivalent kernel
3. Agenda (section divider: 3.3 Bayesian Linear Regression)
4. Bayesian Linear Regression
   Maximum Likelihood (ML)
   - The number of basis functions (≃ model complexity) must be limited according to the size of the data set.
   - A regularization term is added to control model complexity.
   - How should we determine the coefficient of the regularization term?
5. Bayesian Linear Regression
   Maximum Likelihood (ML)
   - Using ML to determine the coefficient of the regularization term is a bad choice:
     it always leads to excessively complex models (= over-fitting).
     (In the case of the previous slide, λ always becomes 0 when ML is used to determine λ.)
   - Using independent hold-out data to determine model complexity (see §1.3) is
     computationally expensive and wasteful of valuable data.
6. Bayesian Linear Regression
   Bayesian treatment of linear regression
   - Avoids the over-fitting problem of ML.
   - Leads to automatic methods of determining model complexity using the training data alone.
   What do we do?
   - Introduce the prior distribution p(w) and likelihood p(t|w).
     (Treat the model parameters w as random variables.)
   - Calculate the posterior distribution using Bayes' theorem:
     p(w|t) ∝ p(t|w) p(w)
7. Agenda (section divider; next: 3.3.1 Parameter distribution)
8. Note: Marginal / Conditional Gaussians
   Given a marginal Gaussian distribution for x and a conditional Gaussian
   distribution for y given x:
     p(x) = N(x | μ, Λ⁻¹)
     p(y|x) = N(y | Ax + b, L⁻¹)
   Then the marginal distribution of y and the conditional distribution of x
   given y are:
     p(y) = N(y | Aμ + b, L⁻¹ + AΛ⁻¹Aᵀ)
     p(x|y) = N(x | Σ{AᵀL(y − b) + Λμ}, Σ)
   where Σ = (Λ + AᵀLA)⁻¹.
9. Parameter Distribution
   Remember the likelihood function given in §3.1.1:
     p(t | w) = ∏ₙ N(tₙ | wᵀφ(xₙ), β⁻¹)   (β: known precision parameter)
   - This is the exponential of a quadratic function of w.
   The corresponding conjugate prior is therefore a Gaussian distribution:
     p(w) = N(w | m₀, S₀)
10. Parameter Distribution
   Now given:
     p(w) = N(w | m₀, S₀)
     p(t | w) = ∏ₙ N(tₙ | wᵀφ(xₙ), β⁻¹)
   Then the posterior distribution is obtained using (2.116):
     p(w | t) = N(w | m_N, S_N)
   where
     m_N = S_N (S₀⁻¹ m₀ + β Φᵀ t)
     S_N⁻¹ = S₀⁻¹ + β Φᵀ Φ
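As a sketch, this posterior update fits in a few lines of NumPy (the names `m0`, `S0`, `beta` follow the slide's notation; this is an illustrative implementation, not the presenter's code):

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Posterior N(w | m_N, S_N) over the weights, following (3.50)-(3.51):
    S_N^{-1} = S_0^{-1} + beta * Phi^T Phi, m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN
```

With a very large β (very low observation noise) the posterior mean collapses onto the data, as expected.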
11. Online Learning - Parameter Distribution
   If data points arrive sequentially, the design matrix has only one row:
     Φ = φ(xₙ)ᵀ
   Treating the previous posterior as the prior for the n-th data point (xₙ, tₙ),
   we obtain the formula for online learning:
     m_N = S_N (S_{N−1}⁻¹ m_{N−1} + β φ(xₙ) tₙ)
     S_N⁻¹ = S_{N−1}⁻¹ + β φ(xₙ) φ(xₙ)ᵀ
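A minimal sketch of the sequential update (hypothetical helper name `sequential_update`; the old posterior acts as the prior for each new point). Folding the points in one at a time should reproduce the batch posterior exactly:

```python
import numpy as np

def sequential_update(m, S, phi_n, t_n, beta):
    """One online step: prior N(m, S) + data point (phi_n, t_n) -> new posterior."""
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi_n, phi_n))
    m_new = S_new @ (S_inv @ m + beta * t_n * phi_n)
    return m_new, S_new

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))
t = rng.normal(size=10)
alpha, beta = 2.0, 25.0

# Online: start from the isotropic prior and fold in the rows of Phi one by one.
m, S = np.zeros(3), np.eye(3) / alpha
for phi_n, t_n in zip(Phi, t):
    m, S = sequential_update(m, S, phi_n, t_n, beta)

# Batch posterior for comparison, per (3.53)-(3.54).
S_batch = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
```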
12. Easy Gaussian Prior - Parameter Distribution
   If the prior distribution is a zero-mean isotropic Gaussian governed by a
   single precision parameter α:
     p(w | α) = N(w | 0, α⁻¹ I)
   The corresponding posterior distribution is:
     p(w | t) = N(w | m_N, S_N)
   where
     m_N = β S_N Φᵀ t
     S_N⁻¹ = α I + β Φᵀ Φ
13. Relationship with MSSE - Parameter Distribution
   The log of the posterior distribution is given by:
     ln p(w | t) = −(β/2) Σₙ (tₙ − wᵀφ(xₙ))² − (α/2) wᵀw + const
   If the prior distribution is given by (3.52), the following are equivalent:
   - Maximization of (3.55) with respect to w
   - Minimization of the sum-of-squares error (MSSE) function with the addition
     of a quadratic regularization term, with coefficient λ = α/β
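This equivalence can be checked numerically: the posterior mean under the isotropic prior equals the ridge-regression (regularized least-squares) solution with λ = α/β. A sketch on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))   # design matrix
t = rng.normal(size=20)          # targets
alpha, beta = 2.0, 25.0

# MAP estimate = posterior mean under the isotropic Gaussian prior.
SN = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta.
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)
```

Algebraically the two are identical: β(αI + βΦᵀΦ)⁻¹Φᵀt = (ΦᵀΦ + (α/β)I)⁻¹Φᵀt.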
14. Example - Parameter Distribution
   Straight-line fitting
   - Model function: y(x, w) = w₀ + w₁x
   - True function: f(x, a) = a₀ + a₁x
   - Error: additive Gaussian noise
   - Goal: to recover the values of a₀, a₁ from such data
   - Prior distribution: zero-mean isotropic Gaussian
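A quick numerical sketch of this experiment. The true values a₀ = −0.3, a₁ = 0.5 and the noise/precision settings below match the book's figure for this example; treat them as assumptions here, since the slide does not state them:

```python
import numpy as np

rng = np.random.default_rng(42)
a0, a1 = -0.3, 0.5        # assumed true parameters (the book's example values)
alpha, beta = 2.0, 25.0   # prior precision; noise precision (sigma = 0.2)

x = rng.uniform(-1.0, 1.0, size=100)
t = a0 + a1 * x + rng.normal(scale=0.2, size=100)

Phi = np.column_stack([np.ones_like(x), x])   # basis: [1, x]
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t                    # posterior mean, approximately (a0, a1)
```

With 100 noisy points the posterior mean lands close to the true (a₀, a₁), and the posterior covariance S_N quantifies the remaining uncertainty.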
15. Generalized Gaussian Prior - Parameter Distribution
   We can generalize the Gaussian prior with respect to the exponent q:
     p(w | α) ∝ exp(−(α/2) Σⱼ |wⱼ|^q)
   The case q = 2 corresponds to the Gaussian, and only for q = 2 is the prior
   conjugate to the likelihood (3.10).
16. Agenda (section divider; next: 3.3.2 Predictive distribution)
17. Predictive Distribution
   Let us consider making predictions of t directly for new values of x.
   In order to do so, we need to evaluate the predictive distribution:
     p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
   This is a marginalization over w (summing out w).
18. Predictive Distribution
   The conditional distribution of the target variable is:
     p(t | x, w, β) = N(t | y(x, w), β⁻¹)
   And the posterior weight distribution is:
     p(w | t) = N(w | m_N, S_N)
   Accordingly, the result of (3.57) is obtained using (2.115):
     p(t | x, t, α, β) = N(t | m_Nᵀ φ(x), σ_N²(x))
   where
     σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x)
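The predictive mean and variance can be sketched as a small helper (hypothetical name `predictive`, reusing m_N and S_N from the posterior computation):

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Predictive N(t | mean, var) at one input: mean = m_N^T phi(x),
    var = 1/beta + phi(x)^T S_N phi(x)."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var
```

Note that the variance is always at least the noise floor 1/β, since S_N is positive definite.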
19. Predictive Distribution
   Now we discuss the variance of the predictive distribution:
     σ_N²(x) = 1/β + φ(x)ᵀ S_N φ(x)
   - The first term is additive noise governed by the precision parameter β.
   - The second term depends on the mapping vector φ(x) of each data point x.
   - As additional data points are observed, the posterior distribution becomes
     narrower: σ²_{N+1}(x) ≤ σ²_N(x).
   - The second term of (3.59) goes to zero in the limit N → ∞, so the
     predictive variance approaches 1/β.
20. Predictive Distribution (figure slide)
21. Example - Predictive Distribution
   Gaussian regression with a sine curve
   - Basis functions: 9 Gaussian curves
   (Figure: mean and standard deviation of the predictive distribution)
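A sketch of this experiment with 9 Gaussian basis functions. The centers, width, and data sizes below are chosen for illustration; the slide's exact settings are unknown:

```python
import numpy as np

def design(x, centers, s=0.1):
    """Design matrix of Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))

rng = np.random.default_rng(0)
centers = np.linspace(0, 1, 9)   # 9 Gaussian basis functions
alpha, beta = 2.0, 25.0

x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=25)

Phi = design(x, centers)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Predictive mean and standard deviation on a grid.
xg = np.linspace(0, 1, 50)
Pg = design(xg, centers)
mean = Pg @ mN
std = np.sqrt(1.0 / beta + np.einsum('ij,jk,ik->i', Pg, SN, Pg))
```

Plotting `mean ± std` against `np.sin(2 * np.pi * xg)` reproduces the qualitative behavior shown on the slide: the error band tightens near observed data.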
22. Example - Predictive Distribution: Gaussian regression with a sine curve (figure)
23. Example - Predictive Distribution: Gaussian regression with a sine curve (figure)
24. Problem of Localized Basis - Predictive Distribution
   Polynomial regression vs. Gaussian regression: which is better?
25. Problem of Localized Basis - Predictive Distribution
   If we use localized basis functions such as Gaussians, then in regions away
   from the basis function centers the contribution from the second term in
   (3.59) goes to zero.
   Accordingly, the predictive variance becomes only the noise contribution 1/β.
   But this is not a good result: the model becomes most confident exactly where
   it has seen no data.
26. Problem of Localized Basis - Predictive Distribution
   This problem (arising from choosing localized basis functions) can be avoided
   by adopting an alternative Bayesian approach to regression known as a
   Gaussian process (see §6.4).
27. Case of Unknown Precision - Predictive Distribution
   If both w and β are treated as unknown, then we can introduce a conjugate
   Gaussian-gamma prior distribution, with a corresponding Gaussian-gamma
   posterior distribution.
   The resulting predictive distribution is then a Student's t-distribution.
28. Agenda (section divider; next: 3.3.3 Equivalent kernel)
29. Equivalent Kernel
   If we substitute the posterior mean solution (3.53) into the expression
   (3.3), the predictive mean can be written:
     y(x, m_N) = m_Nᵀ φ(x) = β φ(x)ᵀ S_N Φᵀ t
   This formula can be viewed as a linear combination of the target values tₙ:
     y(x, m_N) = Σₙ k(x, xₙ) tₙ
30. Equivalent Kernel
   The coefficient of each tₙ is given by:
     k(x, x') = β φ(x)ᵀ S_N φ(x')
   This function is called the smoother matrix or equivalent kernel.
   Regression functions which make predictions by taking linear combinations of
   the training set target values are known as linear smoothers.
   We can also predict t for a new input vector x using the equivalent kernel,
   instead of calculating the parameters of the basis functions.
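The smoother form can be verified numerically: predicting through the equivalent kernel k(x, xₙ) = β φ(x)ᵀ S_N φ(xₙ) gives exactly the same value as m_Nᵀφ(x). A sketch with an arbitrary cubic polynomial basis:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0

x = rng.uniform(-1, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=30)
Phi = np.vander(x, 4, increasing=True)   # polynomial basis: 1, x, x^2, x^3

SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

xg = np.linspace(-1, 1, 7)
Pg = np.vander(xg, 4, increasing=True)

K = beta * Pg @ SN @ Phi.T   # equivalent kernel matrix, K[i, n] = k(x_i, x_n)
y_kernel = K @ t             # linear-smoother prediction
y_direct = Pg @ mN           # direct prediction with the posterior mean
```

The rows of `K` also sum to approximately 1 (the localization/normalization property discussed on a later slide), up to a small shrinkage from the prior.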
31. Example 1 - Equivalent Kernel
   Equivalent kernel with Gaussian regression (figure).
   The equivalent kernel depends on the set of basis functions and on the
   data set.
32. Equivalent Kernel
   The equivalent kernel expresses the contribution of each data point to the
   predictive mean.
   The covariance between y(x) and y(x') can be expressed using the equivalent
   kernel:
     cov[y(x), y(x')] = β⁻¹ k(x, x')
33. Properties of Equivalent Kernel
   The equivalent kernel has a localization property even if the basis
   functions themselves are not localized (figure panels: polynomial, sigmoid).
   The equivalent kernel sums to 1 for all x:
     Σₙ k(x, xₙ) = 1
34.-36. Example 2 - Equivalent Kernel
   Equivalent kernel with polynomial regression, shown while moving one
   parameter (figures).
37. Properties of Equivalent Kernel
   The equivalent kernel satisfies an important property shared by kernel
   functions in general: a kernel function can be expressed as an inner product
   with respect to a vector of nonlinear functions:
     k(x, z) = ψ(x)ᵀ ψ(z)
   In the case of the equivalent kernel, ψ(x) is given by:
     ψ(x) = β^{1/2} S_N^{1/2} φ(x)
38. Thank you!