1.
Introduction to Statistical Machine Learning c 2011 Christfried Webers NICTA The Australian National UniversityIntroduction to Statistical Machine Learning Christfried Webers Statistical Machine Learning Group NICTA and College of Engineering and Computer Science The Australian National University Canberra February – June 2011 (Many ﬁgures from C. M. Bishop, "Pattern Recognition and Machine Learning") 1of 229
2.
Introduction to Statistical Machine Learning c 2011 Christfried Webers NICTA The Australian National University Part VI Bayesian Regression Sequential Update of the PosteriorLinear Regression 2 Predictive Distribution Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models 198of 229
3.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National Training Phase University training model with adjustable data x parameter w Bayesian Regression Sequential Update of the training Posterior targets t Predictive Distribution Proof of the Predictive Distribution Predictive Distribution ﬁx the most appropriate w* with Simpliﬁed Prior Test Phase Limitations of Linear Basis Function Models test test model with ﬁxed target data x parameter w* t 199of 229
4.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National University Bayes Theorem likelihood × prior p(t | w) p(w) posterior = p(w | t) = normalisation p(t) Bayesian Regression likelihood for i.i.d. data Sequential Update of the Posterior N Predictive Distribution p(t | w) = N (tn | y(xn , w), β −1 ) Proof of the Predictive Distribution n=1 Predictive Distribution N with Simpliﬁed Prior T −1 = N (tn | w φ(xn ), β ) Limitations of Linear Basis Function Models n=1 1 = const × exp{−β (t − Φw)T (t − Φw)} 2 where we left out the conditioning on x (always assumed), and β, which is assumed to be constant. 200of 229
5.
Introduction to StatisticalHow to choose a prior? Machine Learning c 2011 Christfried Webers NICTA The Australian National University Can we ﬁnd a prior for the given likelihood which makes sense for the problem at hand Bayesian Regression allows us to ﬁnd a posterior in a ’nice’ form Sequential Update of the PosteriorAn answer to the second question: Predictive Distribution Proof of the PredictiveDeﬁnition ( Conjugate Prior) Distribution Predictive DistributionA class of prior probability distributions p(w) is conjugate to a with Simpliﬁed Priorclass of likelihood functions p(x | w) if the resulting posterior Limitations of Linear Basis Function Modelsdistributions p(w | x) are in the same family as p(w). 201of 229
6.
Introduction to StatisticalExamples of Conjugate Prior Distributions Machine Learning c 2011 Christfried Webers NICTA The Australian National University Table: Discrete likelihood distributions Likelihood Conjugate Prior Bernoulli Beta Bayesian Regression Binomial Beta Sequential Update of the Poisson Gamma Posterior Multinomial Dirichlet Predictive Distribution Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Table: Continuous likelihood distributions Limitations of Linear Basis Function Models Likelihood Conjugate Prior Uniform Pareto Exponential Gamma Normal Normal Multivariate normal Multivariate normal 202of 229
7.
Introduction to StatisticalConjugate Prior to a Gaussian Distribution Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example : The Gaussian family is conjugate to itself with respect to a Gaussian likelihood function: if the likelihood function is Gaussian, choosing a Gaussian prior will ensure that the posterior distribution is also Gaussian. Bayesian Regression Given a marginal distribution for x and a conditional Sequential Update of the Gaussian distribution for y given x in the form Posterior Predictive Distribution p(x) = N (x | µ, Λ−1 ) Proof of the Predictive Distribution −1 p(y | x) = N (y | Ax + b, L ) Predictive Distribution with Simpliﬁed Prior Limitations of Linear we get Basis Function Models p(y) = N (y | Aµ + b, L−1 + AΛ−1 AT ) p(x | y) = N (x | Σ{AT L(y − b) + Λµ}, Σ) where Σ = (Λ + AT L A)−1 . 203of 229
8.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National Choose a Gaussian prior with mean m0 and covariance S0 University p(w) = N (w | m0 , S0 ) After having seen N training data pairs (xn , tn ), the posterior for the given likelihood is now Bayesian Regression Sequential Update of the Posterior p(w | t) = N (w | mN , SN ) Predictive Distribution Proof of the Predictive where Distribution Predictive Distribution with Simpliﬁed Prior mN = SN (S−1 m0 + βΦT t) 0 Limitations of Linear Basis Function Models S−1 = S−1 + βΦT Φ N 0 The posterior is Gaussian, therefore mode = mean. The maximum posterior weight vector wMAP = mN . Assume inﬁnitely broad prior S0 = α−1 I with α → 0, the mean reduces to the maximum likelihood wML . 204of 229
9.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National University If we have not yet seen any data point ( N = 0), the Bayesian Regression posterior is equal to the prior. Sequential Update of the Posterior Sequential arrival of data points : Each posterior Predictive Distribution distribution calculated after the arrival of a data point and Proof of the Predictive target value, acts as the prior distribution for the Distribution Predictive Distribution subsequent data point. with Simpliﬁed Prior Nicely ﬁts a sequential learning framework. Limitations of Linear Basis Function Models 205of 229
10.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National University Special simpliﬁed prior in the remainder, m0 = 0 and S0 = α−1 I, p(x | α) = N (x | 0, α−1 I) (1) The parameters of the posterior distribution p(w | t) = N (w | mn , SN ) are now Bayesian Regression Sequential Update of the Posterior T mN = βSN Φ t Predictive Distribution S−1 = αI + βΦT Φ N Proof of the Predictive Distribution Predictive Distribution For α → 0 we get with Simpliﬁed Prior Limitations of Linear Basis Function Models mN → wML = (ΦT Φ)−1 ΦT t Log of posterior is sum of log likelihood and log of prior β α ln p(w | t) = − (t − Φw)T (t − Φw) − wT w + const 2 2 206of 229
11.
Introduction to StatisticalBayesian Regression Machine Learning c 2011 Christfried Webers NICTA The Australian National University Log of posterior is sum of log likelihood and log of prior 1 ln p(w | t) = − β (t − Φw)T (t − Φw) Bayesian Regression 2 Sequential Update of the Posterior sum-of-squares-error α Predictive Distribution − wT w + const Proof of the Predictive 2 Distribution quadr. regulariser Predictive Distribution with Simpliﬁed Prior Maximising the posterior distribution with respect to w Limitations of Linear Basis Function Models corresponds to minimising the sum-of-squares error function with the addition of a quadratic regularisation term λ = α/β. 207of 229
12.
Introduction to StatisticalSequential Update of the Posterior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example of a linear basis function model Single input x, single output t Bayesian Regression Linear model y(x, w) = w0 + w1 x. Sequential Update of the Posterior Data creation Predictive Distribution 1 Choose an xn from the uniform distribution U(x | − 1, 1). Proof of the Predictive Distribution 2 Calculate f (xn , a) = a0 + a1 xn , where a0 = −0.3, a1 = 0.5. Predictive Distribution 3 Add Gaussian noise with standard deviation σ = 0.2, with Simpliﬁed Prior Limitations of Linear tn = N (xn | f (xn , a), 0.04) Basis Function Models Set the precision of the uniform prior to α = 2.0. 208of 229
13.
Introduction to StatisticalSequential Update of the Posterior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Bayesian Regression Sequential Update of the Posterior Predictive Distribution Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models 209of 229
14.
Introduction to StatisticalSequential Update of the Posterior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Bayesian Regression Sequential Update of the Posterior Predictive Distribution Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models 210of 229
15.
Introduction to StatisticalPredictive Distribution Machine Learning c 2011 Christfried Webers NICTA The Australian National University In the training phase, data x and targets t are provided Bayesian Regression In the test phase, a new data value x is given and the Sequential Update of the corresponding target value t is asked for Posterior Predictive Distribution Bayesian approach: Find the probability of the test target t Proof of the Predictive given the test data x, the training data x and the training Distribution targets t Predictive Distribution with Simpliﬁed Prior p(t | x, x, t) Limitations of Linear Basis Function Models This is the Predictive Distribution. 211of 229
16.
Introduction to StatisticalHow to calculate the Predictive Distribution? Machine Learning c 2011 Christfried Webers NICTA Introduce the model parameter w via the sum rule The Australian National University p(t | x, x, t) = p(t, w | x, x, t)dw = p(t | w, x, x, t)p(w | x, x, t)dw Bayesian Regression The test target t depends only on the test data x and the Sequential Update of the Posterior model parameter w, but not on the training data and the Predictive Distribution training targets Proof of the Predictive Distribution p(t | w, x, x, t) = p(t | w, x) Predictive Distribution with Simpliﬁed Prior The model parameter w are learned with the training data Limitations of Linear Basis Function Models x and the training targets t only p(w | x, x, t) = p(w | x, t) Predictive Distribution p(t | x, x, t) = p(t | w, x)p(w | x, t)dw 212of 229
17.
Introduction to StatisticalProof of the Predictive Distribution Machine Learning c 2011 Christfried Webers NICTA The Australian National How to prove the Predictive Distribution in the general University form? p(t | x, x, t) = p(t | w, x, x, t)p(w | x, x, t)dw Bayesian Regression Convert each conditional probability on the right-hand-side Sequential Update of the into a joint probability. Posterior Predictive Distribution Proof of the Predictive p(t | w, x, x, t)p(w | x, x, t)dw Distribution Predictive Distribution p(t, w, x, x, t) p(w, x, x, t) with Simpliﬁed Prior = dw Limitations of Linear p(w, x, x, t) p(x, x, t) Basis Function Models p(t, w, x, x, t) = dw p(x, x, t) p(t, x, x, t) = p(x, x, t) = p(t | x, x, t) 213of 229
18.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA Find the predictive distribution The Australian National University p(t | t, α, β) = p(t | w, β) p(w | t, α, β)dw (remember : The conditioning on the input variables x is often suppressed to simplify the notation.) Bayesian Regression Sequential Update of the Now we know Posterior Predictive Distribution N T −1 Proof of the Predictive p(t | x, w, β) = N (tn | w φ(xn ), β ) Distribution n=1 Predictive Distribution with Simpliﬁed Prior and the posterior was Limitations of Linear Basis Function Models p(w | t, α, β) = N (w | mN , SN ) where mN = βSN ΦT t S−1 = αI + βΦT Φ N 214of 229
19.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University If we do the convolution of the two Gaussians, we get for the predictive distribution Bayesian Regression Sequential Update of the Posterior 2 p(t | x, t, α, β) = N (t | mT φ(x), σN (x)) N Predictive Distribution Proof of the Predictive 2 where the variance σN (x) is given by Distribution Predictive Distribution with Simpliﬁed Prior 2 1 σN (x) = + φ(x)T SN φ(x). Limitations of Linear β Basis Function Models 215of 229
20.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example with artiﬁcial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 1. 1 Bayesian Regression t Sequential Update of the Posterior Predictive Distribution 0 Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear −1 Basis Function Models 0 x 1 Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded). 216of 229
21.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example with artiﬁcial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 2. 1 Bayesian Regression t Sequential Update of the Posterior Predictive Distribution 0 Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear −1 Basis Function Models 0 x 1 Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded). 217of 229
22.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example with artiﬁcial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 4. 1 Bayesian Regression t Sequential Update of the Posterior Predictive Distribution 0 Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear −1 Basis Function Models 0 x 1 Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded). 218of 229
23.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Example with artiﬁcial sinusoidal data from sin(2πx) (green) and added noise. Number of data points N = 25. 1 Bayesian Regression t Sequential Update of the Posterior Predictive Distribution 0 Proof of the Predictive Distribution Predictive Distribution with Simpliﬁed Prior Limitations of Linear −1 Basis Function Models 0 x 1 Mean of the predictive distribution (red) and regions of one standard deviation from mean (red shaded). 219of 229
24.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 1. Bayesian Regression Sequential Update of the Posterior 1 1 Predictive Distributiont t Proof of the Predictive Distribution 0 0 Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models−1 −1 0 x 1 0 x 1 220of 229
25.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 2. Bayesian Regression Sequential Update of the Posterior 1 1 Predictive Distributiont t Proof of the Predictive Distribution 0 0 Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models−1 −1 0 x 1 0 x 1 221of 229
26.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 4. Bayesian Regression Sequential Update of the Posterior 1 1 Predictive Distributiont t Proof of the Predictive Distribution 0 0 Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models−1 −1 0 x 1 0 x 1 222of 229
27.
Introduction to StatisticalPredictive Distribution with Simpliﬁed Prior Machine Learning c 2011 Christfried Webers NICTA The Australian National University Plots of the function y(x, w) using samples from the posterior distribution over w. Number of data points N = 25. Bayesian Regression Sequential Update of the Posterior 1 1 Predictive Distributiont t Proof of the Predictive Distribution 0 0 Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models−1 −1 0 x 1 0 x 1 223of 229
28.
Introduction to StatisticalLimitations of Linear Basis Function Models Machine Learning c 2011 Christfried Webers NICTA The Australian National University Basis function φj (x) are ﬁxed before the training data set is observed. Curse of dimensionality : Number of basis function grows Bayesian Regression rapidly, often exponentially, with the dimensionality D. Sequential Update of the But typical data sets have two nice properties which can Posterior be exploited if the basis functions are not ﬁxed : Predictive Distribution Proof of the Predictive Data lie close to a nonlinear manifold with intrinsic Distribution dimension much smaller than D. Need algorithms which Predictive Distribution with Simpliﬁed Prior place basis functions only where data are (e.g. radial basis Limitations of Linear function networks, support vector machines, relevance Basis Function Models vector machines, neural networks). Target variables may only depend on a few signiﬁcant directions within the data manifold. Need algorithms which can exploit this property (Neural networks). 224of 229
29.
Introduction to StatisticalCurse of Dimensionality Machine Learning c 2011 Christfried Webers NICTA The Australian National University Linear Algebra allows us to operate in n-dimensional vector spaces using the intution from our 3-dimensional world as a vector space. No surprises as long as n is ﬁnite. If we add more structure to a vector space (e.g. inner Bayesian Regression product, metric), our intution gained from the Sequential Update of the Posterior 3-dimensional world around us may be wrong. Predictive Distribution Example: Sphere of radius r = 1. What is the fraction of Proof of the Predictive Distribution the volume of the sphere in a D-dimensional space which Predictive Distribution lies between radius r = 1 and r = 1 − ? with Simpliﬁed Prior Volume scales like rD , therefore the formula for the volume Limitations of Linear Basis Function Models of a sphere is VD (r) = KD rD . VD (1) − VD (1 − ) = 1 − (1 − )D VD (1) 225of 229
30.
Introduction to StatisticalCurse of Dimensionality Machine Learning c 2011 Christfried Webers NICTA The Australian National Fraction of the volume of the sphere in a D-dimensional University space which lies between radius r = 1 and r = 1 − VD (1) − VD (1 − ) = 1 − (1 − )D VD (1) Bayesian Regression Sequential Update of the 1 Posterior D = 20 Predictive Distribution 0.8 D =5 Proof of the Predictive Distribution D =2 volume fraction Predictive Distribution 0.6 with Simpliﬁed Prior D =1 Limitations of Linear Basis Function Models 0.4 0.2 0 0 0.2 0.4 0.6 0.8 1 226of 229
31.
Introduction to StatisticalCurse of Dimensionality Machine Learning c 2011 Christfried Webers NICTA The Australian National University Probability density with respect to radius r of a Gaussian distribution for various values of the dimensionality D. 2 Bayesian Regression D=1 Sequential Update of the Posterior Predictive Distribution D=2 p(r) Proof of the Predictive 1 Distribution D = 20 Predictive Distribution with Simpliﬁed Prior Limitations of Linear Basis Function Models 0 0 2 4 r 227of 229
32.
Introduction to StatisticalCurse of Dimensionality Machine Learning c 2011 Christfried Webers NICTA Probability density with respect to radius r of a Gaussian The Australian National University distribution for various values of the dimensionality D. Example: D = 2; assume µ = 0, Σ = I 1 1 1 1 2 2 N (x | 0, I) = exp − xT x = exp − (x1 + x2 ) 2π 2 2π 2 Bayesian Regression Sequential Update of the Posterior Coordinate transformation Predictive Distribution Proof of the Predictive x1 = r cos(φ) x2 = r sin(φ) Distribution Predictive Distribution Probability in the new coordinates with Simpliﬁed Prior Limitations of Linear Basis Function Models p(r, φ | 0, I) = N (r(x), φ(x) | 0, I) | J | where | J | = r is the determinant of the Jacobian for the given coordinate transformation. 1 1 p(r, φ | 0, I) = r exp − r2 2π 2 228of 229
33.
Introduction to StatisticalCurse of Dimensionality Machine Learning c 2011 Christfried Webers NICTA Probability density with respect to radius r of a Gaussian The Australian National University distribution for D = 2 (and µ = 0, Σ = I) 1 1 p(r, φ | 0, I) = r exp − r2 2π 2 Bayesian Regression Integrate over all angles φ Sequential Update of the Posterior 2π 1 1 1 Predictive Distribution p(r | 0, I) = r exp − r2 dφ = r exp − r2 Proof of the Predictive 0 2π 2 2 Distribution Predictive Distribution with Simpliﬁed Prior 2 Limitations of Linear D=1 Basis Function Models D=2 p(r) 1 D = 20 0 0 2 4 r 229of 229
Be the first to comment