Linear Regression, Part 2
Transcript

  • 1. Introduction to Statistical Machine Learning. Christfried Webers, Statistical Machine Learning Group, NICTA and College of Engineering and Computer Science, The Australian National University, Canberra. February – June 2011. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".) © 2011 Christfried Webers, NICTA, The Australian National University.
  • 2. Part VI: Bayesian Regression (Linear Regression 2). Contents: Bayesian Regression; Sequential Update of the Posterior; Predictive Distribution; Proof of the Predictive Distribution; Predictive Distribution with Simplified Prior; Limitations of Linear Basis Function Models.
  • 3. Bayesian Regression: training and test phases. Training phase: train a model with adjustable parameter w on the training data x and training targets t, then fix the most appropriate w*. Test phase: apply the model with the fixed parameter w* to new test data x to predict the target t.
  • 4. Bayes' theorem: posterior = likelihood × prior / normalisation, i.e. $p(w \mid t) = p(t \mid w)\, p(w) / p(t)$. The likelihood for i.i.d. data is
    $$p(t \mid w) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1}) = \text{const} \times \exp\left\{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)\right\},$$
    where we leave out the conditioning on x (always assumed) and on β, which is assumed to be constant.
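For concreteness, here is a minimal numpy sketch of this log-likelihood (not from the slides; `Phi`, `t`, `w` and `beta` are illustrative names for the design matrix, target vector, weight vector and noise precision):

```python
import numpy as np

def log_likelihood(Phi, t, w, beta):
    """ln p(t | w) = (N/2) ln(beta / 2pi) - (beta/2) ||t - Phi w||^2
    for a linear basis function model with Gaussian noise of precision beta."""
    N = len(t)
    resid = t - Phi @ w
    return 0.5 * N * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * resid @ resid
```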
  • 5. How to choose a prior? We look for a prior which, for the given likelihood, (a) makes sense for the problem at hand and (b) allows us to find a posterior in a 'nice' form. An answer to the second question is conjugacy. Definition (Conjugate Prior): a class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).
  • 6. Examples of conjugate prior distributions.
    Discrete likelihood distributions: Bernoulli → Beta; Binomial → Beta; Poisson → Gamma; Multinomial → Dirichlet.
    Continuous likelihood distributions: Uniform → Pareto; Exponential → Gamma; Normal → Normal; Multivariate normal → Multivariate normal.
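As an illustration of the first table entry (a sketch, with invented hyperparameters and data): a Beta prior on a Bernoulli parameter stays in the Beta family after conditioning on observations.

```python
import numpy as np

a, b = 2.0, 2.0                       # Beta(a, b) prior hyperparameters (made up)
flips = np.array([1, 0, 1, 1, 0, 1])  # Bernoulli observations (made up)
a_post = a + flips.sum()              # posterior: Beta(a + #heads, b + #tails)
b_post = b + len(flips) - flips.sum()
print(f"posterior = Beta({a_post}, {b_post})")  # same family as the prior
```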
  • 7. Conjugate prior to a Gaussian distribution. Example: the Gaussian family is conjugate to itself with respect to a Gaussian likelihood function; if the likelihood function is Gaussian, choosing a Gaussian prior ensures that the posterior distribution is also Gaussian. Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x in the form
    $$p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1}),$$
    we get
    $$p(y) = \mathcal{N}(y \mid A \mu + b, L^{-1} + A \Lambda^{-1} A^T), \qquad p(x \mid y) = \mathcal{N}(x \mid \Sigma\{A^T L (y - b) + \Lambda \mu\}, \Sigma),$$
    where $\Sigma = (\Lambda + A^T L A)^{-1}$.
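These identities can be sanity-checked numerically. The sketch below (with made-up `A`, `b`, `mu` and precisions `Lam`, `L`) samples from the two Gaussians and compares the empirical covariance of y with the formula $L^{-1} + A \Lambda^{-1} A^T$:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.5], [0.0, 2.0]])  # all values here are illustrative
b = np.array([0.3, -0.1])
mu = np.zeros(2)
Lam = 4.0 * np.eye(2)   # precision of x
L = 9.0 * np.eye(2)     # precision of y given x

n = 100_000
x = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(2), np.linalg.inv(L), size=n)
print(np.cov(y.T))                                      # empirical covariance
print(np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)  # closed-form covariance
```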
  • 8. Bayesian Regression. Choose a Gaussian prior with mean $m_0$ and covariance $S_0$: $p(w) = \mathcal{N}(w \mid m_0, S_0)$. After having seen N training data pairs $(x_n, t_n)$, the posterior for the given likelihood is $p(w \mid t) = \mathcal{N}(w \mid m_N, S_N)$, where
    $$m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T t), \qquad S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi.$$
    The posterior is Gaussian, therefore mode = mean, and the maximum posterior weight vector is $w_{MAP} = m_N$. For an infinitely broad prior $S_0 = \alpha^{-1} I$ with $\alpha \to 0$, the mean reduces to the maximum likelihood solution $w_{ML}$.
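A direct sketch of this posterior update (the function and argument names are my own, not from the slides):

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Return m_N, S_N of p(w | t) = N(w | m_N, S_N)
    for a Gaussian prior N(m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN
```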
  • 9. If we have not yet seen any data point (N = 0), the posterior is equal to the prior. Sequential arrival of data points: each posterior distribution, calculated after the arrival of a data point and target value, acts as the prior distribution for the subsequent data point. This nicely fits a sequential learning framework; see the sketch below.
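Continuing the `posterior()` sketch above with made-up data: feeding the points one at a time, with each posterior serving as the next prior, reproduces the batch posterior exactly, which is the sequential-learning property described on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 2))   # made-up design matrix
t = rng.normal(size=10)          # made-up targets
beta = 25.0

m, S = np.zeros(2), 0.5 * np.eye(2)           # prior N(m0, S0)
for phi_n, t_n in zip(Phi, t):                # each posterior becomes the next prior
    m, S = posterior(phi_n[None, :], np.array([t_n]), m, S, beta)

m_batch, S_batch = posterior(Phi, t, np.zeros(2), 0.5 * np.eye(2), beta)
assert np.allclose(m, m_batch) and np.allclose(S, S_batch)
```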
  • 10. Special simplified prior in the remainder: $m_0 = 0$ and $S_0 = \alpha^{-1} I$, i.e. $p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$. The parameters of the posterior distribution $p(w \mid t) = \mathcal{N}(w \mid m_N, S_N)$ are now
    $$m_N = \beta S_N \Phi^T t, \qquad S_N^{-1} = \alpha I + \beta \Phi^T \Phi.$$
    For $\alpha \to 0$ we get $m_N \to w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$. The log of the posterior is the sum of the log likelihood and the log of the prior:
    $$\ln p(w \mid t) = -\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w) - \tfrac{\alpha}{2} w^T w + \text{const}.$$
  • 11. The log of the posterior is the sum of a sum-of-squares error term and a quadratic regulariser:
    $$\ln p(w \mid t) = \underbrace{-\tfrac{\beta}{2} (t - \Phi w)^T (t - \Phi w)}_{\text{sum-of-squares error}} \;\underbrace{-\; \tfrac{\alpha}{2} w^T w}_{\text{quadratic regulariser}} \;+\; \text{const}.$$
    Maximising the posterior distribution with respect to w therefore corresponds to minimising the sum-of-squares error function with the addition of a quadratic regularisation term with $\lambda = \alpha / \beta$; the sketch below checks this numerically.
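A quick numerical check of this correspondence, with an invented design matrix and targets: the posterior mean $m_N$ equals the ridge-regression solution with $\lambda = \alpha/\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0           # illustrative precisions
Phi = rng.normal(size=(50, 3))    # made-up design matrix
t = rng.normal(size=50)           # made-up targets
M = Phi.shape[1]

lam = alpha / beta                # the regularisation constant from the slide
ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)
mN = beta * np.linalg.solve(alpha * np.eye(M) + beta * Phi.T @ Phi, Phi.T @ t)
assert np.allclose(ridge, mN)     # MAP estimate == ridge solution
```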
  • 12. Sequential Update of the Posterior: example of a linear basis function model with single input x and single output t, $y(x, w) = w_0 + w_1 x$. Data creation: (1) choose $x_n$ from the uniform distribution $U(x \mid -1, 1)$; (2) calculate $f(x_n, a) = a_0 + a_1 x_n$, where $a_0 = -0.3$, $a_1 = 0.5$; (3) add Gaussian noise with standard deviation $\sigma = 0.2$, so $t_n \sim \mathcal{N}(f(x_n, a), 0.04)$. Set the precision of the Gaussian prior to $\alpha = 2.0$. A data-generation sketch follows.
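A sketch of this data-creation recipe in numpy (the number of points `N` and the random seed are my choices; the remaining constants are from the slide):

```python
import numpy as np

rng = np.random.default_rng(42)

a0, a1, sigma = -0.3, 0.5, 0.2                    # values from the slide
N = 20                                            # illustrative number of points
x = rng.uniform(-1.0, 1.0, size=N)                # step 1: x_n ~ U(-1, 1)
t = a0 + a1 * x + sigma * rng.standard_normal(N)  # steps 2-3: noise variance 0.04
alpha = 2.0                                       # prior precision from the slide
beta = 1.0 / sigma**2                             # noise precision (assumed known)
Phi = np.column_stack([np.ones(N), x])            # design matrix for y = w0 + w1 x
```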
  • 13.–14. Sequential Update of the Posterior. (Figure-only slides illustrating the posterior over $(w_0, w_1)$ and samples of $y(x, w)$ after successive data points from the example above.)
  • 15. Predictive Distribution. In the training phase, data $\mathbf{x}$ and targets $\mathbf{t}$ are provided. In the test phase, a new data value x is given and the corresponding target value t is asked for. The Bayesian approach: find the probability of the test target t given the test data x, the training data $\mathbf{x}$ and the training targets $\mathbf{t}$,
    $$p(t \mid x, \mathbf{x}, \mathbf{t}).$$
    This is the predictive distribution.
  • 16. How to calculate the predictive distribution? Introduce the model parameter w via the sum rule:
    $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t, w \mid x, \mathbf{x}, \mathbf{t})\, dw = \int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw.$$
    The test target t depends only on the test data x and the model parameter w, but not on the training data and training targets: $p(t \mid w, x, \mathbf{x}, \mathbf{t}) = p(t \mid w, x)$. The model parameters w are learned from the training data $\mathbf{x}$ and training targets $\mathbf{t}$ only: $p(w \mid x, \mathbf{x}, \mathbf{t}) = p(w \mid \mathbf{x}, \mathbf{t})$. Hence the predictive distribution is
    $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw.$$
  • 17. Proof of the Predictive Distribution. How to prove the predictive distribution in the general form
    $$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw\,?$$
    Convert each conditional probability on the right-hand side into a joint probability:
    $$\int p(t \mid w, x, \mathbf{x}, \mathbf{t})\, p(w \mid x, \mathbf{x}, \mathbf{t})\, dw = \int \frac{p(t, w, x, \mathbf{x}, \mathbf{t})}{p(w, x, \mathbf{x}, \mathbf{t})}\, \frac{p(w, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})}\, dw = \frac{\int p(t, w, x, \mathbf{x}, \mathbf{t})\, dw}{p(x, \mathbf{x}, \mathbf{t})} = \frac{p(t, x, \mathbf{x}, \mathbf{t})}{p(x, \mathbf{x}, \mathbf{t})} = p(t \mid x, \mathbf{x}, \mathbf{t}).$$
  • 18. Predictive Distribution with Simplified Prior. Find the predictive distribution
    $$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid w, \beta)\, p(w \mid \mathbf{t}, \alpha, \beta)\, dw$$
    (remember: the conditioning on the input variables x is often suppressed to simplify the notation). Now we know the likelihood
    $$p(\mathbf{t} \mid \mathbf{x}, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w^T \phi(x_n), \beta^{-1})$$
    and the posterior was
    $$p(w \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N), \qquad m_N = \beta S_N \Phi^T \mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta \Phi^T \Phi.$$
  • 19. If we do the convolution of the two Gaussians, we get for the predictive distribution
    $$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid m_N^T \phi(x), \sigma_N^2(x)),$$
    where the variance $\sigma_N^2(x)$ is given by
    $$\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x).$$
    A sketch of this computation follows.
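A minimal sketch of the predictive mean and variance, assuming `mN`, `SN` and `beta` have been computed as in the posterior sketch above and `phi_x` is the feature vector $\phi(x)$ of a new input:

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    """Mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)
    of the predictive distribution at a single feature vector phi_x."""
    mean = mN @ phi_x
    var = 1.0 / beta + phi_x @ SN @ phi_x
    return mean, var
```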
  • 20.–23. Predictive Distribution with Simplified Prior: examples with artificial sinusoidal data from sin(2πx) (green) and added noise, for N = 1, 2, 4 and 25 data points. (Figures: mean of the predictive distribution (red) and region of one standard deviation from the mean (red shaded) for each N.)
  • 24.–27. Predictive Distribution with Simplified Prior: plots of the function y(x, w) using samples from the posterior distribution over w, for N = 1, 2, 4 and 25 data points. (Figure-only slides, two panels per slide.)
  • 28. Limitations of Linear Basis Function Models. The basis functions $\phi_j(x)$ are fixed before the training data set is observed. Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the input dimensionality D. But typical data sets have two nice properties which can be exploited if the basis functions are not fixed: (1) the data lie close to a nonlinear manifold with intrinsic dimension much smaller than D, so we need algorithms which place basis functions only where the data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks); (2) the target variables may depend on only a few significant directions within the data manifold, so we need algorithms which can exploit this property (e.g. neural networks).
  • 29. Curse of Dimensionality. Linear algebra allows us to operate in n-dimensional vector spaces using the intuition from our 3-dimensional world as a vector space; there are no surprises as long as n is finite. But if we add more structure to a vector space (e.g. an inner product or a metric), the intuition gained from the 3-dimensional world around us may be wrong. Example: a sphere of radius r = 1. What is the fraction of the volume of the sphere in D-dimensional space which lies between radius $r = 1 - \epsilon$ and $r = 1$? The volume scales like $r^D$, therefore the formula for the volume of a sphere is $V_D(r) = K_D r^D$, and
    $$\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D.$$
  • 30. The fraction of the volume of the sphere in D-dimensional space which lies between radius $r = 1 - \epsilon$ and $r = 1$ is
    $$\frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} = 1 - (1 - \epsilon)^D.$$
    (Figure: volume fraction as a function of $\epsilon$ for D = 1, 2, 5 and 20.) A quick numerical check follows.
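A quick numerical check of the shell fraction (plain Python; the chosen values of D and $\epsilon$ match the plot):

```python
# Shell fraction 1 - (1 - eps)^D of a unit sphere in D dimensions:
for D in (1, 2, 5, 20):
    row = [round(1 - (1 - eps)**D, 3) for eps in (0.01, 0.1, 0.5)]
    print(D, row)
# Already for D = 20 and eps = 0.1, roughly 88% of the volume lies in the shell.
```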
  • 31. Curse of Dimensionality. (Figure: probability density with respect to the radius r of a Gaussian distribution, for dimensionality D = 1, 2 and 20.)
  • 32. Probability density with respect to the radius r of a Gaussian distribution, for various values of the dimensionality D. Example: D = 2, with $\mu = 0$, $\Sigma = I$:
    $$\mathcal{N}(x \mid 0, I) = \frac{1}{2\pi} \exp\left\{-\tfrac{1}{2} x^T x\right\} = \frac{1}{2\pi} \exp\left\{-\tfrac{1}{2}(x_1^2 + x_2^2)\right\}.$$
    Apply the coordinate transformation $x_1 = r \cos\phi$, $x_2 = r \sin\phi$. The probability density in the new coordinates is
    $$p(r, \phi \mid 0, I) = \mathcal{N}(x(r, \phi) \mid 0, I)\, |J|,$$
    where $|J| = r$ is the determinant of the Jacobian for the given coordinate transformation, so
    $$p(r, \phi \mid 0, I) = \frac{1}{2\pi}\, r \exp\left\{-\tfrac{1}{2} r^2\right\}.$$
  • 33. Probability density with respect to the radius r of a Gaussian distribution for D = 2 (and $\mu = 0$, $\Sigma = I$): $p(r, \phi \mid 0, I) = \frac{1}{2\pi}\, r \exp\{-\tfrac{1}{2} r^2\}$. Integrate over all angles $\phi$:
    $$p(r \mid 0, I) = \int_0^{2\pi} \frac{1}{2\pi}\, r \exp\left\{-\tfrac{1}{2} r^2\right\} d\phi = r \exp\left\{-\tfrac{1}{2} r^2\right\}.$$
    (Figure: p(r) for D = 1, 2 and 20.)
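As a sanity check of this density (an illustration, not from the slides): for a standard 2-D Gaussian the radius follows $p(r) = r \exp(-r^2/2)$, whose mean is $\sqrt{\pi/2} \approx 1.2533$; a Monte Carlo estimate agrees.

```python
import numpy as np

rng = np.random.default_rng(1)

# Radii of samples from a standard 2-D Gaussian vs. the analytic mean radius.
r = np.linalg.norm(rng.standard_normal((100_000, 2)), axis=1)
print(r.mean(), np.sqrt(np.pi / 2.0))  # both approximately 1.2533
```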
