1.
Introduction to Statistical Machine Learning
© 2011 Christfried Webers
Statistical Machine Learning Group, NICTA, and
College of Engineering and Computer Science, The Australian National University
Canberra, February – June 2011
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
2.
Part VIII: Linear Classification 2

Probabilistic Generative Models
  Continuous Input
  Discrete Features
Probabilistic Discriminative Models
  Logistic Regression
  Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
3.
Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
  Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
  1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
  2 Use decision theory to assign each new x to one of the classes.

Generative Models
  1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
  2 Also, infer the prior class probabilities p(Ck).
  3 Use Bayes' theorem to find the posterior p(Ck | x).
  4 Alternatively, model the joint distribution p(x, Ck) directly.
  5 Use decision theory to assign each new x to one of the classes.
4.
Probabilistic Generative Models

Generative approach: model the class-conditional densities p(x | Ck) and the priors p(Ck) to calculate the posterior probability for class C1:

  p(C1 | x) = p(x | C1) p(C1) / ( p(x | C1) p(C1) + p(x | C2) p(C2) )
            = 1 / (1 + exp(-a(x)))
            = σ(a(x))

where a(x) and the logistic sigmoid function σ(a) are given by

  a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]
  σ(a) = 1 / (1 + exp(-a)).
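As a quick numerical sanity check, the direct Bayes computation and the sigmoid of the log-odds a(x) give the same posterior. A minimal sketch (function names are ours, for illustration only):

```python
import math

def sigmoid(a):
    """Logistic sigmoid sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def posterior_c1(px_c1, px_c2, p_c1, p_c2):
    """Posterior p(C1|x) computed two ways: direct Bayes' rule,
    and the sigmoid of the log-odds a(x)."""
    direct = px_c1 * p_c1 / (px_c1 * p_c1 + px_c2 * p_c2)
    a = math.log((px_c1 * p_c1) / (px_c2 * p_c2))   # log-odds a(x)
    return direct, sigmoid(a)
```

For example, with p(x|C1) = 0.3, p(x|C2) = 0.1 and equal priors, both routes give p(C1|x) = 0.75.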
5.
Logistic Sigmoid

The logistic sigmoid function σ(a) = 1 / (1 + exp(-a)) is called a "squashing function" because it maps the real axis into the finite interval (0, 1).

  Symmetry:   σ(-a) = 1 - σ(a)
  Derivative: dσ/da = σ(a) σ(-a) = σ(a) (1 - σ(a))
  Inverse (the logit function): a(σ) = ln( σ / (1 - σ) )

[Figures: the logistic sigmoid σ(a) (left) and the logit a(σ) (right).]
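The three identities above are easy to verify numerically, the derivative one by comparing against a central finite difference (a small sketch; helper names are illustrative):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(s):
    """Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma))."""
    return math.log(s / (1.0 - s))

def sigmoid_deriv_numeric(a, h=1e-6):
    """Central finite-difference derivative, to compare against
    the identity dsigma/da = sigma(a)(1 - sigma(a))."""
    return (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
```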
6.
Probabilistic Generative Models - Multiclass

The normalised exponential is given by

  p(Ck | x) = p(x | Ck) p(Ck) / Σ_j p(x | Cj) p(Cj) = exp(a_k) / Σ_j exp(a_j)

where

  a_k = ln( p(x | Ck) p(Ck) ).

Also called the softmax function, as it is a smoothed version of the max function.

Example: if a_k ≫ a_j for all j ≠ k, then p(Ck | x) ≈ 1 and p(Cj | x) ≈ 0.
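A direct implementation of the normalised exponential; subtracting max(a) before exponentiating is a standard numerical-stability trick (the shift cancels in the ratio). A sketch, with illustrative names:

```python
import math

def softmax(a):
    """Normalised exponential p(Ck|x) = exp(ak) / sum_j exp(aj),
    shifted by max(a) so the exponentials cannot overflow."""
    m = max(a)
    exps = [math.exp(ak - m) for ak in a]
    z = sum(exps)
    return [e / z for e in exps]
```

With a_1 much larger than the other activations, the first output is close to 1, matching the "smoothed max" description on the slide.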
7.
Probabilistic Generative Model - Continuous Input

Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?

  p(x | Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ -½ (x - µk)^T Σ^{-1} (x - µk) }
            = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ -½ x^T Σ^{-1} x } × exp{ µk^T Σ^{-1} x - ½ µk^T Σ^{-1} µk }

where we have separated the quadratic term in x from the linear term.
8.
Probabilistic Generative Model - Continuous Input

For two classes, p(C1 | x) = σ(a(x)), and a(x) is

  a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ]
       = ln [ exp( µ1^T Σ^{-1} x - ½ µ1^T Σ^{-1} µ1 ) / exp( µ2^T Σ^{-1} x - ½ µ2^T Σ^{-1} µ2 ) ] + ln( p(C1) / p(C2) )

Therefore

  p(C1 | x) = σ(w^T x + w0)

where

  w  = Σ^{-1} (µ1 - µ2)
  w0 = -½ µ1^T Σ^{-1} µ1 + ½ µ2^T Σ^{-1} µ2 + ln( p(C1) / p(C2) ).
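The formulas for w and w0 are mechanical to implement. A minimal sketch (function name is ours); with unit covariance, symmetric means µ1 = (1, 0), µ2 = (-1, 0) and equal priors, it gives w = (2, 0) and w0 = 0, so the decision boundary is the axis midway between the means:

```python
import numpy as np

def shared_cov_boundary(mu1, mu2, Sigma, p1, p2):
    """Parameters of the linear discriminant p(C1|x) = sigma(w^T x + w0)
    for two Gaussian classes sharing the covariance Sigma."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sinv @ mu1
          + 0.5 * mu2 @ Sinv @ mu2
          + np.log(p1 / p2))
    return w, w0
```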
9.
Probabilistic Generative Model - Continuous Input

[Figure: class-conditional densities for two classes (left) and the posterior probability p(C1 | x) (right).]

Note that the posterior is a logistic sigmoid of a linear function of x.
10.
General Case - K Classes, Shared Covariance

Use the normalised exponential

  p(Ck | x) = p(x | Ck) p(Ck) / Σ_j p(x | Cj) p(Cj) = exp(a_k) / Σ_j exp(a_j)

where

  a_k = ln( p(x | Ck) p(Ck) )

to get a linear function of x:

  a_k(x) = w_k^T x + w_{k0}

where

  w_k    = Σ^{-1} µk
  w_{k0} = -½ µk^T Σ^{-1} µk + ln p(Ck).
11.
General Case - K Classes, Different Covariance

If each class-conditional probability has a different covariance Σ_k, the quadratic terms -½ x^T Σ_k^{-1} x no longer cancel each other out. We get a quadratic discriminant.

[Figure: decision regions for three classes; boundaries between classes with different covariances are quadratic.]
12.
Maximum Likelihood Solution

Given the functional form of the class-conditional densities p(x | Ck), can we determine the parameters µ and Σ? Not without data ;-)

Given also a data set (x_n, t_n) for n = 1, ..., N, using the coding scheme where t_n = 1 corresponds to class C1 and t_n = 0 to class C2.

Assume the class-conditional densities to be Gaussian with the same covariance but different means. Denote the prior probability p(C1) = π, and therefore p(C2) = 1 - π. Then

  p(x_n, C1) = p(C1) p(x_n | C1) = π N(x_n | µ1, Σ)
  p(x_n, C2) = p(C2) p(x_n | C2) = (1 - π) N(x_n | µ2, Σ).
13.
Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by

  p(t, X | π, µ1, µ2, Σ) = Π_{n=1}^{N} [π N(x_n | µ1, Σ)]^{t_n} [(1 - π) N(x_n | µ2, Σ)]^{1 - t_n}

Maximise the log likelihood. The term depending on π is

  Σ_{n=1}^{N} { t_n ln π + (1 - t_n) ln(1 - π) }

which is maximal for

  π = (1/N) Σ_{n=1}^{N} t_n = N_1 / N = N_1 / (N_1 + N_2)

where N_1 is the number of data points in class C1 (and N_2 the number in class C2).
14.
Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood p(t, X | π, µ1, µ2, Σ)) with respect to the means µ1 and µ2, and get

  µ1 = (1/N_1) Σ_{n=1}^{N} t_n x_n
  µ2 = (1/N_2) Σ_{n=1}^{N} (1 - t_n) x_n

For each class, these are the means of all input vectors assigned to that class.
15.
Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ1, µ2, Σ) can be maximised for the covariance Σ, resulting in

  Σ   = (N_1/N) S_1 + (N_2/N) S_2
  S_k = (1/N_k) Σ_{n ∈ C_k} (x_n - µk)(x_n - µk)^T.
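The three maximum-likelihood estimates above (prior π, class means, pooled covariance) can be computed in a few lines. A sketch under the slide's 0/1 coding of t (the function name is ours):

```python
import numpy as np

def ml_estimates(X, t):
    """ML estimates for the two-class shared-covariance Gaussian model.
    X: (N, D) inputs; t: (N,) labels with t[n] = 1 for C1, 0 for C2.
    Returns pi = N1/N, the class means, and the pooled covariance."""
    N = len(t)
    N1, N2 = t.sum(), N - t.sum()
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    d1 = X[t == 1] - mu1
    d2 = X[t == 0] - mu2
    S1 = d1.T @ d1 / N1            # per-class scatter about its mean
    S2 = d2.T @ d2 / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma
```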
16.
Discrete Features - Naive Bayes

Assume the input space consists of discrete features, in the simplest case x_i ∈ {0, 1}.

For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries. Together with the normalisation constraint, these are 2^D - 1 independent variables, a number that grows exponentially with the number of features.

The naive Bayes assumption is that all features, conditioned on the class Ck, are independent of each other:

  p(x | Ck) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 - µ_{ki})^{1 - x_i}.
17.
Discrete Features - Naive Bayes

With the naive Bayes assumption

  p(x | Ck) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 - µ_{ki})^{1 - x_i}

we can then again find the factors a_k in the normalised exponential

  p(Ck | x) = p(x | Ck) p(Ck) / Σ_j p(x | Cj) p(Cj) = exp(a_k) / Σ_j exp(a_j)

as a linear function of the x_i:

  a_k(x) = Σ_{i=1}^{D} { x_i ln µ_{ki} + (1 - x_i) ln(1 - µ_{ki}) } + ln p(Ck).
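The factor a_k for one class is a straightforward sum over features. A minimal sketch (illustrative names; exp(a_k) should recover the unnormalised joint p(x | Ck) p(Ck), which is what the assertion below checks):

```python
import math

def naive_bayes_logit(x, mu, prior):
    """a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(Ck)
    for one class with Bernoulli parameters mu = [mu_k1, ..., mu_kD]."""
    a = math.log(prior)
    for xi, mui in zip(x, mu):
        a += xi * math.log(mui) + (1 - xi) * math.log(1 - mui)
    return a
```

Feeding the a_k of all classes into the softmax of the previous slides yields the posterior p(Ck | x).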
18.
Three Models for Decision Problems

In increasing order of complexity:

Discriminant Functions
  Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
  1 Solve the inference problem of determining the posterior class probabilities p(Ck | x).
  2 Use decision theory to assign each new x to one of the classes.

Generative Models
  1 Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
  2 Also, infer the prior class probabilities p(Ck).
  3 Use Bayes' theorem to find the posterior p(Ck | x).
  4 Alternatively, model the joint distribution p(x, Ck) directly.
  5 Use decision theory to assign each new x to one of the classes.
19.
Probabilistic Discriminative Models

Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly: discriminative training.

Typically fewer parameters to be determined.

As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.

But: discriminative models cannot create synthetic data, as p(x) is not modelled.
20.
Original Input versus Feature Space

So far we have used the direct input x. All classification algorithms work equally well if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).

Example: use two Gaussian basis functions centred at the green crosses in the input space.

[Figure: input space (x1, x2) with the two Gaussian basis centres, and the resulting feature space (φ1, φ2).]
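A feature map built from Gaussian basis functions, as in the example, can be sketched as follows (the function name and the width parameter s are ours; the slide's centres are fixed by hand, so they are passed in explicitly):

```python
import numpy as np

def gaussian_features(X, centres, s=1.0):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for fixed centres c_j.
    X: (N, D) inputs; centres: (M, D); returns (N, M) feature matrix."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * s ** 2))
```

An input sitting exactly on a centre gets feature value 1 for that basis function, and smaller values for the others.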
21.
Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.

Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.

BUT: if classes overlap in input space, they will also overlap in feature space. Nonlinear features φ(x) cannot remove the overlap; they may even increase it!

[Figure: a linear boundary in feature space (φ1, φ2) corresponding to a nonlinear boundary in input space (x1, x2).]
22.
Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).

Understanding of more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.

Some applications use fixed features successfully by avoiding these limitations.

We will therefore use φ instead of x from now on.
23.
Logistic Regression is Classification

Two classes, where the posterior of class C1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ:

  p(C1 | φ) = y(φ) = σ(w^T φ)
  p(C2 | φ) = 1 - p(C1 | φ)

The model dimension equals the dimension M of the feature space. Compare this to fitting two Gaussians:

  2M (means) + M(M + 1)/2 (shared covariance) = M(M + 5)/2 parameters.

For larger M, the logistic regression model has a clear advantage.
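The parameter-count comparison is worth making concrete; a tiny sketch (function name is ours):

```python
def num_params(M):
    """Parameter counts for feature dimension M: logistic regression (just w)
    versus two Gaussians with a shared covariance (2 means + symmetric cov),
    i.e. 2M + M(M+1)/2 = M(M+5)/2."""
    logistic = M
    generative = 2 * M + M * (M + 1) // 2
    return logistic, generative
```

For M = 100, logistic regression has 100 parameters while the generative Gaussian model has 5250, which is the quadratic growth the slide warns about.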
24.
Logistic Regression is Classification

Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N, where φ_n = φ(x_n) and class membership is coded as t_n ∈ {0, 1}.

Likelihood function:

  p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},  where y_n = p(C1 | φ_n).

Error function: the negative log likelihood, resulting in the cross-entropy error function

  E(w) = -ln p(t | w) = -Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }.
25.
Logistic Regression is Classification

Error function (cross-entropy error):

  E(w) = -Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) },  y_n = p(C1 | φ_n) = σ(w^T φ_n)

Gradient of the error function (using dσ/da = σ(1 - σ)):

  ∇E(w) = Σ_{n=1}^{N} (y_n - t_n) φ_n

The gradient does not contain any sigmoid function: for each data point, the contribution is the product of the deviation y_n - t_n and the basis function φ_n.

BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
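The gradient above lends itself to a simple batch gradient-descent fit. This is only an illustrative sketch, not the Iterative Reweighted Least Squares method of a later section; the learning rate and iteration count are arbitrary choices, and the function name is ours:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, lr=0.1, iters=2000):
    """Minimise the cross-entropy error E(w) by batch gradient descent,
    using grad E(w) = sum_n (y_n - t_n) phi_n.
    Phi: (N, M) design matrix of features; t: (N,) labels in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        w -= lr * grad
    return w
```

On separable data this will over-fit (the weights keep growing), which is exactly the case where the slide recommends regularisation or MAP.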
26.
Laplace Approximation

Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?

Need to find a mode of p(x), and then try to find a Gaussian with the same mode.

[Figures: a non-Gaussian density (yellow) with its Gaussian approximation (red), and the negative logarithms of both.]
27.
Laplace Approximation

Assume p(z) can be written as

  p(z) = f(z) / Z

with normalisation Z = ∫ f(z) dz. Furthermore, assume Z is unknown!

A mode of p(z) is at a point z_0 where p'(z_0) = 0.

Taylor expansion of ln f(z) at z_0:

  ln f(z) ≈ ln f(z_0) - ½ A (z - z_0)^2

where

  A = - d²/dz² ln f(z) |_{z = z_0}.
28.
Laplace Approximation

Exponentiating

  ln f(z) ≈ ln f(z_0) - ½ A (z - z_0)^2

we get

  f(z) ≈ f(z_0) exp{ -(A/2) (z - z_0)^2 }.

And after normalisation we get the Laplace approximation

  q(z) = (A / 2π)^{1/2} exp{ -(A/2) (z - z_0)^2 }.

Only defined for precision A > 0, as only then does p(z) have a maximum at z_0.
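The one-dimensional recipe can be sketched directly: estimate A from the curvature of ln f at the mode (here by a finite difference, an implementation choice of ours), then build the normalised Gaussian q(z). Names are illustrative:

```python
import math

def laplace_1d(logf, z0, h=1e-4):
    """Laplace approximation around a known mode z0 of p(z) ∝ f(z).
    A = -(d^2/dz^2) ln f(z) at z0, estimated by central differences;
    q(z) = sqrt(A / 2pi) exp(-(A/2)(z - z0)^2) = N(z | z0, 1/A)."""
    A = -(logf(z0 + h) - 2 * logf(z0) + logf(z0 - h)) / h ** 2
    def q(z):
        return math.sqrt(A / (2 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)
    return A, q
```

Applied to an exactly Gaussian ln f (variance 0.25, mode 1), it recovers A = 1/0.25 = 4 and the correct peak height, since the Taylor expansion is then exact.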
29.
Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M, where

  p(z) = f(z) / Z.

We get the Taylor expansion

  ln f(z) ≈ ln f(z_0) - ½ (z - z_0)^T A (z - z_0)

where the Hessian A is defined as

  A = - ∇∇ ln f(z) |_{z = z_0}.

The Laplace approximation of p(z) is then

  q(z) = |A|^{1/2} / (2π)^{M/2} exp{ -½ (z - z_0)^T A (z - z_0) } = N(z | z_0, A^{-1}).
30.
Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.

Why? We would need to normalise a product of the prior and a likelihood which is itself a product of logistic sigmoid functions, one for each data point.

Evaluation of the predictive distribution is also intractable.

Therefore we will use the Laplace approximation.
31.
Bayesian Logistic Regression

Assume a Gaussian prior, because we want a Gaussian posterior:

  p(w) = N(w | m_0, S_0)

for fixed hyperparameters m_0 and S_0. (Hyperparameters are parameters of a prior distribution; in contrast to the model parameters w, they are not learned.)

For a set of training data (x_n, t_n), n = 1, ..., N, the posterior is given by

  p(w | t) ∝ p(w) p(t | w)

where t = (t_1, ..., t_N)^T.
32.
Bayesian Logistic Regression

Using our previous result for the cross-entropy function

  E(w) = -ln p(t | w) = -Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }

we can now calculate the log of the posterior p(w | t) ∝ p(w) p(t | w), using the notation y_n = σ(w^T φ_n), as

  ln p(w | t) = -½ (w - m_0)^T S_0^{-1} (w - m_0) + Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) } + const.
33.
Bayesian Logistic Regression

To obtain a Gaussian approximation to

  ln p(w | t) = -½ (w - m_0)^T S_0^{-1} (w - m_0) + Σ_{n=1}^{N} { t_n ln y_n + (1 - t_n) ln(1 - y_n) }

1 Find w_MAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function of w because y_n = σ(w^T φ_n).)
2 Calculate the second derivative of the negative log posterior to get the inverse covariance of the Laplace approximation:

  S_N^{-1} = -∇∇ ln p(w | t) = S_0^{-1} + Σ_{n=1}^{N} y_n (1 - y_n) φ_n φ_n^T.
34.
Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) to the posterior distribution is then

  q(w) = N(w | w_MAP, S_N)

where

  S_N^{-1} = S_0^{-1} + Σ_{n=1}^{N} y_n (1 - y_n) φ_n φ_n^T.
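Both steps, finding w_MAP and forming S_N, can be sketched together. Newton's method is one standard way to maximise the concave log posterior (an implementation choice of ours, not prescribed by the slides); all names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, iters=50):
    """Laplace approximation q(w) = N(w | wMAP, SN) to the logistic-regression
    posterior with Gaussian prior N(w | m0, S0).
    wMAP is found by Newton steps on the negative log posterior;
    SN^{-1} = S0^{-1} + sum_n yn (1 - yn) phi_n phi_n^T."""
    S0inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        grad = S0inv @ (w - m0) + Phi.T @ (y - t)   # grad of -ln p(w|t)
        H = S0inv + (Phi.T * (y * (1 - y))) @ Phi   # Hessian of -ln p(w|t)
        w = w - np.linalg.solve(H, grad)            # Newton update
    y = sigmoid(Phi @ w)
    SNinv = S0inv + (Phi.T * (y * (1 - y))) @ Phi
    return w, np.linalg.inv(SNinv)
```

At convergence the gradient of the negative log posterior vanishes at w_MAP, and S_N is a symmetric covariance matrix.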