08 linear classification_2

Presentation Transcript

    • Introduction to Statistical Machine Learning

      Christfried Webers
      Statistical Machine Learning Group, NICTA, and
      College of Engineering and Computer Science, The Australian National University
      Canberra, February – June 2011
      © 2011 Christfried Webers, NICTA, The Australian National University

      (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
    • Part VIII: Linear Classification 2

      Probabilistic Generative Models
        Continuous Input
        Discrete Features
      Probabilistic Discriminative Models
      Logistic Regression
      Iterative Reweighted Least Squares
      Laplace Approximation
      Bayesian Logistic Regression
    • Three Models for Decision Problems

      In increasing order of complexity:

      Discriminant Functions
        Find a discriminant function f(x) which maps each input directly onto a class label.

      Discriminative Models
        1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
        2. Use decision theory to assign each new x to one of the classes.

      Generative Models
        1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
        2. Also, infer the prior class probabilities p(C_k).
        3. Use Bayes' theorem to find the posterior p(C_k | x).
        4. Alternatively, model the joint distribution p(x, C_k) directly.
        5. Use decision theory to assign each new x to one of the classes.
    • Probabilistic Generative Models

      Generative approach: model the class-conditional densities p(x | C_k) and the
      priors p(C_k) to calculate the posterior probability for class C_1:

      $$ p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)}
                       = \frac{1}{1 + \exp(-a(x))} = \sigma(a(x)) $$

      where a and the logistic sigmoid function σ(a) are given by

      $$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
              = \ln \frac{p(x, C_1)}{p(x, C_2)}, \qquad
         \sigma(a) = \frac{1}{1 + \exp(-a)}. $$
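As a quick sanity check with made-up numbers (not from the slides): if p(x | C_1) p(C_1) = 0.3 and p(x | C_2) p(C_2) = 0.1, the sigmoid of the log-odds reproduces the normalised posterior:

```latex
a(x) = \ln\frac{0.3}{0.1} = \ln 3 \approx 1.10, \qquad
\sigma(a(x)) = \frac{1}{1 + e^{-\ln 3}} = \frac{1}{1 + \tfrac{1}{3}} = 0.75 = \frac{0.3}{0.3 + 0.1}
```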
    • Logistic Sigmoid

      The logistic sigmoid function $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is called a
      "squashing function" because it maps the real axis into the finite interval (0, 1).

      $$ \sigma(-a) = 1 - \sigma(a) $$
      $$ \frac{d}{da}\sigma(a) = \sigma(a)\,\sigma(-a) = \sigma(a)\,(1 - \sigma(a)) $$

      The inverse is called the logit function:
      $$ a(\sigma) = \ln \frac{\sigma}{1 - \sigma} $$

      [Plots: logistic sigmoid σ(a) (left) and logit a(σ) (right).]
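A small numerical sketch of these identities (illustrative only, not from the slides): a stable sigmoid, the logit as its inverse, and checks of the symmetry and derivative relations.

```python
import numpy as np

def sigmoid(a):
    # Stable evaluation: avoid exp overflow for large negative a.
    out = np.empty_like(a, dtype=float)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    e = np.exp(a[~pos])
    out[~pos] = e / (1.0 + e)
    return out

def logit(s):
    # Inverse of the sigmoid: a(sigma) = ln(sigma / (1 - sigma)).
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)
s = sigmoid(a)
assert np.allclose(sigmoid(-a), 1.0 - s)                 # sigma(-a) = 1 - sigma(a)
ds = (sigmoid(a + 1e-6) - sigmoid(a - 1e-6)) / 2e-6      # numerical derivative
assert np.allclose(ds, s * (1.0 - s), atol=1e-6)         # d sigma / da = sigma (1 - sigma)
assert np.allclose(logit(s), a)                          # logit inverts the sigmoid
```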
    • Probabilistic Generative Models - Multiclass

      The normalised exponential is given by
      $$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
                       = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$
      where
      $$ a_k = \ln\big( p(x \mid C_k)\, p(C_k) \big). $$

      Also called the softmax function, as it is a smoothed version of the max function.
      Example: if $a_k \gg a_j$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$ and
      $p(C_j \mid x) \approx 0$.
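A minimal softmax sketch (illustrative only), with the usual subtract-the-max trick for numerical stability; the example activations are made up:

```python
import numpy as np

def softmax(a):
    """Normalised exponential: p_k = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()              # shifting by a constant leaves the result unchanged
    e = np.exp(a)
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))     # smooth weighting: ~[0.09, 0.24, 0.67]
print(softmax([1.0, 2.0, 30.0]))    # a_k >> a_j: the posterior concentrates on class k
```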
    • Probabilistic Generative Model - Continuous Input

      Assume the class-conditional probabilities are Gaussian and all classes share the
      same covariance. What can we say about the posterior probabilities?

      $$ p(x \mid C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
         \exp\Big\{ -\tfrac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \Big\} $$
      $$ \qquad = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
         \exp\Big\{ -\tfrac{1}{2} x^T \Sigma^{-1} x \Big\}
         \exp\Big\{ \mu_k^T \Sigma^{-1} x - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \Big\} $$

      where we have separated the quadratic term in x from the linear term.
    • Probabilistic Generative Model - Continuous Input

      For two classes, p(C_1 | x) = σ(a(x)), and a(x) is

      $$ a(x) = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}
              = \ln \frac{\exp\big\{ \mu_1^T \Sigma^{-1} x - \tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 \big\}}
                         {\exp\big\{ \mu_2^T \Sigma^{-1} x - \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 \big\}}
                + \ln \frac{p(C_1)}{p(C_2)} $$

      Therefore
      $$ p(C_1 \mid x) = \sigma(w^T x + w_0) $$
      where
      $$ w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad
         w_0 = -\tfrac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \tfrac{1}{2} \mu_2^T \Sigma^{-1} \mu_2
               + \ln \frac{p(C_1)}{p(C_2)}. $$
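A sketch of the resulting linear discriminant (not from the slides), assuming the Gaussian parameters and priors are already known; the numbers are made up for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def shared_cov_two_class(mu1, mu2, Sigma, prior1, prior2):
    """w and w0 such that p(C1 | x) = sigmoid(w.T x + w0) for shared-covariance Gaussians."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2
          + np.log(prior1 / prior2))
    return w, w0

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])   # illustrative parameters
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, w0 = shared_cov_two_class(mu1, mu2, Sigma, prior1=0.5, prior2=0.5)
x = np.array([0.5, 0.2])
print(sigmoid(w @ x + w0))   # p(C1 | x)
```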
    • Probabilistic Generative Model - Continuous Input

      [Figure: class-conditional densities for two classes (left); posterior probability
      p(C_1 | x) (right). Note that the posterior is a logistic sigmoid of a linear
      function of x.]
    • General Case - K Classes, Shared Covariance

      Use the normalised exponential
      $$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
                       = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$
      where $a_k = \ln\big( p(x \mid C_k)\, p(C_k) \big)$, to get a linear function of x:
      $$ a_k(x) = w_k^T x + w_{k0} $$
      where
      $$ w_k = \Sigma^{-1} \mu_k, \qquad
         w_{k0} = -\tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln p(C_k). $$
    • General Case - K Classes, Different Covariance

      If each class-conditional probability has a different covariance, the quadratic
      terms $-\tfrac{1}{2} x^T \Sigma_k^{-1} x$ no longer cancel each other out.
      We get a quadratic discriminant.

      [Figure: decision boundaries illustrating a quadratic discriminant.]
    • Maximum Likelihood Solution

      Given the functional form of the class-conditional densities p(x | C_k), can we
      determine the parameters µ and Σ? Not without data ;-)

      Given also a data set (x_n, t_n) for n = 1, ..., N, using the coding scheme where
      t_n = 1 corresponds to class C_1 and t_n = 0 to class C_2.

      Assume the class-conditional densities to be Gaussian with the same covariance,
      but different means.

      Denote the prior probability p(C_1) = π, and therefore p(C_2) = 1 − π.

      Then
      $$ p(x_n, C_1) = p(C_1)\, p(x_n \mid C_1) = \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) $$
      $$ p(x_n, C_2) = p(C_2)\, p(x_n \mid C_2) = (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) $$
    • Maximum Likelihood Solution

      Thus the likelihood for the whole data set X and t is given by
      $$ p(t, X \mid \pi, \mu_1, \mu_2, \Sigma)
         = \prod_{n=1}^{N} \big[ \pi\, \mathcal{N}(x_n \mid \mu_1, \Sigma) \big]^{t_n}
           \big[ (1 - \pi)\, \mathcal{N}(x_n \mid \mu_2, \Sigma) \big]^{1 - t_n} $$

      Maximise the log likelihood. The term depending on π is
      $$ \sum_{n=1}^{N} \big( t_n \ln \pi + (1 - t_n) \ln(1 - \pi) \big) $$
      which is maximal for
      $$ \pi = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2} $$
      where N_1 is the number of data points in class C_1.
    • Maximum Likelihood Solution

      Similarly, we can maximise the log likelihood (and thereby the likelihood
      p(t, X | π, µ_1, µ_2, Σ)) with respect to the means µ_1 and µ_2, and get
      $$ \mu_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n\, x_n, \qquad
         \mu_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n)\, x_n $$
      For each class, these are the means of all input vectors assigned to that class.
    • Maximum Likelihood Solution

      Finally, the log likelihood ln p(t, X | π, µ_1, µ_2, Σ) can be maximised with
      respect to the covariance Σ, resulting in
      $$ \Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2, \qquad
         S_k = \frac{1}{N_k} \sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^T $$
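Putting the maximum likelihood formulas together, a sketch of fitting pi, mu_1, mu_2 and the shared Sigma from a labelled data set; this is illustrative code, not from the lecture, and the toy data are generated on the spot:

```python
import numpy as np

def fit_gaussian_generative(X, t):
    """ML estimates pi, mu1, mu2, Sigma for t_n = 1 (class C1) / t_n = 0 (class C2)."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 1], 1.0, size=(50, 2)),      # toy class C1
               rng.normal([-1, -1], 1.0, size=(50, 2))])   # toy class C2
t = np.array([1] * 50 + [0] * 50)
print(fit_gaussian_generative(X, t))
```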
    • Discrete Features - Naive Bayes

      Assume the input space consists of discrete features, in the simplest case
      x_i ∈ {0, 1}.

      For a D-dimensional input space, a general distribution would be represented by a
      table with 2^D entries. Together with the normalisation constraint, these are
      2^D − 1 independent variables, which grows exponentially with the number of
      features.

      The naive Bayes assumption is that all features, conditioned on the class C_k,
      are independent of each other:
      $$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\, (1 - \mu_{ki})^{1 - x_i} $$
    • Discrete Features - Naive Bayes

      With the naive Bayes assumption
      $$ p(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\, (1 - \mu_{ki})^{1 - x_i} $$
      we can again find the factors a_k in the normalised exponential
      $$ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}
                       = \frac{\exp(a_k)}{\sum_j \exp(a_j)} $$
      as a linear function of the x_i:
      $$ a_k(x) = \sum_{i=1}^{D} \big\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \big\}
                  + \ln p(C_k). $$
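A sketch of the Bernoulli naive Bayes posterior using the linear form of a_k(x) above; illustrative only, with made-up parameters mu_ki and priors:

```python
import numpy as np

def naive_bayes_posterior(x, mu, priors):
    """x: binary feature vector (D,); mu: per-class Bernoulli parameters (K, D);
    priors: class priors (K,). Returns p(C_k | x) for all k via softmax of a_k(x)."""
    a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
    a = a - a.max()                      # numerical stability before exponentiating
    e = np.exp(a)
    return e / e.sum()

mu = np.array([[0.8, 0.1, 0.6],          # hypothetical mu_ki for class C1
               [0.2, 0.7, 0.5]])         # hypothetical mu_ki for class C2
priors = np.array([0.5, 0.5])
print(naive_bayes_posterior(np.array([1, 0, 1]), mu, priors))
```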
    • Three Models for Decision Problems

      In increasing order of complexity:

      Discriminant Functions
        Find a discriminant function f(x) which maps each input directly onto a class label.

      Discriminative Models
        1. Solve the inference problem of determining the posterior class probabilities p(C_k | x).
        2. Use decision theory to assign each new x to one of the classes.

      Generative Models
        1. Solve the inference problem of determining the class-conditional probabilities p(x | C_k).
        2. Also, infer the prior class probabilities p(C_k).
        3. Use Bayes' theorem to find the posterior p(C_k | x).
        4. Alternatively, model the joint distribution p(x, C_k) directly.
        5. Use decision theory to assign each new x to one of the classes.
    • Probabilistic Discriminative Models

      Maximise a likelihood function defined through the conditional distribution
      p(C_k | x) directly: discriminative training.

      Typically fewer parameters to be determined.

      As we learn the posterior p(C_k | x) directly, prediction may be better than with
      a generative model whose class-conditional density assumptions p(x | C_k) poorly
      approximate the true distributions.

      But: discriminative models cannot create synthetic data, as p(x) is not modelled.
    • Original Input versus Feature Space

      We have used the direct input x until now. All classification algorithms also
      work if we first apply a fixed nonlinear transformation of the inputs using a
      vector of basis functions φ(x).

      Example: use two Gaussian basis functions centred at the green crosses in the
      input space.

      [Figure: input space (x_1, x_2) with the two basis-function centres (left) and
      the corresponding feature space (φ_1, φ_2) (right).]
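A sketch of such a fixed nonlinear feature map: two Gaussian basis functions with hypothetical centres and width (the slide's actual centres and scale are not given):

```python
import numpy as np

def gaussian_features(X, centres, s=0.5):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 s^2)) for each basis-function centre c_j."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

centres = np.array([[-0.5, 0.0], [0.5, 0.0]])   # assumed centres, for illustration
X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(gaussian_features(X, centres))            # rows are feature vectors phi(x_n)
```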
    • Original Input versus Feature Space

      Linear decision boundaries in the feature space correspond to nonlinear decision
      boundaries in the input space.

      Classes which are NOT linearly separable in the input space can become linearly
      separable in the feature space.

      BUT: if classes overlap in the input space, they will also overlap in the feature
      space. Nonlinear features φ(x) cannot remove the overlap; but they may increase it!

      [Figure: the same two-class data shown in the input space (x_1, x_2) and in the
      feature space (φ_1, φ_2).]
    • Original Input versus Feature Space

      Fixed basis functions do not adapt to the data and therefore have important
      limitations (see the discussion in Linear Regression).

      Understanding of more advanced algorithms becomes easier if we introduce the
      feature space now and use it instead of the original input space.

      Some applications use fixed features successfully by avoiding the limitations.

      We will therefore use φ instead of x from now on.
    • Logistic Regression is Classification

      Two classes, where the posterior of class C_1 is a logistic sigmoid σ() acting on
      a linear function of the feature vector φ:
      $$ p(C_1 \mid \phi) = y(\phi) = \sigma(w^T \phi), \qquad
         p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi) $$

      The model dimension is equal to the dimension M of the feature space.

      Compare this to fitting two Gaussians:
      $$ \underbrace{2M}_{\text{means}} + \underbrace{M(M+1)/2}_{\text{shared covariance}}
         = M(M+5)/2 $$

      For larger M, the logistic regression model has a clear advantage.
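As a quick worked instance of that parameter count (my numbers, not the lecture's): for M = 100 features,

```latex
\underbrace{2M}_{\text{two means}} + \underbrace{\tfrac{M(M+1)}{2}}_{\text{shared covariance}}
  = 200 + 5050 = 5250 \quad\text{parameters, versus } M = 100 \text{ for logistic regression.}
```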
    • Logistic Regression is Classification

      Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N,
      where φ_n = φ(x_n). The class membership is coded as t_n ∈ {0, 1}.

      Likelihood function
      $$ p(t \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n} $$
      where y_n = p(C_1 | φ_n).

      Error function: the negative log likelihood, resulting in the cross-entropy error
      function
      $$ E(w) = -\ln p(t \mid w)
              = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$
    • Logistic Regression is Classification

      Error function (cross-entropy error)
      $$ E(w) = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\},
         \qquad y_n = p(C_1 \mid \phi_n) = \sigma(w^T \phi_n) $$

      Gradient of the error function (using dσ/da = σ(1 − σ)):
      $$ \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n $$

      The gradient does not contain any sigmoid function: for each data point, the
      error contribution is the product of the deviation y_n − t_n and the basis
      function φ_n.

      BUT: the maximum likelihood solution can exhibit over-fitting even for many data
      points; one should then use a regularised error or MAP.
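A minimal sketch that trains logistic regression by plain gradient descent on the cross-entropy error using the gradient above; this is my illustration, not the lecture's iterative reweighted least squares, and the step size, iteration count and toy data are arbitrary:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(Phi, t, lr=0.1, n_iter=2000):
    """Gradient descent on E(w); grad E(w) = Phi.T @ (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        w -= lr * Phi.T @ (y - t) / len(t)    # averaged gradient step
    return w

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.0, 1.0, size=(50, 2)),     # toy class C1
               rng.normal(-1.0, 1.0, size=(50, 2))])   # toy class C2
t = np.array([1.0] * 50 + [0.0] * 50)
Phi = np.hstack([np.ones((100, 1)), X])                # bias feature plus raw inputs as phi(x)
w = fit_logistic_gd(Phi, t)
print(w, (sigmoid(Phi @ w) > 0.5).mean())              # weights and fraction predicted as C1
```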
    • Laplace Approximation

      Given a continuous distribution p(x) which is not Gaussian, can we approximate it
      by a Gaussian q(x)?

      Need to find a mode of p(x) and try to find a Gaussian with the same mode.

      [Figure: a non-Gaussian distribution (yellow) and its Gaussian approximation
      (red) (left); the corresponding negative logarithms (right).]
    • Laplace Approximation

      Assume p(z) can be written as
      $$ p(z) = \frac{1}{Z} f(z) $$
      with normalisation Z = ∫ f(z) dz. Furthermore, assume Z is unknown!

      A mode of p(z) is at a point z_0 where p′(z_0) = 0.

      Taylor expansion of ln f(z) at z_0:
      $$ \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2 $$
      where
      $$ A = -\frac{d^2}{dz^2} \ln f(z) \Big|_{z = z_0} $$
    • Laplace Approximation

      Exponentiating
      $$ \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2 $$
      we get
      $$ f(z) \simeq f(z_0) \exp\Big\{ -\tfrac{A}{2} (z - z_0)^2 \Big\}. $$
      After normalisation we get the Laplace approximation
      $$ q(z) = \Big( \frac{A}{2\pi} \Big)^{1/2} \exp\Big\{ -\tfrac{A}{2} (z - z_0)^2 \Big\}. $$

      It is only defined for precision A > 0, as only then does p(z) have a maximum.
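A small numerical sketch of the 1-D Laplace approximation (my own example, not from the slides): pick an unnormalised f(z), locate the mode on a grid, and estimate the precision A by a finite-difference second derivative of ln f:

```python
import numpy as np

def laplace_1d(log_f, z_grid, h=1e-4):
    """Return (z0, A) so that q(z) = N(z | z0, 1/A) approximates p(z) proportional to exp(log_f(z))."""
    z0 = z_grid[np.argmax(log_f(z_grid))]                  # crude mode search on a grid
    A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h**2
    return z0, A

# Made-up unnormalised density: f(z) = exp(-z^2/2) * sigma(4 z)
log_f = lambda z: -0.5 * z**2 - np.log1p(np.exp(-4.0 * z))
z0, A = laplace_1d(log_f, np.linspace(-3, 3, 10001))
print(z0, A, 1.0 / np.sqrt(A))   # mode, precision, and standard deviation of q(z)
```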
    • Laplace Approximation - Vector Space

      Approximate p(z) for z ∈ R^M, where
      $$ p(z) = \frac{1}{Z} f(z). $$
      We get the Taylor expansion
      $$ \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} (z - z_0)^T A (z - z_0) $$
      where the Hessian A is defined as
      $$ A = -\nabla \nabla \ln f(z) \big|_{z = z_0}. $$
      The Laplace approximation of p(z) is then
      $$ q(z) = \frac{|A|^{1/2}}{(2\pi)^{M/2}}
                \exp\Big\{ -\tfrac{1}{2} (z - z_0)^T A (z - z_0) \Big\}
              = \mathcal{N}(z \mid z_0, A^{-1}) $$
    • Bayesian Logistic Regression

      Exact Bayesian inference for logistic regression is intractable.

      Why? We would need to normalise a product of the prior and the likelihood, which
      is itself a product of logistic sigmoid functions, one for each data point.

      Evaluation of the predictive distribution is also intractable.

      Therefore we will use the Laplace approximation.
    • Bayesian Logistic Regression

      Assume a Gaussian prior (because we want a Gaussian posterior):
      $$ p(w) = \mathcal{N}(w \mid m_0, S_0) $$
      for fixed hyperparameters m_0 and S_0.

      Hyperparameters are parameters of a prior distribution. In contrast to the model
      parameters w, they are not learned.

      For a set of training data (x_n, t_n), where n = 1, ..., N, the posterior is
      given by
      $$ p(w \mid t) \propto p(w)\, p(t \mid w) $$
      where t = (t_1, ..., t_N)^T.
    • Bayesian Logistic Regression

      Using our previous result for the cross-entropy function
      $$ E(w) = -\ln p(t \mid w)
              = -\sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$
      we can now calculate the log of the posterior p(w | t) ∝ p(w) p(t | w), using the
      notation y_n = σ(w^T φ_n), as
      $$ \ln p(w \mid t) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
         + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$
    • Bayesian Logistic Regression

      To obtain a Gaussian approximation to
      $$ \ln p(w \mid t) = -\tfrac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)
         + \sum_{n=1}^{N} \big\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \big\} $$

      1. Find w_MAP which maximises ln p(w | t). This defines the mean of the Gaussian
         approximation. (Note: this is a nonlinear function of w because
         y_n = σ(w^T φ_n).)
      2. Calculate the second derivative of the negative log likelihood to get the
         inverse covariance of the Laplace approximation
         $$ S_N^{-1} = -\nabla \nabla \ln p(w \mid t)
                     = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T. $$
    • Bayesian Logistic Regression

      The Gaussian approximation (via the Laplace approximation) of the posterior
      distribution is now
      $$ q(w) = \mathcal{N}(w \mid w_{MAP}, S_N) $$
      where
      $$ S_N^{-1} = -\nabla \nabla \ln p(w \mid t)
                  = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T. $$
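To close the loop, a sketch (my own illustration, not the lecture's code) that finds w_MAP by gradient ascent on ln p(w | t) and then forms S_N from the formula above; the prior, step size and toy data are arbitrary choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, lr=0.1, n_iter=5000):
    """Return (w_MAP, S_N) of the Gaussian approximation q(w) = N(w | w_MAP, S_N)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = -S0_inv @ (w - m0) + Phi.T @ (t - y)   # gradient of ln p(w | t)
        w += lr * grad / len(t)
    y = sigmoid(Phi @ w)
    SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
    return w, np.linalg.inv(SN_inv)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1.0, 1.0, size=(30, 2)),
               rng.normal(-1.0, 1.0, size=(30, 2))])
t = np.array([1.0] * 30 + [0.0] * 30)
Phi = np.hstack([np.ones((60, 1)), X])                # bias feature plus raw inputs as phi(x)
w_map, S_N = laplace_posterior(Phi, t, m0=np.zeros(3), S0=10.0 * np.eye(3))
print(w_map)
print(S_N)
```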