1. Introduction to Statistical Machine Learning

Christfried Webers
Statistical Machine Learning Group
NICTA
and
College of Engineering and Computer Science
The Australian National University
Canberra

February – June 2011

© 2011 Christfried Webers, NICTA, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
2. Part VIII: Probabilistic Generative Models

Outline:
Probabilistic Generative Models
Continuous Input
Discrete Features
Probabilistic Discriminative Models
Logistic Regression
Iterative Reweighted Least Squares
Laplace Approximation
Bayesian Logistic Regression
3. Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2. Use decision theory to assign each new x to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2. Also, infer the prior class probabilities p(Ck).
3. Use Bayes' theorem to find the posterior p(Ck | x).
4. Alternatively, model the joint distribution p(x, Ck) directly.
5. Use decision theory to assign each new x to one of the classes.
4. Probabilistic Generative Models

Generative approach: model the class-conditional densities p(x | Ck) and the priors p(Ck) to calculate the posterior probability for class C1

    p(C1 | x) = p(x | C1) p(C1) / ( p(x | C1) p(C1) + p(x | C2) p(C2) )
              = 1 / (1 + exp(−a(x)))
              = σ(a(x))

where a(x) and the logistic sigmoid function σ(a) are given by

    a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ] = ln [ p(x, C1) / p(x, C2) ]
    σ(a) = 1 / (1 + exp(−a)).
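As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes p(C1 | x) through a(x) and the logistic sigmoid for two hypothetical one-dimensional Gaussian class-conditional densities; the means, variances and priors are made-up values.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def sigma(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

# hypothetical class-conditional densities and priors
p_c1, p_c2 = 0.4, 0.6
x = 0.3
a = (np.log(gauss_pdf(x, 1.0, 1.0) * p_c1)      # ln p(x | C1) p(C1)
     - np.log(gauss_pdf(x, -1.0, 1.0) * p_c2))  # - ln p(x | C2) p(C2)
print(sigma(a))                                 # p(C1 | x) = sigma(a(x))
```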
5. Logistic Sigmoid

The logistic sigmoid function σ(a) = 1 / (1 + exp(−a))
"Squashing function" because it maps the real axis into the finite interval (0, 1).
σ(−a) = 1 − σ(a)
Derivative: d/da σ(a) = σ(a) σ(−a) = σ(a) (1 − σ(a))
The inverse is called the logit function: a(σ) = ln [ σ / (1 − σ) ]

[Figure: the logistic sigmoid σ(a) (left) and the logit a(σ) (right).]
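A tiny numerical check of these identities (not from the slides), using NumPy and a finite-difference approximation for the derivative:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def logit(s):
    return np.log(s / (1.0 - s))

a = np.linspace(-5.0, 5.0, 11)
print(np.allclose(sigma(-a), 1.0 - sigma(a)))             # sigma(-a) = 1 - sigma(a)
print(np.allclose(logit(sigma(a)), a))                    # logit inverts sigma
eps = 1e-6                                                # finite-difference derivative
num_grad = (sigma(a + eps) - sigma(a - eps)) / (2 * eps)
print(np.allclose(num_grad, sigma(a) * (1.0 - sigma(a)), atol=1e-6))
```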
6. Probabilistic Generative Models - Multiclass

The normalised exponential is given by

    p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

where
    ak = ln( p(x | Ck) p(Ck) ).

Also called the softmax function, as it is a smoothed version of the max function.
Example: if ak ≫ aj for all j ≠ k, then p(Ck | x) ≈ 1 and p(Cj | x) ≈ 0.
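A minimal Python sketch of the normalised exponential (not from the slides), assuming NumPy; subtracting the maximum a_k before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(a):
    """Normalised exponential of a_k = ln( p(x | C_k) p(C_k) )."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())      # shift by max(a) for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # ordinary case
print(softmax([50.0, 1.0, 0.1]))   # a_k >> a_j: posterior close to (1, 0, 0)
```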
7. Probabilistic Generative Model - Continuous Input

Assume the class-conditional probabilities are Gaussian and all classes share the same covariance. What can we say about the posterior probabilities?

    p(x | Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ −(1/2) (x − µk)^T Σ^{−1} (x − µk) }
              = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) exp{ −(1/2) x^T Σ^{−1} x }
                × exp{ µk^T Σ^{−1} x − (1/2) µk^T Σ^{−1} µk }

where we separated the quadratic term in x and the linear term.
8. Probabilistic Generative Model - Continuous Input

For two classes
    p(C1 | x) = σ(a(x))

and a(x) is

    a(x) = ln [ p(x | C1) p(C1) / ( p(x | C2) p(C2) ) ]
         = ln [ exp{ µ1^T Σ^{−1} x − (1/2) µ1^T Σ^{−1} µ1 } / exp{ µ2^T Σ^{−1} x − (1/2) µ2^T Σ^{−1} µ2 } ] + ln [ p(C1) / p(C2) ]

Therefore
    p(C1 | x) = σ(w^T x + w0)

where
    w  = Σ^{−1} (µ1 − µ2)
    w0 = −(1/2) µ1^T Σ^{−1} µ1 + (1/2) µ2^T Σ^{−1} µ2 + ln [ p(C1) / p(C2) ]
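The mapping from the two Gaussian class-conditional densities to w and w0 can be written down directly. A minimal Python sketch (the means, covariance and priors are illustrative values, not from the slides):

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_class_posterior(x, mu1, mu2, Sigma, p_c1, p_c2):
    """p(C1 | x) = sigma(w^T x + w0) for Gaussian class conditionals
    with shared covariance Sigma."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sinv @ mu1
          + 0.5 * mu2 @ Sinv @ mu2
          + np.log(p_c1 / p_c2))
    return sigma(w @ x + w0)

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(two_class_posterior(np.array([0.5, 0.0]), mu1, mu2, Sigma, 0.5, 0.5))
```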
9. Probabilistic Generative Model - Continuous Input

[Figure: class-conditional densities for two classes (left); posterior probability p(C1 | x) (right). Note that the posterior is a logistic sigmoid of a linear function of x.]
10. General Case - K Classes, Shared Covariance

Use the normalised exponential

    p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

where
    ak = ln( p(x | Ck) p(Ck) )

to get a linear function of x

    ak(x) = wk^T x + wk0

where
    wk  = Σ^{−1} µk
    wk0 = −(1/2) µk^T Σ^{−1} µk + ln p(Ck).
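The same construction for K classes, as a small Python sketch (made-up means, shared covariance and priors); the linear functions a_k(x) are pushed through the softmax to give the posteriors:

```python
import numpy as np

def class_posteriors(x, mus, Sigma, priors):
    """p(C_k | x) via a_k(x) = w_k^T x + w_k0 with shared covariance Sigma."""
    Sinv = np.linalg.inv(Sigma)
    a = np.array([mu @ Sinv @ x - 0.5 * mu @ Sinv @ mu + np.log(p)
                  for mu, p in zip(mus, priors)])
    e = np.exp(a - a.max())          # softmax with stability shift
    return e / e.sum()

mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
print(class_posteriors(np.array([1.0, 1.0]), mus, np.eye(2), [1/3, 1/3, 1/3]))
```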
11. General Case - K Classes, Different Covariance

If each class-conditional probability has a different covariance, the quadratic terms −(1/2) x^T Σk^{−1} x no longer cancel each other out.
We get a quadratic discriminant.

[Figure: example of the resulting decision boundaries, including a quadratic boundary.]
12. Maximum Likelihood Solution

Given the functional form of the class-conditional densities p(x | Ck), can we determine the parameters µ and Σ?
Not without data ;-)
Given also a data set (xn, tn) for n = 1, . . . , N (using the coding scheme where tn = 1 corresponds to class C1 and tn = 0 denotes class C2).
Assume the class-conditional densities to be Gaussian with the same covariance, but different means.
Denote the prior probability p(C1) = π, and therefore p(C2) = 1 − π.
Then

    p(xn, C1) = p(C1) p(xn | C1) = π N(xn | µ1, Σ)
    p(xn, C2) = p(C2) p(xn | C2) = (1 − π) N(xn | µ2, Σ)
13. Maximum Likelihood Solution

Thus the likelihood for the whole data set X and t is given by

    p(t, X | π, µ1, µ2, Σ) = ∏_{n=1}^{N} [π N(xn | µ1, Σ)]^{tn} [(1 − π) N(xn | µ2, Σ)]^{1−tn}

Maximise the log likelihood. The term depending on π is

    ∑_{n=1}^{N} ( tn ln π + (1 − tn) ln(1 − π) )

which is maximal for

    π = (1/N) ∑_{n=1}^{N} tn = N1 / N = N1 / (N1 + N2)

where N1 is the number of data points in class C1 (and N2 the number in class C2).
14. Maximum Likelihood Solution

Similarly, we can maximise the log likelihood (and thereby the likelihood p(t, X | π, µ1, µ2, Σ)) with respect to the means µ1 and µ2, and get

    µ1 = (1/N1) ∑_{n=1}^{N} tn xn
    µ2 = (1/N2) ∑_{n=1}^{N} (1 − tn) xn

For each class, these are the means of all input vectors assigned to that class.
15. Maximum Likelihood Solution

Finally, the log likelihood ln p(t, X | π, µ1, µ2, Σ) can be maximised for the covariance Σ, resulting in

    Σ = (N1/N) S1 + (N2/N) S2
    Sk = (1/Nk) ∑_{n ∈ Ck} (xn − µk)(xn − µk)^T
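Putting the three maximum likelihood results together, a minimal Python sketch (not from the slides) that estimates π, µ1, µ2 and the shared Σ from a synthetic data set; the data-generating parameters are arbitrary illustrative choices:

```python
import numpy as np

def fit_shared_covariance(X, t):
    """ML estimates (pi, mu1, mu2, Sigma) for the two-class generative model,
    following the slide formulas. X: (N, D) inputs; t: (N,) labels, 1 for C1, 0 for C2."""
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N
    mu1 = (t[:, None] * X).sum(axis=0) / N1
    mu2 = ((1.0 - t)[:, None] * X).sum(axis=0) / N2
    S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1
    S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
    Sigma = (N1 / N) * S1 + (N2 / N) * S2
    return pi, mu1, mu2, Sigma

# illustrative synthetic data: 50 points per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([-2.0, 0.0], 1.0, size=(50, 2))])
t = np.hstack([np.ones(50), np.zeros(50)])
print(fit_shared_covariance(X, t))
```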
16. Discrete Features - Naive Bayes

Assume the input space consists of discrete features, in the simplest case xi ∈ {0, 1}.
For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries.
Together with the normalisation constraint, these are 2^D − 1 independent variables.
This grows exponentially with the number of features.
The naive Bayes assumption is that all features, conditioned on the class Ck, are independent of each other:

    p(x | Ck) = ∏_{i=1}^{D} µki^{xi} (1 − µki)^{1−xi}
17. Discrete Features - Naive Bayes

With the naive Bayes assumption

    p(x | Ck) = ∏_{i=1}^{D} µki^{xi} (1 − µki)^{1−xi}

we can then again find the factors ak in the normalised exponential

    p(Ck | x) = p(x | Ck) p(Ck) / ∑_j p(x | Cj) p(Cj) = exp(ak) / ∑_j exp(aj)

as a linear function of the xi

    ak(x) = ∑_{i=1}^{D} { xi ln µki + (1 − xi) ln(1 − µki) } + ln p(Ck).
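A minimal Python sketch of this Bernoulli naive Bayes posterior (not from the slides); the feature probabilities µ_ki and the priors below are made-up values:

```python
import numpy as np

def naive_bayes_log_joint(x, mu, prior):
    """a_k(x) = sum_i [ x_i ln(mu_ki) + (1 - x_i) ln(1 - mu_ki) ] + ln p(C_k)
    for one class; mu is the vector (mu_k1, ..., mu_kD)."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu)) + np.log(prior)

def naive_bayes_posteriors(x, mus, priors):
    a = np.array([naive_bayes_log_joint(x, mu, p) for mu, p in zip(mus, priors)])
    e = np.exp(a - a.max())     # normalised exponential
    return e / e.sum()

x = np.array([1, 0, 1, 1])
mus = [np.array([0.8, 0.1, 0.7, 0.6]), np.array([0.2, 0.5, 0.3, 0.4])]
print(naive_bayes_posteriors(x, mus, [0.5, 0.5]))
```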
18. Three Models for Decision Problems

In increasing order of complexity:

Find a discriminant function f(x) which maps each input directly onto a class label.

Discriminative Models
1. Solve the inference problem of determining the posterior class probabilities p(Ck | x).
2. Use decision theory to assign each new x to one of the classes.

Generative Models
1. Solve the inference problem of determining the class-conditional probabilities p(x | Ck).
2. Also, infer the prior class probabilities p(Ck).
3. Use Bayes' theorem to find the posterior p(Ck | x).
4. Alternatively, model the joint distribution p(x, Ck) directly.
5. Use decision theory to assign each new x to one of the classes.
19. Probabilistic Discriminative Models

Maximise a likelihood function defined through the conditional distribution p(Ck | x) directly.
This is called discriminative training.
Typically fewer parameters need to be determined.
As we learn the posterior p(Ck | x) directly, prediction may be better than with a generative model whose class-conditional density assumptions p(x | Ck) poorly approximate the true distributions.
But: discriminative models can not create synthetic data, as p(x) is not modelled.
20. Original Input versus Feature Space

We have used the direct input x until now.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centered at the green crosses in the input space.

[Figure: original input space (x1, x2) with the two basis-function centres (left); the corresponding feature space (φ1, φ2) (right).]
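A minimal Python sketch of such a feature map, assuming two Gaussian basis functions; the centres and the width s are illustrative choices, not the ones used in the figure:

```python
import numpy as np

def gaussian_features(x, centres, s=0.5):
    """Map a 2-D input x to feature space via Gaussian basis functions
    phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))."""
    x = np.asarray(x, dtype=float)
    return np.exp(-np.sum((x - centres) ** 2, axis=1) / (2 * s ** 2))

centres = np.array([[-0.5, 0.0], [0.5, 0.0]])     # two basis-function centres
print(gaussian_features([0.2, -0.3], centres))     # phi(x) = (phi_1(x), phi_2(x))
```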
21. Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in input space, they will also overlap in feature space.
Nonlinear features φ(x) can not remove the overlap; but they may increase it!

[Figure: input space (x1, x2) (left) and feature space (φ1, φ2) (right) with the corresponding decision boundaries.]
22. Original Input versus Feature Space

Fixed basis functions do not adapt to the data and therefore have important limitations (see the discussion in Linear Regression).
Understanding more advanced algorithms becomes easier if we introduce the feature space now and use it instead of the original input space.
Some applications use fixed features successfully by avoiding the limitations.
We will therefore use φ instead of x from now on.
23. Logistic Regression is Classification

Two classes, where the posterior of class C1 is a logistic sigmoid σ() acting on a linear function of the feature vector φ:

    p(C1 | φ) = y(φ) = σ(w^T φ)
    p(C2 | φ) = 1 − p(C1 | φ)

The model dimension is equal to the dimension of the feature space M.
Compare this to fitting two Gaussians:

    2M (means) + M(M + 1)/2 (shared covariance) = M(M + 5)/2 parameters

For larger M, the logistic regression model has a clear advantage.
24. Logistic Regression is Classification

Determine the parameters via maximum likelihood for data (φn, tn), n = 1, . . . , N, where φn = φ(xn). The class membership is coded as tn ∈ {0, 1}.
Likelihood function

    p(t | w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn}

where yn = p(C1 | φn).
Error function: the negative log likelihood, resulting in the cross-entropy error function

    E(w) = − ln p(t | w) = − ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
25. Logistic Regression is Classification

Error function (cross-entropy error)

    E(w) = − ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }
    yn = p(C1 | φn) = σ(w^T φn)

Gradient of the error function (using dσ/da = σ(1 − σ))

    ∇E(w) = ∑_{n=1}^{N} (yn − tn) φn

The gradient does not contain any sigmoid function.
For each data point, the contribution to the gradient is the product of the deviation yn − tn and the basis function φn.
BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
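A minimal Python sketch of maximum likelihood training with this gradient, using plain batch gradient descent on a synthetic one-dimensional problem; the learning rate, iteration count and optional L2 term are illustrative choices, not from the slides:

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_regression(Phi, t, lr=0.5, n_iter=2000, lam=0.0):
    """Minimise the cross-entropy error by gradient descent.
    grad E(w) = sum_n (y_n - t_n) phi_n; lam adds an optional L2 penalty."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigma(Phi @ w)
        grad = Phi.T @ (y - t) + lam * w
        w -= lr * grad / N
    return w

# toy data: one bias feature plus one input feature
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
print(fit_logistic_regression(Phi, t))
```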
26. Laplace Approximation

Given a continuous distribution p(x) which is not Gaussian, can we approximate it by a Gaussian q(x)?
We need to find a mode of p(x) and try to find a Gaussian with the same mode.

[Figure: non-Gaussian distribution (yellow) and its Gaussian approximation (red) (left); negative logarithms of the non-Gaussian distribution (yellow) and of the Gaussian approximation (red) (right).]
27. Laplace Approximation

Assume p(z) can be written as

    p(z) = (1/Z) f(z)

with normalisation Z = ∫ f(z) dz.
Furthermore, assume Z is unknown!
A mode of p(z) is at a point z0 where p′(z0) = 0.
Taylor expansion of ln f(z) at z0:

    ln f(z) ≈ ln f(z0) − (1/2) A (z − z0)^2

where

    A = − d²/dz² ln f(z) |_{z=z0}
28. Laplace Approximation

Exponentiating

    ln f(z) ≈ ln f(z0) − (1/2) A (z − z0)^2

we get

    f(z) ≈ f(z0) exp{ −(A/2) (z − z0)^2 }.

And after normalisation we get the Laplace approximation

    q(z) = (A / 2π)^{1/2} exp{ −(A/2) (z − z0)^2 }.

Only defined for precision A > 0, as only then does p(z) have a maximum at z0.
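A minimal Python sketch of the one-dimensional Laplace approximation (not from the slides), assuming NumPy and SciPy: the mode is found numerically, A is estimated by finite differences, and the unnormalised density f(z) below is a made-up example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_1d(log_f, z_init=0.0):
    """Laplace approximation of p(z) proportional to f(z): locate the mode z0 of
    ln f, estimate A = -d^2/dz^2 ln f(z0) by finite differences, and return the
    Gaussian parameters (mean z0, variance 1/A)."""
    res = minimize_scalar(lambda z: -log_f(z), bracket=(z_init - 1.0, z_init + 1.0))
    z0 = res.x
    h = 1e-4
    A = -(log_f(z0 + h) - 2 * log_f(z0) + log_f(z0 - h)) / h ** 2
    return z0, 1.0 / A

# example: unnormalised density f(z) = sigma(20 z + 4) * exp(-z^2 / 2)
log_f = lambda z: -np.logaddexp(0.0, -(20 * z + 4)) - 0.5 * z ** 2
print(laplace_1d(log_f))    # (mode z0, variance 1/A) of the Gaussian q(z)
```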
29. Laplace Approximation - Vector Space

Approximate p(z) for z ∈ R^M,

    p(z) = (1/Z) f(z).

We get the Taylor expansion

    ln f(z) ≈ ln f(z0) − (1/2) (z − z0)^T A (z − z0)

where the Hessian A is defined as

    A = − ∇∇ ln f(z) |_{z=z0}.

The Laplace approximation of p(z) is then

    q(z) = |A|^{1/2} / (2π)^{M/2} exp{ −(1/2) (z − z0)^T A (z − z0) } = N(z | z0, A^{−1})
30. Bayesian Logistic Regression

Exact Bayesian inference for logistic regression is intractable.
Why? We would need to normalise a product of a prior probability and a likelihood which is itself a product of logistic sigmoid functions, one for each data point.
Evaluation of the predictive distribution is also intractable.
Therefore we will use the Laplace approximation.
31. Bayesian Logistic Regression

Assume a Gaussian prior, because we want a Gaussian posterior:

    p(w) = N(w | m0, S0)

for fixed hyperparameters m0 and S0.
Hyperparameters are parameters of a prior distribution. In contrast to the model parameters w, they are not learned.
For a set of training data (xn, tn), where n = 1, . . . , N, the posterior is given by

    p(w | t) ∝ p(w) p(t | w)

where t = (t1, . . . , tN)^T.
32. Bayesian Logistic Regression

Using our previous result for the cross-entropy function

    E(w) = − ln p(t | w) = − ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) }

we can now calculate the log of the posterior

    p(w | t) ∝ p(w) p(t | w)

using the notation yn = σ(w^T φn) as

    ln p(w | t) = −(1/2) (w − m0)^T S0^{−1} (w − m0)
                  + ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) } + const.
33. Bayesian Logistic Regression

To obtain a Gaussian approximation to

    ln p(w | t) = −(1/2) (w − m0)^T S0^{−1} (w − m0) + ∑_{n=1}^{N} { tn ln yn + (1 − tn) ln(1 − yn) } + const.

1. Find wMAP which maximises ln p(w | t). This defines the mean of the Gaussian approximation. (Note: this is a nonlinear function in w because yn = σ(w^T φn).)
2. Calculate the second derivatives of the negative log posterior to get the inverse covariance of the Laplace approximation

    SN^{−1} = − ∇∇ ln p(w | t) = S0^{−1} + ∑_{n=1}^{N} yn (1 − yn) φn φn^T.
34. Bayesian Logistic Regression

The Gaussian approximation (via the Laplace approximation) of the posterior distribution is now

    q(w) = N(w | wMAP, SN)

where

    SN^{−1} = − ∇∇ ln p(w | t) = S0^{−1} + ∑_{n=1}^{N} yn (1 − yn) φn φn^T.
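Putting the pieces together, a minimal Python sketch (not from the slides) of the Laplace approximation for Bayesian logistic regression: Newton's method finds wMAP, and the Hessian of the negative log posterior gives SN^{−1}. The prior, synthetic data and iteration count are illustrative choices.

```python
import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_posterior(Phi, t, m0, S0, n_iter=25):
    """q(w) = N(w | w_MAP, S_N) via Newton's method on the negative log posterior."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = sigma(Phi @ w)
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # gradient of -ln p(w | t)
        H = S0_inv + Phi.T @ ((y * (1 - y))[:, None] * Phi)  # Hessian = S_N^{-1}
        w = w - np.linalg.solve(H, grad)
    return w, np.linalg.inv(H)                               # (w_MAP, S_N)

# illustrative synthetic data with a bias feature and one input feature
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=60)
t = (x + 0.5 * rng.normal(size=60) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
w_map, SN = laplace_posterior(Phi, t, np.zeros(2), 10.0 * np.eye(2))
print(w_map, SN, sep="\n")
```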