Chapter 10
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 10. Approximate Inference
2
What are we doing?
In this chapter we carry out a lot of mathematical derivations.
Throughout, we should keep in mind what we are actually trying to do; otherwise it is easy to lose track of the goal!
Recall the latent variable models we covered in the previous chapter.
There, we had to compute the expectation of the complete-data log likelihood.
For the Gaussian mixture model this was tractable.
For many other models, however, this computation is very difficult.
1. For continuous latent variables it requires integration.
2. For discrete latent variables it requires summation.
The first case can be intractable when the integral has no closed form.
The second case can be intractable when there are exponentially many possible latent configurations to sum over,
e.g. large datasets where exhaustive enumeration is hopeless.
Under such conditions there are two possible treatments:
1. Sampling methods (Chapter 11)
2. Deterministic approximation (this chapter)
In this chapter we study approximate methods that work by restricting the functional form of the distributions involved!
Chapter 10.1. Variational Inference
3
Terminology and idea
Before moving on, we need some basic terminology for this chapter.
1. Function : a mapping that takes the value of a variable as input and returns the value of the function as output.
2. Derivative of a function : how the output value varies as we make infinitesimal (really small) changes to the input value.
3. Functional : a mapping that takes a function as input and returns the value of the functional as output.
4. An example of a functional is the entropy, H[p] = −∫ p(x) ln p(x) dx.
5. Derivative of a functional : analogous to case 2, but it describes how the functional's value changes for infinitesimal changes of the input function.
Now, let’s return to the EM algorithm.
Here, 𝐿(𝑞) could be maximized by making the KL term zero, which
happens exactly when q(Z) equals the true posterior p(Z|X)!
Then, what should we do when the posterior is hard to compute?
To make this feasible, we restrict the family of the variational distribution q(Z) to a tractable form,
while keeping it flexible enough to approximate the true posterior well!
Figure legend: yellow = original distribution, red = Laplace approximation, green = variational approximation.
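For reference, the decomposition being used here is the standard one from Chapter 9 (Bishop eqs. 10.2-10.4), restated in plain form:

ln p(X) = L(q) + KL(q ‖ p),
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ,
KL(q ‖ p) = −∫ q(Z) ln{ p(Z|X) / q(Z) } dZ.

Since KL(q ‖ p) ≥ 0, L(q) is a lower bound on ln p(X), and maximizing L(q) over a restricted family of q is the core of variational inference.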
Chapter 10.1. Variational Inference
4
Factorized distribution
First, let's begin with the factorization approach.
Here we restrict the family of the variational distribution 𝑞(𝑍).
The main idea is to partition the elements of 𝒁 into 𝑴 disjoint groups and assume q(Z) = ∏ᵢ qᵢ(Zᵢ).
Note that we do not restrict the specific functional form of each 𝑞𝑖(𝑍𝑖); we only assume this factorization into disjoint groups.
We can then rewrite 𝐿(𝑞) as follows.
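A plain restatement of the rewritten bound (following Bishop eq. 10.6): substituting the factorized q and isolating one factor q_j,

L(q) = ∫ ∏ᵢ qᵢ { ln p(X, Z) − Σᵢ ln qᵢ } dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} qᵢ dZᵢ } dZ_j − ∫ q_j ln q_j dZ_j + const
     = ∫ q_j ln p̃(X, Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const,

where ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const.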
Chapter 10.1. Variational Inference
5
Factorized distribution
Under this factorization, suppose we maximize 𝐿(𝑞) with respect to q_j while holding all other factors 𝑞𝑖≠𝑗 fixed.
The solution is easy, since 𝑳(𝒒) is then a negative KL divergence between 𝒒𝒋 and p̃(X, Z_j) (plus a constant).
So maximizing L(q) means minimizing KL(q_j ‖ p̃), which is achieved by q_j = p̃. Finally, we get ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const.
Note that the constant simply makes this a proper probability density.
Thus, we can recover the factor by normalizing the exponentiated right-hand side.
Let's recap once.
The original EM uses q(Z) = p(Z|X). This works well, but if computing the posterior is intractable we cannot proceed.
However, with the update ln q_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const, we can update the variational factors without ever computing the posterior explicitly!
Note that each update depends on the other factors 𝑞𝑖(𝑍𝑖), so it cannot be done in a single optimization step.
It must be done iteratively, cycling through and updating every 𝑞𝑖 in turn.
Now, let’s see an example!
Chapter 10.1. Variational Inference
6
Properties of factorized approximations
We can see the characteristics of this method by studying a factorized approximation to a correlated bivariate Gaussian.
Here, we express the factorized approximation as 𝑞(𝑧) = 𝑞1(𝑧1) 𝑞2(𝑧2).
Applying the general update equation, the right-hand side of ln q₁*(z₁) = E_{z₂}[ln p(z)] + const is quadratic in z₁.
This means q₁*(z₁) is itself a Gaussian distribution!
By symmetry, the same holds for q₂*(z₂).
Fortunately, this example therefore has a closed-form solution.
Furthermore, as you can see,
E[z₁] = μ₁ and E[z₂] = μ₂.
Thus, the mean of the approximating distribution
matches the mean of the original distribution (although the variances are underestimated).
This is shown in the following figure.
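A minimal numerical sketch of these coordinate updates, assuming an illustrative mean and covariance (it follows the closed-form updates m₁ = μ₁ − Λ₁₁⁻¹Λ₁₂(E[z₂] − μ₂) and its symmetric counterpart):

```python
import numpy as np

# Mean-field approximation q(z) = q1(z1) q2(z2) of a correlated 2-D Gaussian p(z) = N(mu, Sigma).
mu = np.array([1.0, -1.0])                  # true mean (illustrative values)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])              # true covariance with strong correlation
Lam = np.linalg.inv(Sigma)                  # precision matrix Lambda

m1, m2 = 0.0, 0.0                           # initial means of the two factors
for _ in range(50):                         # coordinate-wise updates
    m1 = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m2 - mu[1])
    m2 = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m1 - mu[0])

print("factor means:    ", m1, m2)                      # converge to the true means
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1])
print("true marginals:  ", Sigma[0, 0], Sigma[1, 1])    # larger than the factor variances
```

The means agree with the target, but the factor variances 1/Λ₁₁ and 1/Λ₂₂ are smaller than the true marginal variances, which is the compactness effect of minimizing KL(q‖p) discussed on the next slide.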
Chapter 10.1. Variational Inference
7
Properties of factorized approximations
Secondly, there is an alternative approach that reverses the order of the KL divergence, from KL(q‖p) to KL(p‖q).
Note that the KL divergence is not symmetric!
The quantity to minimize can be written as
KL(p‖q) = −∫ p(Z) ln q(Z) dZ + ∫ p(Z) ln p(Z) dZ, where only the first term depends on q.
For a factorized q, the optimal factor has a closed form:
q_j*(Z_j) is simply the corresponding marginal p(Z_j) of the distribution being approximated.
Figure: comparison of minimizing 𝑲𝑳(𝒒||𝒑) and 𝑲𝑳(𝒑||𝒒) on a multimodal distribution.
Minimizing KL(q‖p) tends to lock onto a single mode (mode-seeking),
whereas minimizing KL(p‖q) spreads its mass over all modes (moment-matching);
in the single-Gaussian example above, both recover the mean correctly.
Chapter 10.1. Variational Inference
8
Univariate Gaussian Example
Now, let's consider a univariate Gaussian example with mean 𝝁 and precision 𝝉, given a dataset 𝐷 = {𝑥1, 𝑥2, … , 𝑥𝑁}.
Here, the likelihood function and the conjugate prior are given as shown.
The goal of this task is to find the posterior distribution of these parameters!
Using the factorization, we express an approximate posterior q(μ, τ) = q_μ(μ) q_τ(τ).
Note that this expression is not the exact posterior, only a factorized approximation of it.
Major framework
Now, using 𝒑(𝑫, 𝝁, 𝝉) = 𝒑(𝑫|𝝁, 𝝉) 𝒑(𝝁|𝝉) 𝒑(𝝉), we derive the two factors 𝒒(𝝁) and 𝒒(𝝉) in turn.
When deriving q(μ), the 𝒑(𝝉) term does not involve μ,
so it is absorbed into the additive constant!
Chapter 10.1. Variational Inference
9
Univariate Gaussian Example
Major framework
The interesting part is that we did not assume any functional form for 𝑞(·), yet the optimal factors come out in the conjugate families (Gaussian for μ, Gamma for τ)!!
Here, the update for each factor involves moments of the other: q_μ needs E[τ], and q_τ needs moments of μ.
Thus, the two factors must be re-estimated in turn, each using the current expectations from the other.
Let's see how this iterative optimization proceeds!
Chapter 10.1. Variational Inference
10
Univariate Gaussian Example
Please see how 𝒒𝝁(𝝁)𝒒𝝉(𝝉) fits the desired true posterior 𝒑(𝝁, 𝝉|𝑫).
The update formulas contain certain moments of the parameters; these moments have closed-form expressions.
In the non-informative limit of the priors, the slide notes that the resulting estimate of 1/E[τ]
is the unbiased (UMVU) sample-variance estimator of a Gaussian.
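A minimal runnable sketch of this coordinate-ascent scheme, assuming synthetic data and broad priors (the updates follow from applying the general result ln q_j* = E_{i≠j}[ln p] + const to this model):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)     # synthetic data: true mu = 2, tau = 1/1.5^2
N, xbar = len(x), x.mean()

# Priors: p(mu | tau) = N(mu | mu0, (lam0*tau)^-1),  p(tau) = Gamma(tau | a0, b0)
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3        # broad (nearly non-informative) priors

E_tau = 1.0                                      # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | muN, 1/lamN): depends on E[tau]
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # q_tau(tau) = Gamma(tau | aN, bN): depends on moments of mu under q_mu
    aN = a0 + (N + 1) / 2
    E_quad = np.sum((x - muN) ** 2) + N / lamN          # E_mu[ sum_n (x_n - mu)^2 ]
    E_quad += lam0 * ((muN - mu0) ** 2 + 1 / lamN)      # + lam0 * E_mu[(mu - mu0)^2]
    bN = b0 + 0.5 * E_quad
    E_tau = aN / bN                                     # moment fed back into q_mu

print("E[mu]  =", round(float(muN), 3), "  (true mu  = 2.0)")
print("E[tau] =", round(float(E_tau), 3), "  (true tau =", round(1 / 1.5**2, 3), ")")
```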
Chapter 10.1. Variational Inference
11
Model comparison
This methodology can be applied not only to approximating posterior distributions, but also to model comparison!
Let 𝑚 index the candidate models. Since each model may have a different structure for its latent variables Z, we cannot simply share one factorization across models.
Instead, we use 𝑞(𝑍, 𝑚) = 𝑞(𝑍|𝑚) 𝑞(𝑚).
** For convenience, we assume discrete variables here
(that is why the bound involves summations rather than integrals).
Maximizing the bound with respect to q(m) gives the proportional relation shown on the slide.
Rather than optimizing everything jointly, we first optimize the bound with each individual
𝑞(𝑍|m) separately.
Then we compute 𝑞(𝑚) and compare the models!
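For reference, a restatement of the relation referred to above (cf. Bishop, Section 10.1.4): with the bound

L = Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, X, m) / ( q(Z|m) q(m) ) },

maximizing over q(m) gives q(m) ∝ p(m) exp(L_m), where L_m is the lower bound obtained with the optimized q(Z|m) for model m. Models can then be ranked by q(m).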
Chapter 10.2. Variational Mixture of Gaussians
12
Variational Mixture of Gaussians
Let's revisit the Gaussian mixture example.
Here, the variational approach enables a fully Bayesian treatment of the mixture.
What does 'fully Bayesian treatment' mean here?
Consider the parameters 𝜋, 𝜇𝑘, Σ𝑘 of a Gaussian mixture model.
We previously estimated them with EM, but that was a point-estimation approach: the parameters were treated as fixed constants.
Now we take the fully Bayesian route and place a distribution over every parameter. Let's see how the formulas look.
The slide's equations specify the latent-variable distribution, the likelihood, and the prior distributions over π, μ and Λ.
Our task is now to approximate the posterior over all of these quantities!
Chapter 10.2. Variational Mixture of Gaussians
13
Variational distribution
There are several difficulties in a direct implementation.
To make things easier, we use the variational method with the factorization trick!
The key variational distribution is the factorization shown on the slide, q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ).
Note that these factors are not the exact posterior.
We find the approximation in two stages:
1. Finding 𝑞(𝑍)
2. Finding 𝑞(𝜋, 𝜇, Λ)
1. Finding 𝒒⋆(𝒁) : we keep only the terms of the joint distribution
that involve the latent variable 𝑍.
This form is obtained simply
by applying the general update equation to those terms.
Here, D is the
dimensionality of the
data matrix X.
Chapter 10.2. Variational Mixture of Gaussians
14
Variational distribution
Here, please note that the optimal q*(Z) has the same functional form as the prior p(Z|π), a product of discrete distributions over the component labels.
From it we can compute the responsibilities, as shown.
For simplicity, let's also define some convenient responsibility-weighted statistics of the data.
Chapter 10.2. Variational Mixture of Gaussians
15
Variational distribution
Here, we can discover some interesting facts.
1. The terms involving 𝜋 are separate from those involving the component parameters (π couples only with 𝑍 and its Dirichlet prior).
2. Each pair (𝜇𝑘, Λ𝑘) remains coupled.
That is, we can further decompose this variational distribution into q(π) ∏ₖ q(μₖ, Λₖ).
2. Finding 𝒒⋆(𝝅, 𝝁, 𝚲)
Chapter 10.2. Variational Mixture of Gaussians
16
Variational distribution
Computing the responsibilities requires the expectations of several random quantities (the ones denoted by 𝔼[·] on the slide).
They can be written as shown.
Here, 𝜓(·) denotes the digamma function,
the derivative of the log of the gamma function.
Finally, we can compute responsibility as
Chapter 10.2. Variational Mixture of Gaussians
17
Variational distribution
Now, let's review the overall procedure!
1. Compute the moments covered on the previous slide and evaluate the responsibilities (the 𝐸-step analogue).
2. Update each variational posterior factor with the newly calculated statistics (the 𝑀-step analogue).
3. Continue this iteration until convergence.
This variational Bayesian GMM has the great strength that it automatically
searches for the effective number of components.
The model sequentially abandons relatively weak components,
where a weak component is one that takes responsibility for relatively few
data points: its expected mixing coefficient is driven toward zero.
Note that as the number of data points (𝑁) increases, the solution approaches that of the
basic Gaussian mixture (the MLE solution).
Even so, this model retains its advantages: it avoids the collapsing-component singularities of maximum likelihood and is less prone to overfitting.
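If scikit-learn is available, its BayesianGaussianMixture class implements a variational Bayesian GMM of this kind and exhibits the component-pruning behaviour; a small illustrative sketch (the dataset and prior values below are made up for the example):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D clusters, but the model is given 10 components.
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in ([0, 0], [5, 5], [0, 5])])

vb_gmm = BayesianGaussianMixture(
    n_components=10,                  # deliberately too many components
    weight_concentration_prior=1e-3,  # small concentration -> unused components shrink
    max_iter=500,
    random_state=0,
)
vb_gmm.fit(X)

print(np.round(vb_gmm.weights_, 3))   # most mixing coefficients collapse toward zero
print("final lower bound:", vb_gmm.lower_bound_)
```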
Chapter 10.2. Variational Mixture of Gaussians
18
Variational lower bound
Consider a typical deep-learning model: we print the training loss every few iterations.
Likewise, here we can print the lower bound to check whether the model is being trained properly,
because the lower bound L(q) should never decrease from one variational update to the next!
Here, 𝐻 indicates the entropy
of the Wishart distribution.
Chapter 10.2. Variational Mixture of Gaussians
19
Predictive distribution
Since this is a clustering algorithm, it should be able to predict the cluster membership of a newly observed data point.
Just like other Bayesian models, it marginalizes out the parameters to obtain the desired predictive distribution, which turns out to be a mixture of Student's t distributions!
I'll skip the derivation of the Student's t
distribution in this procedure.
The component 𝑘 that yields the greatest probability can be
chosen as the clustering result!
Chapter 10.2. Variational Mixture of Gaussians
20
Determining the number of components
Please note that the clustering result is not unique;
it depends on the initialization and on the values chosen for the priors!
We can choose the number of components by comparing the value of the lower bound L(q)
obtained for each candidate number of components.
This would be inappropriate for the basic GMM, since the likelihood of a GMM increases monotonically
as the number of components increases!
In the Bayesian approach, however, there is an intrinsic trade-off between model complexity
and fit, which acts as automatic regularization.
Thus, comparing 𝐿(𝑞) is reasonable for the Bayesian GMM.
Furthermore, as mentioned, the Bayesian GMM has the intrinsic ability to select the effective
number of components by driving unnecessary 𝜋𝑘 toward zero. This mechanism is called
automatic relevance determination.
Chapter 10.3. Variational Linear Regression
21
Fully Bayesian linear regression
Now we are studying variational approach of linear regression.
In the previous Bayesian linear models we did not place a distribution over the hyperparameter:
𝛼 in p(w|α) = N(w | 0, α⁻¹I) was a fixed constant, optimized by the evidence approximation.
Now, however, we treat 𝛼 as a random variable with its own distribution!
Here we give 𝛼 a Gamma distribution.
The joint distribution of all the variables can then be written as shown.
Now, let's find the variational distributions of 𝑤 and 𝛼. Using the basic factorization idea,
the approximation is q(w, α) = q(w) q(α), and the factors are obtained as shown.
Note that 𝑴 indicates the number of
basis functions!
Chapter 10.3. Variational Linear Regression
22
Fully Bayesian linear regression
Since the expression has the functional form of a Gamma density, the variational factor q(α) is a Gamma distribution!
This gives us 𝑞(𝛼). Similarly,
we can obtain the distribution of 𝑤.
Note that 𝑤 appears in a quadratic form, which indicates a
Gaussian distribution!
Note that the overall result differs from that of Chapter 3 only in that α is replaced by 𝑬[𝜶].
This expression becomes essentially the same as the EM / evidence-approximation result when we
set a0 = b0 = 0, which corresponds to having no prior
knowledge about the parameter 𝜶.
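A minimal runnable sketch of these coupled updates for q(w) = N(w | m_N, S_N) and q(α) = Gam(α | a_N, b_N), assuming synthetic cubic data, a polynomial basis, and a known noise precision β (all of these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Polynomial basis regression with a Gamma hyperprior on the weight precision alpha.
N, M, beta = 30, 4, 25.0                      # data size, number of basis functions, known noise precision
x = rng.uniform(-1, 1, N)
t = x**3 + rng.normal(0, beta**-0.5, N)       # ground truth is a cubic
Phi = np.vander(x, M, increasing=True)        # design matrix: columns 1, x, x^2, x^3

a0 = b0 = 1e-6                                # broad Gamma prior on alpha
E_alpha = 1.0
for _ in range(50):
    # q(w) = N(w | mN, SN), using the current E[alpha]
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t
    # q(alpha) = Gamma(alpha | aN, bN), using E[w^T w] = mN^T mN + Tr(SN)
    aN = a0 + M / 2
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))
    E_alpha = aN / bN

print("posterior mean weights:", np.round(mN, 3))   # close to (0, 0, 0, 1)
print("E[alpha]:", E_alpha)
```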
Chapter 10.3. Variational Linear Regression
23
Predictive distribution & Lower bound of variational linear regression
The big idea is the same as in Chapter 3; only the details differ.
Here we simply use 𝐸[𝛼] in place of a fixed 𝛼.
Similarly, we can evaluate the lower bound to compare different models.
Figure: the lower bound 𝐿(𝑞) (y-axis) plotted against the order of the polynomial model (x-axis);
the ground truth is a cubic (𝑋3),
and 𝐿(𝑞) is maximized at 𝑴 = 𝟑.
Chapter 10.4. Exponential Family Distributions
24
Exponential Family Distributions
We have covered the exponential family in Mathematical Statistics II and in PRML Chapter 2.
Let's review it briefly.
Note that 𝒈(𝜼) is the normalization coefficient that makes the expression a proper probability distribution!
Why is it important?
1. Most of the distributions we commonly use belong to the exponential family.
2. Distributions in this family have conjugate priors.
3. They have sufficient statistics (and UMVU estimators).
4. The posterior can be expressed in closed form.
Actually, many of the models we have covered so far are exponential-family models,
but we treated them as individual special cases. Let's now study them under the general exponential-family framework.
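For reference, a restatement of the general form assumed in this section (cf. Bishop, Section 10.4): the joint distribution of data and latent variables is taken to be

p(X, Z | η) = ∏ₙ h(xₙ, zₙ) g(η) exp{ ηᵀ u(xₙ, zₙ) },

with a conjugate prior of the form

p(η | ν₀, χ₀) = f(ν₀, χ₀) g(η)^{ν₀} exp{ ν₀ ηᵀ χ₀ }.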
Chapter 10.4. Exponential Family Distributions
25
Variational distribution
Now we have to define the variational distribution over 𝑧 and 𝜂.
Here we take 𝒒(𝒁, 𝜼) = 𝒒(𝒁) 𝒒(𝜼). Applying the general update equation, we can rewrite each factor as shown.
The variational distribution over Z
decomposes across the individual data points, q(Z) = ∏ₙ q(zₙ).
E-step analogue : compute the expected sufficient statistics of the latent variables (filling in the unseen latent variables).
M-step analogue : update each of the distributions using those expectations.
Furthermore, this scheme can be written
as variational message passing!
Chapter 10.5. Local Variational Methods
26
Local approach of variational methods
Note that all the previous methods were a 'global' approach: we approximated the full posterior over all variables at once.
Now we turn to a local approach, which bounds an individual function or factor within the overall model.
The reason for doing this is to simplify the resulting computations!!
First, recall that a convex function is one for which every chord lies on or above the function.
Here we make use of max / min relationships; let's discuss them through an example.
Consider the function 𝑓(𝑥) = exp(−𝑥), drawn as the red line on the slide.
We are trying to approximate 𝑓(𝑥) with a simpler function.
Using a first-order Taylor expansion, we can approximate f(x) by 𝒚(𝒙) = 𝒇(𝝃) + 𝒇′(𝝃)(𝒙 − 𝝃).
This is the tangent line of 𝑓(𝑥) at the point (𝜉, 𝑓(𝜉)).
As you can see in the left-hand figure, every tangent line lies beneath the original (convex) function.
Thus the inequality y(x) ≤ f(x) holds, with equality at x = ξ.
The tangent line above can then be rewritten as
𝒚(𝒙) = 𝐞𝐱𝐩(−𝝃) − 𝐞𝐱𝐩(−𝝃)(𝒙 − 𝝃).
Chapter 10.5. Local Variational Methods
27
Local approach of variational methods
For generality, let's define 𝝀 = −𝒆𝒙𝒑(−𝝃). The tangent line can then be rewritten as
𝑦(𝑥, 𝜆) = 𝜆𝑥 − 𝜆 + 𝜆 ln(−𝜆).
Now fix 𝑥. Different values of λ give different tangent lines, each lying below f.
As in the right figure, the tangent that touches at that particular x gives the largest of these approximations, and that value equals
the original function value.
Thus 𝒇(𝒙) = max over 𝝀 of (𝝀𝒙 − 𝝀 + 𝝀 𝐥𝐧(−𝝀)).
Now, let's move on to a general convex function: we fix 𝝀 and vary 𝒙!
We do not assume any particular functional shape; we only require convexity.
As in the left figure, a line through the
origin with slope λ can be turned into a tangent of the function by
shifting it vertically.
The required shift defines the quantity 𝑔(𝜆):
it is the minimum vertical distance between the original
𝑓(𝑥) and the line λx. (After shifting by 𝑔(𝜆), the line just touches, 'kisses', 𝑓(𝑥).)
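In other words (a standard convex-duality statement, consistent with the construction above): the shift function is

g(λ) = max over x of { λx − f(x) },

and the original function is recovered from the family of lines by

f(x) = max over λ of { λx − g(λ) },

so f and g play symmetric (conjugate) roles.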
Chapter 10.5. Local Variational Methods
28
Local approach of variational methods
Thus, finding g(λ) can be written as the maximization shown on the slide.
Now, let's fix 𝒙 and vary 𝝀, just as we did in the first example; the function is recovered as the corresponding maximum.
For a concave function, which looks like the right-hand figure, the same machinery applies.
** An easy way to tell them apart: think of the word 'cave' and its shape!
Back to the point: for concave functions we apply the same equations with the sides and signs exchanged,
that is, max becomes min and min becomes max (tangent lines now give upper bounds).
Actually, there is no need to memorize this; just draw a simple function and think it through!
Chapter 10.5. Local Variational Methods
29
Logistic sigmoid example
Upper bound of logistic.
Consider the logistic sigmoid function, defined as σ(x) = 1 / (1 + e⁻ˣ).
Note that it is neither concave nor convex.
However, if we take its logarithm, the resulting function ln σ(x) is concave. Therefore we can apply the conjugate construction;
the resulting conjugate function g(λ) is the binary entropy function, the same form that appears in the cross-entropy of a binary classification model!!
We then obtain the upper bound
𝐥𝐧 𝝈(𝒙) ≤ 𝝀𝒙 − 𝒈(𝝀), that is, 𝝈(𝒙) ≤ 𝒆𝒙𝒑(𝝀𝒙 − 𝒈(𝝀)).
That is, we can find an upper bound on the logistic function! (The reason why we want upper/lower
bounds will be covered soon!)
Lower bound of logistic.
Deriving a lower bound is a bit more complicated. Let's first rewrite ln 𝜎(𝑥) as follows.
Chapter 10.5. Local Variational Methods
30
Logistic sigmoid example
Note that ln σ(x) = −ln(1 + e⁻ˣ) = x/2 − ln(e^{x/2} + e^{−x/2}).
The x/2 term is simply linear in x, and the remaining part
𝑓(𝑥) = −ln(e^{x/2} + e^{−x/2})
turns out to be a convex function of the variable 𝑥². (Let's skip the proof.)
Using the general idea of convex functions and the max/min relation, we can therefore bound f from below by its tangent in the variable x².
Denote the tangency ('kissing') point by 𝜉; the slope of this tangent is the derivative of f with respect to x² evaluated at ξ, which defines the variational parameter λ(ξ), and rewriting gives the tangent bound.
Finally, exponentiating both sides, we can derive the lower bound on the logistic sigmoid shown on the slide.
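The resulting bound is the standard Jaakkola-Jordan form, σ(x) ≥ σ(ξ) exp{(x − ξ)/2 − λ(ξ)(x² − ξ²)} with λ(ξ) = (σ(ξ) − 1/2)/(2ξ); a small sketch that checks it numerically (the test values of ξ are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigmoid(xi) - 1/2) / (2*xi), the variational slope in x^2
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp((x - xi)/2 - lambda(xi) * (x^2 - xi^2))
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))

x = np.linspace(-8, 8, 1001)
for xi in (0.5, 2.0, 5.0):
    bound = sigmoid_lower_bound(x, xi)
    assert np.all(bound <= sigmoid(x) + 1e-12)           # it really is a lower bound
    # tangency at x = +/- xi, since the bound depends on x only through x^2
    assert np.isclose(sigmoid_lower_bound(xi, xi), sigmoid(xi))
    assert np.isclose(sigmoid_lower_bound(-xi, xi), sigmoid(-xi))
print("bound verified for all test values of xi")
```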
Chapter 10.5. Local Variational Methods
31
Logistic sigmoid example
Actually, this is the main point of this sub-section.
Why are we doing all this complicated work???
Because it can make an intractable integration possible.
Consider the Bayesian predictive distribution, an integral of σ(a) against p(a)!
Note that 𝑝(𝑎) is a Gaussian distribution,
yet the integral has no closed form. So what if we replace 𝜎(𝑎) with its lower bound,
𝝈(𝒂) ≥ 𝒇(𝒂, 𝝃)?
Then the integrand becomes a Gaussian times the exponential of a quadratic in a, which can be integrated analytically, giving a lower bound 𝑭(𝝃) on the integral.
We then choose 𝜉 to make 𝑭(𝝃) as large (tight) as possible.
Note that the best choice of 𝝃 depends on the value of 𝒂 (and hence on the data).
Let's see how this idea is used in the following section!
Chapter 10.6. Variational Logistic Regression
32
Variational approximation of logistic
Recall that it was essentially impossible to find a closed form for the posterior / predictive distribution in logistic regression.
Previously we used the Laplace approximation to compute such quantities.
Now we are going to approximate them with the variational method!
As in the previous section, we work with a lower bound; more precisely, we maximize the lower bound.
By the i.i.d. assumption, we can express the likelihood as a product over data points,
where 𝒂 = 𝒘ᵀ𝝓.
Here, please remember how we derived the lower bound of the logistic sigmoid!
Chapter 10.6. Variational Logistic Regression
33
Variational approximation of logistic
1. Note that 𝑎 (the input of the logistic sigmoid) was 𝑤ᵀ𝜙.
2. Under the variational setting, we assume the data points are independent and we introduce a separate variational parameter 𝝃𝒏 for every data point! Thus,
for computational convenience,
we take the logarithm of each factor.
Here, we are trying to obtain 𝒒(𝒘). From this, we can derive…
Chapter 10.6. Variational Logistic Regression
34
Variational approximation of logistic
Here, we treat 𝜉 as a fixed constant. Thus, we can collect the right-hand side as a function of 𝑤.
Note that 𝐰 appears in a quadratic form! Thus 𝒒(𝒘) is a Gaussian distribution!!
There is one interesting aspect of this result:
because the posterior precision and the mean term are sums over data points, we can also compute them in a sequential (online) manner, adding one data point's contribution at a time.
Likewise, 𝑚𝑁 can be accumulated in the same way!
Chapter 10.6. Variational Logistic Regression
35
Optimizing the variational parameters
We use an EM-style algorithm to optimize 𝑚𝑁, 𝑆𝑁 and 𝜉.
It is helpful to think of 𝑤 as a latent variable! Let's briefly recall EM.
(Note that the expectation in the Q function
is taken with respect to 𝒘.
Terms that do not depend on 𝝃 can be dropped,
and the 𝟎.𝟓 𝒘ᵀ𝝓 terms cancel
against the corresponding terms of the bound.)
In order to find the new 𝜉, we
take the derivative of this
𝑄 with respect to 𝜉 and set it to zero.
Chapter 10.6. Variational Logistic Regression
36
New variational parameter 𝝃𝒏
Let's briefly recap EM for variational logistic regression:
1. Initialize 𝜉𝑛 (old values).
2. Evaluate the posterior distribution q(𝑤) for the current 𝜉 (E-step analogue).
3. Compute the new 𝜉𝑛 from q(𝑤) (M-step analogue), and repeat until convergence.
Alternatively, we can compute 𝐿(𝑞) directly and set its derivative with respect to 𝜉𝑛 to zero, which yields the same update.
For the predictive distribution (which is the most crucial quantity in logistic regression),
we can reuse the same techniques as in Chapter 4 after this point.
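A minimal runnable sketch of this whole loop on synthetic data, assuming a Gaussian prior N(m₀, S₀) on w; the q(w) update uses S_N⁻¹ = S₀⁻¹ + 2 Σₙ λ(ξₙ) φₙφₙᵀ and m_N = S_N[S₀⁻¹m₀ + Σₙ(tₙ − 1/2)φₙ], and the ξ update uses ξₙ² = φₙᵀ(S_N + m_N m_Nᵀ)φₙ (the standard variational logistic regression updates):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigma(xi) - 1/2) / (2*xi); the xi -> 0 limit is 1/8
    xi = np.asarray(xi, dtype=float)
    return np.where(np.abs(xi) > 1e-8, (sigmoid(xi) - 0.5) / (2.0 * xi), 0.125)

rng = np.random.default_rng(0)
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])   # design matrix with bias column
w_true = np.array([-0.5, 2.0, -1.5])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)    # binary targets

m0, S0 = np.zeros(M), 10.0 * np.eye(M)       # Gaussian prior on w (illustrative values)
S0_inv = np.linalg.inv(S0)

xi = np.ones(N)                               # initialize xi_n (old values)
for _ in range(30):
    # E-step analogue: q(w) = N(w | mN, SN) for the current xi
    SN_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # M-step analogue: xi_n^2 = phi_n^T (SN + mN mN^T) phi_n
    A = SN + np.outer(mN, mN)
    xi = np.sqrt(np.einsum("nd,de,ne->n", Phi, A, Phi))

print("posterior mean of w:", np.round(mN, 2), " (true:", w_true, ")")
```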
Chapter 10.6. Variational Logistic Regression
37
Inference of hyper parameters
Previously there was a prior covariance 𝑆0 that we assumed to be known; now we treat the corresponding precision as a random variable.
This is again a fully Bayesian treatment! Consider the prior
𝑝(𝑤|𝛼) = 𝑁(𝑤|0, 𝛼⁻¹𝐼).
As usual, we give 𝛼 a conjugate Gamma distribution,
𝑝(𝛼) = 𝐺𝑎𝑚𝑚𝑎(𝛼|𝑎0, 𝑏0).
The marginal likelihood and the joint distribution are then given as shown.
As we have done many times, we introduce a variational distribution 𝒒(𝒘, 𝜶).
To proceed, we again write down the lower bound on the marginal likelihood,
but note that this 𝐿(𝑞) is still intractable
because of the sigmoid factors in 𝑝(𝑡|𝑤).
Thus, we additionally apply the local lower bound here.
Chapter 10.6. Variational Logistic Regression
38
Inference of hyper parameters
Using this fact,
let's compute the factors of 𝒒(𝒘, 𝜶) = 𝒒(𝒘)𝒒(𝜶) individually!
We can compute the lower bound as follows.
Why Normal?
Because 𝒘 appears in a quadratic
form inside the logarithm!
Why Gamma?
Because 𝜶 appears linearly (together with a ln α term)
inside the logarithm!
Chapter 10.6. Variational Logistic Regression
39
Inference of hyper parameters
Furthermore, we also have to update 𝜉 for the full model; the required expectations can be computed as shown.
Let's review the variational approach once more.
1. Sometimes it is hard to compute the posterior, or to integrate certain functions.
2. We therefore assume a variational (factorized) structure over the latent variables and parameters.
3. It can be broken into parts, e.g. 𝑞(𝑤, 𝛼) = 𝑞(𝑤)𝑞(𝛼).
4. The reason for this approximation is that the posterior (or the required integral) is intractable under the full joint distribution.
5. This is especially helpful in fully Bayesian models, which also place distributions on the hyper-priors.
6. Note that the optimization of such models can be carried out with an EM-style algorithm.
Let’s skip expectation propagation. We’ll cover it if we have time… (End of summer break is coming… Last semester is about to begin…!)
