Chapter 10
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 10. Approximate Inference
2
What are we doing?
In this chapter we carry out a lot of mathematical derivations.
Throughout, we should keep in mind what we are actually trying to do; otherwise it is easy to lose track of the goal!
Recall the latent variable models we covered in the previous chapter.
There, we had to compute the expectation of the complete-data log likelihood.
For the Gaussian mixture model this was tractable.
For many other models, however, this computation is very difficult.
1. For continuous latent variables it requires integration.
2. For discrete latent variables it requires summation.
The first case can be intractable when the integral has no closed form.
The second case can be intractable when there are exponentially many possible latent configurations to sum over,
e.g. large datasets where exhaustive enumeration is hopeless.
Under such conditions there are two possible treatments:
1. Sampling methods (Chapter 11)
2. Deterministic approximation (this chapter)
In this chapter we study approximate methods that work by restricting the functional form of the distributions involved!
Chapter 10.1. Variational Inference
3
Terminology and idea
Before moving on, we need some basic terminology for this chapter.
1. Function : a mapping that takes the value of a variable as input and returns the value of the function as output.
2. Derivative of a function : how the output value varies as we make infinitesimal (really small) changes to the input value.
3. Functional : a mapping that takes a function as input and returns the value of the functional as output.
4. An example of a functional is the entropy, H[p] = −∫ p(x) ln p(x) dx.
5. Derivative of a functional : analogous to case 2, but it describes how the functional's value changes for infinitesimal changes of the input function.
Now, let’s return to the EM algorithm.
Here, 𝐿(𝑞) could be maximized by making the KL term zero, which
happens exactly when q(Z) equals the true posterior p(Z|X)!
Then, what should we do when the posterior is hard to compute?
To make this feasible, we restrict the family of the variational distribution q(Z) to a tractable form,
while keeping it flexible enough to approximate the true posterior well!
Figure legend: yellow = original distribution, red = Laplace approximation, green = variational approximation.
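For reference, the decomposition being used here is the standard one from Chapter 9 (Bishop eqs. 10.2-10.4), restated in plain form:

ln p(X) = L(q) + KL(q ‖ p),
L(q) = ∫ q(Z) ln{ p(X, Z) / q(Z) } dZ,
KL(q ‖ p) = −∫ q(Z) ln{ p(Z|X) / q(Z) } dZ.

Since KL(q ‖ p) ≥ 0, L(q) is a lower bound on ln p(X), and maximizing L(q) over a restricted family of q is the core of variational inference.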
Chapter 10.1. Variational Inference
4
Factorized distribution
First, let's begin with the factorization approach.
Here we restrict the family of the variational distribution 𝑞(𝑍).
The main idea is to partition the elements of 𝒁 into 𝑴 disjoint groups and assume q(Z) = ∏ᵢ qᵢ(Zᵢ).
Note that we do not restrict the specific functional form of each 𝑞𝑖(𝑍𝑖); we only assume this factorization into disjoint groups.
We can then rewrite 𝐿(𝑞) as follows.
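A plain restatement of the rewritten bound (following Bishop eq. 10.6): substituting the factorized q and isolating one factor q_j,

L(q) = ∫ ∏ᵢ qᵢ { ln p(X, Z) − Σᵢ ln qᵢ } dZ
     = ∫ q_j { ∫ ln p(X, Z) ∏_{i≠j} qᵢ dZᵢ } dZ_j − ∫ q_j ln q_j dZ_j + const
     = ∫ q_j ln p̃(X, Z_j) dZ_j − ∫ q_j ln q_j dZ_j + const,

where ln p̃(X, Z_j) = E_{i≠j}[ln p(X, Z)] + const.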
Chapter 10.1. Variational Inference
5
Factorized distribution
Under this factorization, suppose we maximize 𝐿(𝑞) with respect to q_j while holding all other factors 𝑞𝑖≠𝑗 fixed.
The solution is easy, since 𝑳(𝒒) is then a negative KL divergence between 𝒒𝒋 and p̃(X, Z_j) (plus a constant).
So maximizing L(q) means minimizing KL(q_j ‖ p̃), which is achieved by q_j = p̃. Finally, we get ln q_j*(Z_j) = E_{i≠j}[ln p(X, Z)] + const.
Note that the constant simply makes this a proper probability density.
Thus, we can recover the factor by normalizing the exponentiated right-hand side.
Let's recap once.
The original EM uses q(Z) = p(Z|X). This works well, but if computing the posterior is intractable we cannot proceed.
However, with the update ln q_j(Z_j) = E_{i≠j}[ln p(X, Z)] + const, we can update the variational factors without ever computing the posterior explicitly!
Note that each update depends on the other factors 𝑞𝑖(𝑍𝑖), so it cannot be done in a single optimization step.
It must be done iteratively, cycling through and updating every 𝑞𝑖 in turn.
Now, let’s see an example!
Chapter 10.1. Variational Inference
6
Properties of factorized approximations
We can see the characteristics of this method by studying a factorized approximation to a correlated bivariate Gaussian.
Here, we express the factorized approximation as 𝑞(𝑧) = 𝑞1(𝑧1) 𝑞2(𝑧2).
Applying the general update equation, the right-hand side of ln q₁*(z₁) = E_{z₂}[ln p(z)] + const is quadratic in z₁.
This means q₁*(z₁) is itself a Gaussian distribution!
By symmetry, the same holds for q₂*(z₂).
Fortunately, this example therefore has a closed-form solution.
Furthermore, as you can see,
E[z₁] = μ₁ and E[z₂] = μ₂.
Thus, the mean of the approximating distribution
matches the mean of the original distribution (although the variances are underestimated).
This is shown in the following figure.
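A minimal numerical sketch of these coordinate updates, assuming an illustrative mean and covariance (it follows the closed-form updates m₁ = μ₁ − Λ₁₁⁻¹Λ₁₂(E[z₂] − μ₂) and its symmetric counterpart):

```python
import numpy as np

# Mean-field approximation q(z) = q1(z1) q2(z2) of a correlated 2-D Gaussian p(z) = N(mu, Sigma).
mu = np.array([1.0, -1.0])                  # true mean (illustrative values)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])              # true covariance with strong correlation
Lam = np.linalg.inv(Sigma)                  # precision matrix Lambda

m1, m2 = 0.0, 0.0                           # initial means of the two factors
for _ in range(50):                         # coordinate-wise updates
    m1 = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m2 - mu[1])
    m2 = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m1 - mu[0])

print("factor means:    ", m1, m2)                      # converge to the true means
print("factor variances:", 1 / Lam[0, 0], 1 / Lam[1, 1])
print("true marginals:  ", Sigma[0, 0], Sigma[1, 1])    # larger than the factor variances
```

The means agree with the target, but the factor variances 1/Λ₁₁ and 1/Λ₂₂ are smaller than the true marginal variances, which is the compactness effect of minimizing KL(q‖p) discussed on the next slide.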
Chapter 10.1. Variational Inference
7
Properties of factorized approximations
Secondly, there is an alternative approach that reverses the order of the KL divergence, from KL(q‖p) to KL(p‖q).
Note that the KL divergence is not symmetric!
The quantity to minimize can be written as
KL(p‖q) = −∫ p(Z) ln q(Z) dZ + ∫ p(Z) ln p(Z) dZ, where only the first term depends on q.
For a factorized q, the optimal factor has a closed form:
q_j*(Z_j) is simply the corresponding marginal p(Z_j) of the distribution being approximated.
Figure: comparison of minimizing 𝑲𝑳(𝒒||𝒑) and 𝑲𝑳(𝒑||𝒒) on a multimodal distribution.
Minimizing KL(q‖p) tends to lock onto a single mode (mode-seeking),
whereas minimizing KL(p‖q) spreads its mass over all modes (moment-matching);
in the single-Gaussian example above, both recover the mean correctly.
Chapter 10.1. Variational Inference
8
Univariate Gaussian Example
Now, let's consider a univariate Gaussian example with mean 𝝁 and precision 𝝉, given a dataset 𝐷 = {𝑥1, 𝑥2, … , 𝑥𝑁}.
Here, the likelihood function and the conjugate prior are given as shown.
The goal of this task is to find the posterior distribution of these parameters!
Using the factorization, we express an approximate posterior q(μ, τ) = q_μ(μ) q_τ(τ).
Note that this expression is not the exact posterior, only a factorized approximation of it.
Major framework
Now, using 𝒑(𝑫, 𝝁, 𝝉) = 𝒑(𝑫|𝝁, 𝝉) 𝒑(𝝁|𝝉) 𝒑(𝝉), we derive the two factors 𝒒(𝝁) and 𝒒(𝝉) in turn.
When deriving q(μ), the 𝒑(𝝉) term does not involve μ,
so it is absorbed into the additive constant!
Chapter 10.1. Variational Inference
9
Univariate Gaussian Example
Major framework
The interesting part is that we did not assume any functional form for 𝑞(·), yet the optimal factors come out in the conjugate families (Gaussian for μ, Gamma for τ)!!
Here, the update for each factor involves moments of the other: q_μ needs E[τ], and q_τ needs moments of μ.
Thus, the two factors must be re-estimated in turn, each using the current expectations from the other.
Let's see how this iterative optimization proceeds!
Chapter 10.1. Variational Inference
10
Univariate Gaussian Example
Please see how 𝒒𝝁(𝝁)𝒒𝝉(𝝉) fits the desired true posterior 𝒑(𝝁, 𝝉|𝑫).
The update formulas contain certain moments of the parameters; these moments have closed-form expressions.
In the non-informative limit of the priors, the slide notes that the resulting estimate of 1/E[τ]
is the unbiased (UMVU) sample-variance estimator of a Gaussian.
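A minimal runnable sketch of this coordinate-ascent scheme, assuming synthetic data and broad priors (the updates follow from applying the general result ln q_j* = E_{i≠j}[ln p] + const to this model):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)     # synthetic data: true mu = 2, tau = 1/1.5^2
N, xbar = len(x), x.mean()

# Priors: p(mu | tau) = N(mu | mu0, (lam0*tau)^-1),  p(tau) = Gamma(tau | a0, b0)
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3        # broad (nearly non-informative) priors

E_tau = 1.0                                      # initial guess for E[tau]
for _ in range(100):
    # q_mu(mu) = N(mu | muN, 1/lamN): depends on E[tau]
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    # q_tau(tau) = Gamma(tau | aN, bN): depends on moments of mu under q_mu
    aN = a0 + (N + 1) / 2
    E_quad = np.sum((x - muN) ** 2) + N / lamN          # E_mu[ sum_n (x_n - mu)^2 ]
    E_quad += lam0 * ((muN - mu0) ** 2 + 1 / lamN)      # + lam0 * E_mu[(mu - mu0)^2]
    bN = b0 + 0.5 * E_quad
    E_tau = aN / bN                                     # moment fed back into q_mu

print("E[mu]  =", round(float(muN), 3), "  (true mu  = 2.0)")
print("E[tau] =", round(float(E_tau), 3), "  (true tau =", round(1 / 1.5**2, 3), ")")
```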
Chapter 10.1. Variational Inference
11
Model comparison
This methodology can be applied not only to approximating posterior distributions, but also to model comparison!
Let 𝑚 index the candidate models. Since each model may have a different structure for its latent variables Z, we cannot simply share one factorization across models.
Instead, we use 𝑞(𝑍, 𝑚) = 𝑞(𝑍|𝑚) 𝑞(𝑚).
** For convenience, we assume discrete variables here
(that is why the bound involves summations rather than integrals).
Maximizing the bound with respect to q(m) gives the proportional relation shown on the slide.
Rather than optimizing everything jointly, we first optimize the bound with each individual
𝑞(𝑍|m) separately.
Then we compute 𝑞(𝑚) and compare the models!
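For reference, a restatement of the relation referred to above (cf. Bishop, Section 10.1.4): with the bound

L = Σ_m Σ_Z q(Z|m) q(m) ln{ p(Z, X, m) / ( q(Z|m) q(m) ) },

maximizing over q(m) gives q(m) ∝ p(m) exp(L_m), where L_m is the lower bound obtained with the optimized q(Z|m) for model m. Models can then be ranked by q(m).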
Chapter 10.2. Variational Mixture of Gaussians
12
Variational Mixture of Gaussians
Let's revisit the Gaussian mixture example.
Here, the variational approach enables a fully Bayesian treatment of the mixture.
What does 'fully Bayesian treatment' mean here?
Consider the parameters 𝜋, 𝜇𝑘, Σ𝑘 of a Gaussian mixture model.
We previously estimated them with EM, but that was a point-estimation approach: the parameters were treated as fixed constants.
Now we take the fully Bayesian route and place a distribution over every parameter. Let's see how the formulas look.
The slide's equations specify the latent-variable distribution, the likelihood, and the prior distributions over π, μ and Λ.
Our task is now to approximate the posterior over all of these quantities!
Chapter 10.2. Variational Mixture of Gaussians
13
Variational distribution
There are several difficulties in a direct implementation.
To make things easier, we use the variational method with the factorization trick!
The key variational distribution is the factorization shown on the slide, q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ).
Note that these factors are not the exact posterior.
We find the approximation in two stages:
1. Finding 𝑞(𝑍)
2. Finding 𝑞(𝜋, 𝜇, Λ)
1. Finding 𝒒⋆(𝒁) : we keep only the terms of the joint distribution
that involve the latent variable 𝑍.
This form is obtained simply
by applying the general update equation to those terms.
Here, D is the
dimensionality of the
data matrix X.
Chapter 10.2. Variational Mixture of Gaussians
14
Variational distribution
Here, please note that the optimal q*(Z) has the same functional form as the prior p(Z|π), a product of discrete distributions over the component labels.
From it we can compute the responsibilities, as shown.
For simplicity, let's also define some convenient responsibility-weighted statistics of the data.
Chapter 10.2. Variational Mixture of Gaussians
15
Variational distribution
Here, we can discover some interesting facts.
1. The terms involving 𝜋 are separate from those involving the component parameters (π couples only with 𝑍 and its Dirichlet prior).
2. Each pair (𝜇𝑘, Λ𝑘) remains coupled.
That is, we can further decompose this variational distribution into q(π) ∏ₖ q(μₖ, Λₖ).
2. Finding 𝒒⋆(𝝅, 𝝁, 𝚲)
Chapter 10.2. Variational Mixture of Gaussians
16
Variational distribution
Computing the responsibilities requires the expectations of several random quantities (the ones denoted by 𝔼[·] on the slide).
They can be written as shown.
Here, 𝜓(·) denotes the digamma function,
the derivative of the log of the gamma function.
Finally, we can compute responsibility as
Chapter 10.2. Variational Mixture of Gaussians
17
Variational distribution
Now, let's review the overall procedure!
1. Compute the moments covered on the previous slide and evaluate the responsibilities (the 𝐸-step analogue).
2. Update each variational posterior factor with the newly calculated statistics (the 𝑀-step analogue).
3. Continue this iteration until convergence.
This variational Bayesian GMM has the great strength that it automatically
searches for the effective number of components.
The model sequentially abandons relatively weak components,
where a weak component is one that takes responsibility for relatively few
data points: its expected mixing coefficient is driven toward zero.
Note that as the number of data points (𝑁) increases, the solution approaches that of the
basic Gaussian mixture (the MLE solution).
Even so, this model retains its advantages: it avoids the collapsing-component singularities of maximum likelihood and is less prone to overfitting.
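If scikit-learn is available, its BayesianGaussianMixture class implements a variational Bayesian GMM of this kind and exhibits the component-pruning behaviour; a small illustrative sketch (the dataset and prior values below are made up for the example):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D clusters, but the model is given 10 components.
X = np.vstack([rng.normal(m, 0.5, size=(200, 2)) for m in ([0, 0], [5, 5], [0, 5])])

vb_gmm = BayesianGaussianMixture(
    n_components=10,                  # deliberately too many components
    weight_concentration_prior=1e-3,  # small concentration -> unused components shrink
    max_iter=500,
    random_state=0,
)
vb_gmm.fit(X)

print(np.round(vb_gmm.weights_, 3))   # most mixing coefficients collapse toward zero
print("final lower bound:", vb_gmm.lower_bound_)
```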
Chapter 10.2. Variational Mixture of Gaussians
18
Variational lower bound
Consider a typical deep-learning model: we print the training loss every few iterations.
Likewise, here we can print the lower bound to check whether the model is being trained properly,
because the lower bound L(q) should never decrease from one variational update to the next!
Here, 𝐻 indicates the entropy
of the Wishart distribution.
Chapter 10.2. Variational Mixture of Gaussians
19
Predictive distribution
Since this is a clustering algorithm, it should be able to predict the cluster membership of a newly observed data point.
Just like other Bayesian models, it marginalizes out the parameters to obtain the desired predictive distribution, which turns out to be a mixture of Student's t distributions!
I'll skip the derivation of the Student's t
distribution in this procedure.
The component 𝑘 that yields the greatest probability can be
chosen as the clustering result!
Chapter 10.2. Variational Mixture of Gaussians
20
Determining the number of components
Please note that the clustering result is not unique;
it depends on the initialization and on the values chosen for the priors!
We can choose the number of components by comparing the value of the lower bound L(q)
obtained for each candidate number of components.
This would be inappropriate for the basic GMM, since the likelihood of a GMM increases monotonically
as the number of components increases!
In the Bayesian approach, however, there is an intrinsic trade-off between model complexity
and fit, which acts as automatic regularization.
Thus, comparing 𝐿(𝑞) is reasonable for the Bayesian GMM.
Furthermore, as mentioned, the Bayesian GMM has the intrinsic ability to select the effective
number of components by driving unnecessary 𝜋𝑘 toward zero. This mechanism is called
automatic relevance determination.
Chapter 10.3. Variational Linear Regression
21
Fully Bayesian linear regression
Now we are studying variational approach of linear regression.
In the previous Bayesian linear models we did not place a distribution over the hyperparameter:
𝛼 in p(w|α) = N(w | 0, α⁻¹I) was a fixed constant, optimized by the evidence approximation.
Now, however, we treat 𝛼 as a random variable with its own distribution!
Here we give 𝛼 a Gamma distribution.
The joint distribution of all the variables can then be written as shown.
Now, let's find the variational distributions of 𝑤 and 𝛼. Using the basic factorization idea,
the approximation is q(w, α) = q(w) q(α), and the factors are obtained as shown.
Note that 𝑴 indicates the number of
basis functions!
Chapter 10.3. Variational Linear Regression
22
Fully Bayesian linear regression
Since the expression has the functional form of a Gamma density, the variational factor q(α) is a Gamma distribution!
This gives us 𝑞(𝛼). Similarly,
we can obtain the distribution of 𝑤.
Note that 𝑤 appears in a quadratic form, which indicates a
Gaussian distribution!
Note that the overall result differs from that of Chapter 3 only in that α is replaced by 𝑬[𝜶].
This expression becomes essentially the same as the EM / evidence-approximation result when we
set a0 = b0 = 0, which corresponds to having no prior
knowledge about the parameter 𝜶.
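A minimal runnable sketch of these coupled updates for q(w) = N(w | m_N, S_N) and q(α) = Gam(α | a_N, b_N), assuming synthetic cubic data, a polynomial basis, and a known noise precision β (all of these choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Polynomial basis regression with a Gamma hyperprior on the weight precision alpha.
N, M, beta = 30, 4, 25.0                      # data size, number of basis functions, known noise precision
x = rng.uniform(-1, 1, N)
t = x**3 + rng.normal(0, beta**-0.5, N)       # ground truth is a cubic
Phi = np.vander(x, M, increasing=True)        # design matrix: columns 1, x, x^2, x^3

a0 = b0 = 1e-6                                # broad Gamma prior on alpha
E_alpha = 1.0
for _ in range(50):
    # q(w) = N(w | mN, SN), using the current E[alpha]
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t
    # q(alpha) = Gamma(alpha | aN, bN), using E[w^T w] = mN^T mN + Tr(SN)
    aN = a0 + M / 2
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))
    E_alpha = aN / bN

print("posterior mean weights:", np.round(mN, 3))   # close to (0, 0, 0, 1)
print("E[alpha]:", E_alpha)
```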
Chapter 10.3. Variational Linear Regression
23
Predictive distribution & Lower bound of variational linear regression
The big idea is the same as in Chapter 3; only the details differ.
Here we simply use 𝐸[𝛼] in place of a fixed 𝛼.
Similarly, we can evaluate the lower bound to compare different models.
Figure: the lower bound 𝐿(𝑞) (y-axis) plotted against the order of the polynomial model (x-axis);
the ground truth is a cubic (𝑋3),
and 𝐿(𝑞) is maximized at 𝑴 = 𝟑.
Chapter 10.4. Exponential Family Distributions
24
Exponential Family Distributions
We have covered the exponential family in Mathematical Statistics II and in PRML Chapter 2.
Let's review it briefly.
Note that 𝒈(𝜼) is the normalization coefficient that makes the expression a proper probability distribution!
Why is it important?
1. Most of the distributions we commonly use belong to the exponential family.
2. Distributions in this family have conjugate priors.
3. They have sufficient statistics (and UMVU estimators).
4. The posterior can be expressed in closed form.
Actually, many of the models we have covered so far are exponential-family models,
but we treated them as individual special cases. Let's now study them under the general exponential-family framework.
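For reference, a restatement of the general form assumed in this section (cf. Bishop, Section 10.4): the joint distribution of data and latent variables is taken to be

p(X, Z | η) = ∏ₙ h(xₙ, zₙ) g(η) exp{ ηᵀ u(xₙ, zₙ) },

with a conjugate prior of the form

p(η | ν₀, χ₀) = f(ν₀, χ₀) g(η)^{ν₀} exp{ ν₀ ηᵀ χ₀ }.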
Chapter 10.4. Exponential Family Distributions
25
Variational distribution
Now we have to define the variational distribution over 𝑧 and 𝜂.
Here we take 𝒒(𝒁, 𝜼) = 𝒒(𝒁) 𝒒(𝜼). Applying the general update equation, we can rewrite each factor as shown.
The variational distribution over Z
decomposes across the individual data points, q(Z) = ∏ₙ q(zₙ).
E-step analogue : compute the expected sufficient statistics of the latent variables (filling in the unseen latent variables).
M-step analogue : update each of the distributions using those expectations.
Furthermore, this scheme can be written
as variational message passing!
Chapter 10.5. Local Variational Methods
26
Local approach of variational methods
Note that all the previous methods were a 'global' approach: we approximated the full posterior over all variables at once.
Now we turn to a local approach, which bounds an individual function or factor within the overall model.
The reason for doing this is to simplify the resulting computations!!
First, recall that a convex function is one for which every chord lies on or above the function.
Here we make use of max / min relationships; let's discuss them through an example.
Consider the function 𝑓(𝑥) = exp(−𝑥), drawn as the red line on the slide.
We are trying to approximate 𝑓(𝑥) with a simpler function.
Using a first-order Taylor expansion, we can approximate f(x) by 𝒚(𝒙) = 𝒇(𝝃) + 𝒇′(𝝃)(𝒙 − 𝝃).
This is the tangent line of 𝑓(𝑥) at the point (𝜉, 𝑓(𝜉)).
As you can see in the left-hand figure, every tangent line lies beneath the original (convex) function.
Thus the inequality y(x) ≤ f(x) holds, with equality at x = ξ.
The tangent line above can then be rewritten as
𝒚(𝒙) = 𝐞𝐱𝐩(−𝝃) − 𝐞𝐱𝐩(−𝝃)(𝒙 − 𝝃).
Chapter 10.5. Local Variational Methods
27
Local approach of variational methods
For generality, let's define 𝝀 = −𝒆𝒙𝒑(−𝝃). The tangent line can then be rewritten as
𝑦(𝑥, 𝜆) = 𝜆𝑥 − 𝜆 + 𝜆 ln(−𝜆).
Now fix 𝑥. Different values of λ give different tangent lines, each lying below f.
As in the right figure, the tangent that touches at that particular x gives the largest of these approximations, and that value equals
the original function value.
Thus 𝒇(𝒙) = max over 𝝀 of (𝝀𝒙 − 𝝀 + 𝝀 𝐥𝐧(−𝝀)).
Now, let's move on to a general convex function: we fix 𝝀 and vary 𝒙!
We do not assume any particular functional shape; we only require convexity.
As in the left figure, a line through the
origin with slope λ can be turned into a tangent of the function by
shifting it vertically.
The required shift defines the quantity 𝑔(𝜆):
it is the minimum vertical distance between the original
𝑓(𝑥) and the line λx. (After shifting by 𝑔(𝜆), the line just touches, 'kisses', 𝑓(𝑥).)
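In other words (a standard convex-duality statement, consistent with the construction above): the shift function is

g(λ) = max over x of { λx − f(x) },

and the original function is recovered from the family of lines by

f(x) = max over λ of { λx − g(λ) },

so f and g play symmetric (conjugate) roles.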
Chapter 10.5. Local Variational Methods
28
Local approach of variational methods
Thus, finding g(λ) can be written as the maximization shown on the slide.
Now, let's fix 𝒙 and vary 𝝀, just as we did in the first example; the function is recovered as the corresponding maximum.
For a concave function, which looks like the right-hand figure, the same machinery applies.
** An easy way to tell them apart: think of the word 'cave' and its shape!
Back to the point: for concave functions we apply the same equations with the sides and signs exchanged,
that is, max becomes min and min becomes max (tangent lines now give upper bounds).
Actually, there is no need to memorize this; just draw a simple function and think it through!
Chapter 10.5. Local Variational Methods
29
Logistic sigmoid example
Upper bound of logistic.
Consider the logistic sigmoid function, defined as σ(x) = 1 / (1 + e⁻ˣ).
Note that it is neither concave nor convex.
However, if we take its logarithm, the resulting function ln σ(x) is concave. Therefore we can apply the conjugate construction;
the resulting conjugate function g(λ) is the binary entropy function, the same form that appears in the cross-entropy of a binary classification model!!
We then obtain the upper bound
𝐥𝐧 𝝈(𝒙) ≤ 𝝀𝒙 − 𝒈(𝝀), that is, 𝝈(𝒙) ≤ 𝒆𝒙𝒑(𝝀𝒙 − 𝒈(𝝀)).
That is, we can find an upper bound on the logistic function! (The reason why we want upper/lower
bounds will be covered soon!)
Lower bound of logistic.
Deriving a lower bound is a bit more complicated. Let's first rewrite ln 𝜎(𝑥) as follows.
Chapter 10.5. Local Variational Methods
30
Logistic sigmoid example
Note that ln σ(x) = −ln(1 + e⁻ˣ) = x/2 − ln(e^{x/2} + e^{−x/2}).
The x/2 term is simply linear in x, and the remaining part
𝑓(𝑥) = −ln(e^{x/2} + e^{−x/2})
turns out to be a convex function of the variable 𝑥². (Let's skip the proof.)
Using the general idea of convex functions and the max/min relation, we can therefore bound f from below by its tangent in the variable x².
Denote the tangency ('kissing') point by 𝜉; the slope of this tangent is the derivative of f with respect to x² evaluated at ξ, which defines the variational parameter λ(ξ), and rewriting gives the tangent bound.
Finally, exponentiating both sides, we can derive the lower bound on the logistic sigmoid shown on the slide.
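The resulting bound is the standard Jaakkola-Jordan form, σ(x) ≥ σ(ξ) exp{(x − ξ)/2 − λ(ξ)(x² − ξ²)} with λ(ξ) = (σ(ξ) − 1/2)/(2ξ); a small sketch that checks it numerically (the test values of ξ are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigmoid(xi) - 1/2) / (2*xi), the variational slope in x^2
    return (sigmoid(xi) - 0.5) / (2.0 * xi)

def sigmoid_lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp((x - xi)/2 - lambda(xi) * (x^2 - xi^2))
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))

x = np.linspace(-8, 8, 1001)
for xi in (0.5, 2.0, 5.0):
    bound = sigmoid_lower_bound(x, xi)
    assert np.all(bound <= sigmoid(x) + 1e-12)           # it really is a lower bound
    # tangency at x = +/- xi, since the bound depends on x only through x^2
    assert np.isclose(sigmoid_lower_bound(xi, xi), sigmoid(xi))
    assert np.isclose(sigmoid_lower_bound(-xi, xi), sigmoid(-xi))
print("bound verified for all test values of xi")
```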
Chapter 10.5. Local Variational Methods
31
Logistic sigmoid example
Actually, this is the main point of this sub-section.
Why are we doing all this complicated work???
Because it can make an intractable integration possible.
Consider the Bayesian predictive distribution, an integral of σ(a) against p(a)!
Note that 𝑝(𝑎) is a Gaussian distribution,
yet the integral has no closed form. So what if we replace 𝜎(𝑎) with its lower bound,
𝝈(𝒂) ≥ 𝒇(𝒂, 𝝃)?
Then the integrand becomes a Gaussian times the exponential of a quadratic in a, which can be integrated analytically, giving a lower bound 𝑭(𝝃) on the integral.
We then choose 𝜉 to make 𝑭(𝝃) as large (tight) as possible.
Note that the best choice of 𝝃 depends on the value of 𝒂 (and hence on the data).
Let's see how this idea is used in the following section!
Chapter 10.6. Variational Logistic Regression
32
Variational approximation of logistic
Recall that it was essentially impossible to find a closed form for the posterior / predictive distribution in logistic regression.
Previously we used the Laplace approximation to compute such quantities.
Now we are going to approximate them with the variational method!
As in the previous section, we work with a lower bound; more precisely, we maximize the lower bound.
By the i.i.d. assumption, we can express the likelihood as a product over data points,
where 𝒂 = 𝒘ᵀ𝝓.
Here, please remember how we derived the lower bound of the logistic sigmoid!
Chapter 10.6. Variational Logistic Regression
33
Variational approximation of logistic
1. Note that 𝑎 (the input of the logistic sigmoid) was 𝑤ᵀ𝜙.
2. Under the variational setting, we assume the data points are independent and we introduce a separate variational parameter 𝝃𝒏 for every data point! Thus,
for computational convenience,
we take the logarithm of each factor.
Here, we are trying to obtain 𝒒(𝒘). From this, we can derive…
Chapter 10.6. Variational Logistic Regression
34
Variational approximation of logistic
Here, we treat 𝜉 as a fixed constant. Thus, we can collect the right-hand side as a function of 𝑤.
Note that 𝐰 appears in a quadratic form! Thus 𝒒(𝒘) is a Gaussian distribution!!
There is one interesting aspect of this result:
because the posterior precision and the mean term are sums over data points, we can also compute them in a sequential (online) manner, adding one data point's contribution at a time.
Likewise, 𝑚𝑁 can be accumulated in the same way!
Chapter 10.6. Variational Logistic Regression
35
Optimizing the variational parameters
We use an EM-style algorithm to optimize 𝑚𝑁, 𝑆𝑁 and 𝜉.
It is helpful to think of 𝑤 as a latent variable! Let's briefly recall EM.
(Note that the expectation in the Q function
is taken with respect to 𝒘.
Terms that do not depend on 𝝃 can be dropped,
and the 𝟎.𝟓 𝒘ᵀ𝝓 terms cancel
against the corresponding terms of the bound.)
In order to find the new 𝜉, we
take the derivative of this
𝑄 with respect to 𝜉 and set it to zero.
Chapter 10.6. Variational Logistic Regression
36
New variational parameter 𝝃𝒏
Let's briefly recap EM for variational logistic regression:
1. Initialize 𝜉𝑛 (old values).
2. Evaluate the posterior distribution q(𝑤) for the current 𝜉 (E-step analogue).
3. Compute the new 𝜉𝑛 from q(𝑤) (M-step analogue), and repeat until convergence.
Alternatively, we can compute 𝐿(𝑞) directly and set its derivative with respect to 𝜉𝑛 to zero, which yields the same update.
For the predictive distribution (which is the most crucial quantity in logistic regression),
we can reuse the same techniques as in Chapter 4 after this point.
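A minimal runnable sketch of this whole loop on synthetic data, assuming a Gaussian prior N(m₀, S₀) on w; the q(w) update uses S_N⁻¹ = S₀⁻¹ + 2 Σₙ λ(ξₙ) φₙφₙᵀ and m_N = S_N[S₀⁻¹m₀ + Σₙ(tₙ − 1/2)φₙ], and the ξ update uses ξₙ² = φₙᵀ(S_N + m_N m_Nᵀ)φₙ (the standard variational logistic regression updates):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lam(xi):
    # lambda(xi) = (sigma(xi) - 1/2) / (2*xi); the xi -> 0 limit is 1/8
    xi = np.asarray(xi, dtype=float)
    return np.where(np.abs(xi) > 1e-8, (sigmoid(xi) - 0.5) / (2.0 * xi), 0.125)

rng = np.random.default_rng(0)
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])   # design matrix with bias column
w_true = np.array([-0.5, 2.0, -1.5])
t = (rng.uniform(size=N) < sigmoid(Phi @ w_true)).astype(float)    # binary targets

m0, S0 = np.zeros(M), 10.0 * np.eye(M)       # Gaussian prior on w (illustrative values)
S0_inv = np.linalg.inv(S0)

xi = np.ones(N)                               # initialize xi_n (old values)
for _ in range(30):
    # E-step analogue: q(w) = N(w | mN, SN) for the current xi
    SN_inv = S0_inv + 2.0 * (Phi.T * lam(xi)) @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + Phi.T @ (t - 0.5))
    # M-step analogue: xi_n^2 = phi_n^T (SN + mN mN^T) phi_n
    A = SN + np.outer(mN, mN)
    xi = np.sqrt(np.einsum("nd,de,ne->n", Phi, A, Phi))

print("posterior mean of w:", np.round(mN, 2), " (true:", w_true, ")")
```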
Chapter 10.6. Variational Logistic Regression
37
Inference of hyper parameters
Previously there was a prior covariance 𝑆0 that we assumed to be known; now we treat the corresponding precision as a random variable.
This is again a fully Bayesian treatment! Consider the prior
𝑝(𝑤|𝛼) = 𝑁(𝑤|0, 𝛼⁻¹𝐼).
As usual, we give 𝛼 a conjugate Gamma distribution,
𝑝(𝛼) = 𝐺𝑎𝑚𝑚𝑎(𝛼|𝑎0, 𝑏0).
The marginal likelihood and the joint distribution are then given as shown.
As we have done many times, we introduce a variational distribution 𝒒(𝒘, 𝜶).
To proceed, we again write down the lower bound on the marginal likelihood,
but note that this 𝐿(𝑞) is still intractable
because of the sigmoid factors in 𝑝(𝑡|𝑤).
Thus, we additionally apply the local lower bound here.
Chapter 10.6. Variational Logistic Regression
38
Inference of hyper parameters
Using this fact,
let's compute the factors of 𝒒(𝒘, 𝜶) = 𝒒(𝒘)𝒒(𝜶) individually!
We can compute the lower bound as follows.
Why Normal?
Because 𝒘 appears in a quadratic
form inside the logarithm!
Why Gamma?
Because 𝜶 appears linearly (together with a ln α term)
inside the logarithm!
Chapter 10.6. Variational Logistic Regression
39
Inference of hyper parameters
Furthermore, we also have to update 𝜉 for the full model; the required expectations can be computed as shown.
Let's review the variational approach once more.
1. Sometimes it is hard to compute the posterior, or to integrate certain functions.
2. We therefore assume a variational (factorized) structure over the latent variables and parameters.
3. It can be broken into parts, e.g. 𝑞(𝑤, 𝛼) = 𝑞(𝑤)𝑞(𝛼).
4. The reason for this approximation is that the posterior (or the required integral) is intractable under the full joint distribution.
5. This is especially helpful in fully Bayesian models, which also place distributions on the hyper-priors.
6. Note that the optimization of such models can be carried out with an EM-style algorithm.
Let’s skip expectation propagation. We’ll cover it if we have time… (End of summer break is coming… Last semester is about to begin…!)
