2. Framework - Bayesian Inference
• The inputs:
1. Evidence (E): a sample of length n (numbers, categories, vectors, images)
2. Hypothesis (H): an assumption about the probabilistic structure that generates the sample
Posterior = P(H|E)
Objective: update our information about the Hypothesis using the Evidence
3. Bayesian Inference - into formulas
• Estimating the hypothesis given the evidence:
• Z, X are random variables
We wish to have P(Z|X).
• Bayes formula:
$P(Z|X) = \frac{P(Z,X)}{P(X)}$
Bayesian inference is therefore about working with the RHS terms.
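A small numerical illustration of the formula above. The joint table is made up (two values of Z, three values of X) and chosen only so that it sums to 1. The posterior is simply the joint divided by the evidence P(X).

```python
import numpy as np

# Hypothetical joint table P(Z, X): rows index Z (2 values), columns index X (3 values).
joint = np.array([[0.10, 0.25, 0.15],
                  [0.30, 0.05, 0.15]])

evidence = joint.sum(axis=0)    # P(X) = sum over z of P(Z=z, X)
posterior = joint / evidence    # P(Z|X): each column divided by P(X=x)

print(posterior)                # every column sums to 1
```

For a table this small P(X) is a trivial sum. The difficulty discussed next appears when that sum (or integral) runs over a huge hidden space.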
4. RHS terms
• Hidden Variables (Hypothesis) (Z) - the variables of the mechanism that generates the sample
(e.g. the topic distribution in a corpus, or the Gaussians in a GMM)
1. Their values are not given
2. We do have the joint distribution P(Z,X)!
• Observed Data (Evidence) (X) - the sample that we actually have
1. We know every value
2. We may know nothing about its distribution
5. RHS terms (Cont.)
• In some case studies P(X) is intractable or extremely difficult to calculate.
• We then cannot obtain the conditional distribution P(Z|X) from the terms of Bayes formula.
• Variational inference offers a class of algorithms to solve this problem: approximating the posterior when P(X) is difficult.
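6. Examples:
GMM (known number of Gaussians and known variance)
We have K Gaussians
For each k = 1…K draw $\mu_k \sim N(0, \tau)$ (τ is positive)
For each sample j = 1…n
$z_j \sim \mathrm{Cat}(1/K, 1/K, \dots, 1/K)$
$x_j \sim N(\mu_{z_j}, \sigma)$
$p(x_{1:n}) = \int_{\mu_{1:K}} \prod_{l=1}^{K} P(\mu_l) \prod_{j=1}^{n} \sum_{z_j} p(z_j)\, P(x_j | \mu_{z_j})\, d\mu_{1:K}$
=> Pretty nasty! Expanding the product of sums gives $K^n$ terms.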
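The generative process above can be sketched in a few lines. This is only an illustration: the function name and default values are hypothetical, and the second argument of N(·,·) is read as a variance, which the slide does not specify.

```python
import numpy as np

def sample_gmm(n, K, tau=10.0, sigma=1.0, seed=0):
    """Draw one data set from the GMM described above (tau, sigma read as variances)."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(tau), size=K)   # mu_k ~ N(0, tau)
    z = rng.integers(0, K, size=n)               # z_j ~ Cat(1/K, ..., 1/K)
    x = rng.normal(mu[z], np.sqrt(sigma))        # x_j ~ N(mu_{z_j}, sigma)
    return mu, z, x

mu, z, x = sample_gmm(n=500, K=3)
print(mu, x[:5])
```

Sampling forward is easy. The hard direction is going back from x to (μ, z).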
7. Examples - LDA
Corpus D; each document has length N
N ∼ Poisson(ξ)
θ ∼ Dir(α)
β - topics (array over words, fixed or Dirichlet)
For each of the N words $w_n$:
choose a topic $z_n \sim \mathrm{Cat}(\theta)$
$w_n \sim p(w_n | z_n, \beta)$
$\beta_{ij} = P(w_i | z_j)$
$p(w | \alpha, \beta) = \int P(\theta | \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n | \theta)\, p(w_n | z_n, \beta)\, d\theta$
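A rough sketch of the LDA generative process for a single document, with β treated as fixed. The function name, vocabulary size and parameter values are assumptions made for the illustration. β is stored as a (words × topics) array so that beta[i, j] = P(w_i | z_j).

```python
import numpy as np

def sample_lda_document(alpha, beta, xi=20.0, seed=0):
    """Sample one document: topic mixture theta, topic per word z, word ids w."""
    rng = np.random.default_rng(seed)
    V, K = beta.shape                      # V words in the vocabulary, K topics
    N = rng.poisson(xi)                    # document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)           # theta ~ Dir(alpha)
    z = rng.choice(K, size=N, p=theta)     # z_n ~ Cat(theta)
    # w_n ~ p(w_n | z_n, beta): draw a word from the column of the chosen topic
    w = np.array([rng.choice(V, p=beta[:, k]) for k in z], dtype=int)
    return theta, z, w

# Hypothetical fixed topics: 3 topics over a 6-word vocabulary (columns sum to 1).
beta = np.random.default_rng(1).dirichlet(np.ones(6), size=3).T
theta, z, w = sample_lda_document(alpha=np.ones(3), beta=beta)
print(theta, w)
```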
8. Sampling
• The common solution for estimating distributions is sampling:
1. MCMC
• Metropolis-Hastings
• Gibbs
2. RBM (mostly by Gibbs sampling)
3. Hybrid Monte Carlo
• Today we won't talk about these methods!
9. Sampling Vs. Analysis
• Sampling
1. The solutions are asymptotically exact
2. Numerically expensive
• Deterministic
1. Solutions are cheaper
2. Less accurate
3. Non-conjugate models are a problem
4. An optimization process
10. Sampling Vs. Analysis cont.
• MCMC methods are good for small data where accuracy is essential
• When we have big data and many models should be tested, VI methods have an advantage
11. Can we do something analytically?
• Can we analytically approximate the posterior?
• Can we find a distribution that is close to the posterior, and estimate the distance between them well?
• When the framework is a vector space:
1. Calculus allows us to find extrema easily
2. We are endowed with $L_p$ metrics (typically p = 1, 2)
• Our domain is a space of functions and the functionals on it
We need:
1. An analytical method to find the extrema of a functional
2. A nice metric
12. Calculus of Variations
• Consider the following:
F, y functions (with all the “extras”)
$J(y) = \int F(y, y', t)\, dt$ (y is differentiable)
If y is an extremum of J, it satisfies the Euler-Lagrange equation:
$\frac{\partial F}{\partial y} - \frac{d}{dt}\left(\frac{\partial F}{\partial y'}\right) = 0$
• Example: maximum entropy principle
Generally speaking, this domain is a calculus for function spaces, hence it is useful for optimization.
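The maximum entropy example can be worked out as a short sketch. The setup is an assumption for the illustration: a density p(x) on the real line with fixed mean μ and variance σ², with the constraints attached through Lagrange multipliers λ0, λ1, λ2. Here x plays the role of t and p the role of y.

```latex
% Entropy functional plus normalization, mean and variance constraints
J(p) = \int \Big[ -p \ln p + \lambda_0\, p + \lambda_1\, x\, p + \lambda_2\, (x-\mu)^2\, p \Big]\, dx

% The integrand F does not depend on p', so Euler-Lagrange reduces to dF/dp = 0
\frac{\partial F}{\partial p} = -\ln p - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0
\;\Longrightarrow\;
p(x) = \exp\big(\lambda_0 - 1 + \lambda_1 x + \lambda_2 (x-\mu)^2\big)

% Imposing the three constraints gives \lambda_1 = 0 and \lambda_2 = -1/(2\sigma^2)
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
```

So among all densities with a given mean and variance, the Gaussian has maximal entropy.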
13. Calculus of Var. Cont.
The fundamental lemma of the calculus of variations:
If M is continuous on (a,b), and for all differentiable h
$\int_a^b M(x)\, h(x)\, dx = 0$, then $M \equiv 0$ on (a,b) (Cybenko)
14. KL (Kullback-Leibler) Divergence
• A divergence between distributions (not a true metric): $KL(p\|q) = \sum_x p(x) \log\frac{p(x)}{q(x)}$
* “On Information and Sufficiency”, 1951 (Ann. Math. Statist.)
Properties:
1. Non-symmetric (it actually measures a relative distance: which distribution P sees as the closest)
2. Non-negative: 0 is obtained only for KL(p||p) (proof by concavity of log and Jensen's inequality, or by Lagrange multipliers)
3. The distance between P(x,y) and p(x)p(y) is 0 if and only if X and Y are independent
Usage:
Cross Entropy: H(p,q) = H(p) + KL(p||q)
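The properties above are easy to check numerically for discrete distributions. A minimal sketch, where the two distributions p and q are arbitrary made-up examples:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions (terms with p(x) = 0 contribute 0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p[p > 0] * np.log(p[p > 0])))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p[p > 0] * np.log(q[p > 0])))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl(p, q), kl(q, p))                          # two different values: not symmetric
print(kl(p, p))                                    # 0.0
print(cross_entropy(p, q), entropy(p) + kl(p, q))  # the two numbers agree
```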
15. PMI (Pointwise Mutual Information)
• Let X, Y be random variables
• $PMI(X=a, Y=b) = \log\frac{P(X=a, Y=b)}{P(X=a)\,P(Y=b)}$
• $KL(P(X|Y=a)\,\|\,P(X)) = \sum_x P(X=x|Y=a)\, PMI(X=x, Y=a)$
• What does this term mean?
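One way to get a feel for this term is to compute both sides on a small table. A sketch with a made-up joint distribution (rows index Y, columns index X):

```python
import numpy as np

# Hypothetical joint table P(Y, X): rows index Y (2 values), columns index X (3 values).
joint = np.array([[0.10, 0.25, 0.15],
                  [0.30, 0.05, 0.15]])
p_y = joint.sum(axis=1)
p_x = joint.sum(axis=0)

pmi = np.log(joint / np.outer(p_y, p_x))   # pmi[a, x] = PMI(X=x, Y=a)

a = 0
p_x_given_a = joint[a] / p_y[a]            # P(X | Y=a)
lhs = np.sum(p_x_given_a * np.log(p_x_given_a / p_x))   # KL(P(X|Y=a) || P(X))
rhs = np.sum(p_x_given_a * pmi[a])                      # expected PMI under P(X|Y=a)
print(lhs, rhs)                            # the two numbers agree
```

So the KL term is the average PMI between X and the observed value Y=a: how much, on average, seeing Y=a moves our belief about X. Averaging it again over a with weights P(Y=a) gives the mutual information I(X;Y).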
16. ELBO - Evidence Lower Bound
Consider now P(X), the Evidence.
We have:
$\log P(X) \ge E_Q[\log P(X,Z)] - E_Q[\log Q(Z)]$
The RHS is called the ELBO, and it is a lower bound on the LHS.
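The bound follows from Jensen's inequality in a few lines. A sketch, where Q is any distribution over Z that is positive wherever P(X,Z) is:

```latex
\log P(X) = \log \int P(X,Z)\, dZ
          = \log \int Q(Z)\, \frac{P(X,Z)}{Q(Z)}\, dZ
          = \log E_Q\!\left[ \frac{P(X,Z)}{Q(Z)} \right]
          \ge E_Q\!\left[ \log \frac{P(X,Z)}{Q(Z)} \right]   % Jensen: log is concave
          = E_Q[\log P(X,Z)] - E_Q[\log Q(Z)]
```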
17. Back to KL
• Having the requested analytical tools, we can approximate the posterior: find Q s.t. Q(Z) ≈ P(Z|X):
• $\min_Q KL(Q(Z)\,\|\,P(Z|X))$
$KL(Q\,\|\,P(Z|X)) = \log P(X) - ELBO$
$\Rightarrow \log P(X) = KL(Q\,\|\,P(Z|X)) + ELBO$
P is fixed (log P(X) does not depend on Q), hence maximizing the ELBO is equivalent to minimizing the KL.
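The decomposition can be verified numerically on a toy case. In this sketch Z takes three values, the numbers P(X=x_obs, Z=z) are invented, and Q is an arbitrary candidate distribution:

```python
import numpy as np

# Hypothetical values of the joint P(X = x_obs, Z = z) for three values of z.
joint_xz = np.array([0.05, 0.15, 0.10])

log_evidence = np.log(joint_xz.sum())    # log P(X)
posterior = joint_xz / joint_xz.sum()    # P(Z | X)

def elbo(q):
    """E_Q[log P(X, Z)] - E_Q[log Q(Z)] for a discrete Q."""
    return float(np.sum(q * (np.log(joint_xz) - np.log(q))))

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.2, 0.5, 0.3])            # an arbitrary candidate Q(Z)

print(log_evidence, kl(q, posterior) + elbo(q))   # equal
print(elbo(q) <= log_evidence)                    # True: the ELBO is a lower bound
print(elbo(posterior), log_evidence)              # equal: the bound is tight at Q = P(Z|X)
```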
18. Let’s use Calculus!
• We wish to optimize the ELBO term.
We can define a functional:
$ELBO = E_Q[\log P(X,Z)] - E_Q[\log Q(Z)] = \int Q(Z) \log\frac{P(X,Z)}{Q(Z)}\, dZ = J(Q)$
We can go to Euler-Lagrange here, but let's try to simplify Q first!
19. Mean Field Theory - MFT
• The main idea is solving a many-body problem (e.g. the Ising model)
Assume a system of many bodies (atoms, other particles)
1. For each body, replace its interactions with the other particles by their average.
2. Assume no correlations between interacting bodies
We will use assumption 2 to simplify Q:
$Q(z) = \prod_{i=1}^{n} q_i(z_i)$ (obviously not true in general)
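20. MFT - cont.
• We can now use Euler-Lagrange with the constraint
$\int q_i(z_i)\, dz_i = 1$
• We obtain
$\log q_i(z_i) = \mathrm{const} + E_{-i}[\log p(x, z)]$ (a Boltzmann distribution!)
Did we win? No!
Note that each $q_i$ depends on the other $q_j$'s, so changing one changes the others.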
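A sketch of where the update for a single factor comes from. All other factors are held fixed, λ is a Lagrange multiplier for the normalization constraint, and $E_{-i}$ denotes the expectation over all factors except $q_i$:

```latex
% ELBO as a functional of q_i alone, plus the normalization constraint
J(q_i) = \int q_i(z_i)\, E_{-i}[\log p(x,z)]\, dz_i
       - \int q_i(z_i) \log q_i(z_i)\, dz_i
       + \lambda \Big( \int q_i(z_i)\, dz_i - 1 \Big) + \mathrm{const}

% No derivative of q_i appears, so Euler-Lagrange reduces to a pointwise condition
E_{-i}[\log p(x,z)] - \log q_i(z_i) - 1 + \lambda = 0
\;\Longrightarrow\;
q_i(z_i) \propto \exp\big( E_{-i}[\log p(x,z)] \big)
```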
21. Coordinate Ascent Variational Inference (CAVI)
• An iterative algorithm
1. Construct a model P(X,Z)
2. Sequentially set each $\log q_i$ to $E_{-i}[\log p(x, z)]$ + constant
3. As always, we repeat until the q's converge
(Wikipedia,Blei) https://www.youtube.com/watch?v=uKxtmkfeuxg
“Message passing” – Winn & Bishop
Minka 2005, Knowles & Minka 2011
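The CAVI loop can be sketched for the GMM example from earlier. The simplifications here are assumptions for the illustration: one-dimensional data, observation variance fixed to 1, a N(0, prior_var) prior on each mean, Gaussian factors q(μ_k) = N(m_k, s2_k), categorical factors q(z_j) = φ_j, and a quantile-based initialization. The function name is hypothetical.

```python
import numpy as np

def cavi_gmm(x, K, prior_var=10.0, n_iter=100):
    """CAVI for a 1-D mixture of K unit-variance Gaussians with uniform mixing weights."""
    # Deterministic initialization of the variational means (an illustrative choice).
    m = np.quantile(x, (np.arange(K) + 0.5) / K)
    s2 = np.ones(K)
    for _ in range(n_iter):
        # Update q(z_j): log phi_jk = x_j E[mu_k] - E[mu_k^2] / 2 + const
        logits = np.outer(x, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update q(mu_k): Gaussian with precision 1/prior_var + sum_j phi_jk
        precision = 1.0 / prior_var + phi.sum(axis=0)
        m = (phi * x[:, None]).sum(axis=0) / precision
        s2 = 1.0 / precision
    return m, s2, phi

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
m, s2, phi = cavi_gmm(x, K=2)
print(np.sort(m))   # close to the two true centers
```

A fixed number of iterations stands in for a convergence check. In practice one would monitor the ELBO and stop when it no longer improves.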
23. Gaussian Cont.
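• $P(X|\tau, \mu) = \prod_{i=1}^{n} N(x_i | \mu, \tau^{-1})$
• $P(\mu|\tau) = N(\mu | \mu_0, (\lambda_0 \tau)^{-1})$
• $P(\tau) = \mathrm{Gamma}(\tau | a_0, b_0)$
• MFT implies:
$q(\mu, \tau) = q(\mu)\, q(\tau)$ (not that accurate in this case!)
Using the ELBO formula:
$\ln q(\mu) = E_\tau[\ln P(X|\tau,\mu) + \ln P(\mu|\tau) + \ln P(\tau)] + C$
$\ln q(\tau) = E_\mu[\ln P(X|\tau,\mu) + \ln P(\mu|\tau) + \ln P(\tau)] + C$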
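Working the two expectations out gives q(μ) = N(μ_N, 1/λ_N) and q(τ) = Gamma(a_N, b_N), where λ_N depends on E[τ] and b_N depends on the moments of μ. A sketch of the resulting iteration, assuming τ is a precision and Gamma is in shape/rate form; the function name and default hyperparameters are hypothetical:

```python
import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, n_iter=50):
    """Mean-field updates for q(mu) = N(mu_n, 1/lam_n) and q(tau) = Gamma(a_n, b_n)."""
    n = x.size
    mu_n = (lam0 * mu0 + n * x.mean()) / (lam0 + n)   # does not change across iterations
    a_n = a0 + (n + 1) / 2.0                          # does not change across iterations
    e_tau = a0 / b0                                   # initial guess for E[tau]
    for _ in range(n_iter):
        lam_n = (lam0 + n) * e_tau                    # uses the current E[tau]
        e_mu2 = mu_n**2 + 1.0 / lam_n                 # E[mu^2] under q(mu)
        # b_n = b0 + 0.5 * E_mu[ sum_i (x_i - mu)^2 + lam0 * (mu - mu0)^2 ]
        b_n = b0 + 0.5 * (np.sum(x**2) - 2 * mu_n * np.sum(x) + n * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * mu_n + mu0**2))
        e_tau = a_n / b_n                             # E[tau] under q(tau)
    return mu_n, lam_n, a_n, b_n

rng = np.random.default_rng(0)
x = rng.normal(1.0, 0.5, size=2000)    # true mean 1, true precision 4
mu_n, lam_n, a_n, b_n = cavi_normal_gamma(x)
print(mu_n, a_n / b_n)                 # approximate posterior mean of mu and of tau
```

Only λ_N and b_N are coupled, which is why the loop converges within a handful of iterations.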
24. Stochastic - VI
• CAVI does not work well for big data (an update for every item)
• Stochastic VI: rather than updating the q's directly, we calculate the gradient of the ELBO and optimize its parameters (similar to EM)
• Used in LDA applications (David Blei et al.)
• http://www.columbia.edu/~jwp2128/Papers/HoffmanBleiWangPaisley2013.pdf
• https://www.cs.princeton.edu/courses/archive/fall11/cos597C/reading/Blei2011.pdf
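A rough sketch of the stochastic idea on the same one-dimensional, unit-variance GMM. Instead of a plain gradient it uses the natural-gradient form from the Hoffman, Blei, Wang and Paisley paper: each step looks at a minibatch, pretends the minibatch is the whole data set (rescaled by n / batch_size), and blends the result into the current parameters with a decaying step size. The function name, step-size schedule and defaults are assumptions made for the illustration.

```python
import numpy as np

def svi_gmm(x, K, prior_var=10.0, batch_size=50, n_steps=300, kappa=0.7, seed=0):
    """Stochastic VI for a 1-D unit-variance GMM; q(mu_k) is kept as (precision, precision * mean)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    prec = np.ones(K)                                       # precision of q(mu_k)
    h = prec * np.quantile(x, (np.arange(K) + 0.5) / K)     # precision * mean of q(mu_k)
    for t in range(n_steps):
        batch = x[rng.integers(0, n, size=batch_size)]      # sample a minibatch
        m, s2 = h / prec, 1.0 / prec
        # Local step: responsibilities for the minibatch only
        logits = np.outer(batch, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Intermediate global parameters, as if the minibatch were the full data set
        scale = n / batch_size
        prec_hat = 1.0 / prior_var + scale * phi.sum(axis=0)
        h_hat = scale * (phi * batch[:, None]).sum(axis=0)
        # Noisy natural-gradient step with a decaying step size
        rho = (t + 1.0) ** (-kappa)
        prec = (1 - rho) * prec + rho * prec_hat
        h = (1 - rho) * h + rho * h_hat
    return h / prec, 1.0 / prec

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
m, s2 = svi_gmm(x, K=2)
print(np.sort(m))   # close to the two true centers
```

Each step costs O(batch_size · K) regardless of n, which is the point of the method for large corpora.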
25. Appendix - Ising Model
• Ferromagnetism (Pierre Weiss)
• Ising Model - Lenz & Ernst Ising
• We have a Hamiltonian
$H(\sigma) = -h \sum_x \sigma_x - j \sum_{\langle x,y \rangle} \sigma_x \sigma_y$
($\sigma_x$ is the spin of a site (atom); x, y are nearest neighbors, hence the sum is over adjacent spins; h is the magnetic field and j is the “coupling constant”)
• Consider the contribution of a single atom (spin):
$\xi(\sigma_x) = -h \sigma_x - j \sigma_x \sum_y \sigma_y$
(y runs over the spins near x)
26. Ising Model (cont.)
• Now we replace the second summation by its mean:
$\xi(\sigma_x) = -h \sigma_x - j \sigma_x \sum_y \langle \sigma_y \rangle$. With $h_0 = h + j \sum_y \langle \sigma_y \rangle$ we obtain
$\xi(\sigma_x) = -h_0 \sigma_x$
• Note that if we use this approximation for the entire system, we have:
$E_{mf} = E_0 - h_0 \sum_x \sigma_x$
The solution is a single-spin Boltzmann distribution:
$P(s_i) = \frac{e^{a s_i}}{e^{a s_i} + e^{-a s_i}}$
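Under the single-spin distribution above the mean spin is tanh(a), and the mean field itself depends on that mean spin, so the two have to be made consistent. A sketch of the resulting fixed-point iteration, with assumptions that are not in the slides: spins take values ±1, a = β·h_0 with inverse temperature β, and every site has z neighbors with the same mean spin m. The function name is hypothetical.

```python
import numpy as np

def mean_field_magnetization(beta, j, z, h=0.0, m0=0.5, n_iter=500):
    """Iterate m = tanh(beta * (h + j * z * m)) starting from m0."""
    m = m0
    for _ in range(n_iter):
        a = beta * (h + j * z * m)   # a = beta * h_0, the effective field seen by one spin
        m = np.tanh(a)               # <s> under P(s) = e^{a s} / (e^{a s} + e^{-a s})
    return float(m)

print(mean_field_magnetization(beta=1.0, j=0.5, z=4))     # beta*j*z = 2: ordered, m close to 1
print(mean_field_magnetization(beta=1.0, j=0.125, z=4))   # beta*j*z = 0.5: disordered, m goes to 0
```

This is the same pattern as CAVI: each factor is updated given the current averages of the others, and the procedure is repeated until nothing changes.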
27. Remarks
1. Maxwell speeds - the use of independence for “achieving” the normal distribution
2. RBM
3. Conditional Random Field (CRF)
4. Cybenko, G. (1989) “Approximation by superpositions of a sigmoidal function”
5. Kullback & Leibler, “On Information and Sufficiency”
6. David Blei - Latent Dirichlet Allocation (and the rest of his papers)
7. Expectation-maximization algorithm (EM, Baum-Welch)
29. VI - Other Languages
• R - https://artax.karlin.mff.cuni.cz/r-help/library/varbvs/html/00Index.html
• R - https://cran.r-project.org/web/packages/varbvs/varbvs.pdf
• R - https://github.com/kieranrcampbell/clvm (claims to implement CAVI)
• Blog on Scala - http://alexminnaar.com/online-latent-dirichlet-allocation-the-best-option-for-topic-modeling-with-large-data-sets.html
• Spark MLlib - https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala