Deep Variational Inference for q with Gamma Distribution
Author: Natan Katz
Bayesian Inference
In the world of ML, Bayesian inference is often treated as the peculiar, enigmatic uncle. On one hand, Bayesian inference offers massive exposure to theoretical scientific tools from mathematics, statistics, and physics. Furthermore, it carries an impressive historical legacy of scientific breakthroughs. On the other hand, as nearly every data scientist will tell you: it hardly ever works.
This post comprises two parts:
1. A brief survey of the emergence of variational inference (VI)
2. A description of my attempts to solve a VI problem using DL techniques for data with a Beta likelihood
The Bayesian problem
The Bayesian problem can be described as follows: we have observed data X = {x_1, x_2, x_3, ..., x_N}, where the data can be numbers, categories, or any other type of data one can imagine. We assume that this data has been generated using a latent variable Z. We have four distributions:
1. P(Z) – The prior distribution of the latent variable
2. P(X|Z) – The likelihood. The distribution type of this function is determined by the data (for example: if we have integers we may think of Poisson; if we have positive numbers we may use Gamma).
3. P(X) – The distribution of the observed data.
4. P(Z|X) – The posterior distribution: the probability of a value of Z given a value of X.
These distributions are bound together by Bayes' formula:

P(Z|X) = P(X|Z) P(Z) / P(X)
The Bayesian problem is therefore about finding the posterior distribution and the values of Z. The obstacle is that in real-life models, P(X) is seldom tractable.
Until 1999, the modus operandi for solving Bayesian problems was sampling. Algorithms such as Metropolis-Hastings or Gibbs sampling were used successfully for resolving posterior functions across wide scientific domains. Although they worked well, they suffered from two significant problems:
• High variance
• Slow convergence
In 1999 Michael Jordan published the paper "An Introduction to Variational Methods for Graphical Models" (this is the year that the more famous MJ retired, which raises the question of whether the universe is governed by an "MJ preservation law").
In this paper he described an analytical solution to the Bayesian problem. At the core of this solution he suggested that, rather than chasing the posterior function with sampling, we can introduce a distribution function q for the latent variable Z. By setting q's family of functions a priori, together with a metric, we can approximate the posterior function using this q. For this solution he used the Euler-Lagrange equation, a fundamental equation in the calculus of variations, which gave the method its name: Variational Inference (VI).
What is variational inference?
In this section we describe VI in more detail. Recall that we aim to approximate a posterior function P(Z|X) using a function q, which is the distribution of Z. The metric that Jordan suggested was the KL divergence (https://projecteuclid.org/download/pdf_1/euclid.aoms/1177729694).
This analytical solution reduces the obstacles of high variance and slow convergence. On the other hand, since we fix a family of functions, we introduce some bias. The following formula illustrates the offered solution:
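In standard form (reconstructed here, since the original shows it as a slide image), the decomposition reads:

log P(X) = KL( q(Z) || P(Z|X) ) + E_q[ log P(X, Z) − log q(Z) ]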
The LHS does not depend on Z, therefore it can be considered a constant. Minimizing the KL divergence is equivalent to maximizing the second term: the "Evidence Lower Bound" (ELBO).
As a function with a pre-defined shape, q has parameters, which we denote by λ.
We can consider VI as an ELBO maximization problem:
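In standard notation (again reconstructed, as the slide formula is an image):

λ* = argmax_λ ELBO(λ)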
Physics!
Jordan's motivation to use this solution came from the works of Helmholtz and Boltzmann in thermodynamics. The ELBO function is extremely similar to the Helmholtz free energy, where the first term is the energy and the second is the entropy of q. Maximizing the ELBO is about increasing the energy while reducing the entropy. Jordan used mean-field theory (MFT) from the Ising model, where magnetic spins are assumed to be uncorrelated, which simplified the handling of q. The Ising problem has an exponential solution (the Boltzmann distribution), which became the common choice for the shape of q.
VI – Example
I will now give an example of how one formulates a VI problem. We will assume that we have data of real numbers with a Gaussian likelihood. The latent variable Z is therefore the pair

Z = {μ, τ}

Using the mean-field assumption, q can be written as the product q(μ, τ) = q_μ(μ) q_τ(τ). A detailed formulation can be found here: https://towardsdatascience.com/variational-inference-in-bayesian-multivariate-gaussian-mixture-model-41c8cc4d82d7
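The slide images with the exact setting are not reproduced here; a standard version of this normal-gamma model (cf. Bishop, PRML, Chapter 10) is:

x_i | μ, τ ~ N(μ, τ⁻¹),  i = 1, …, N
μ | τ ~ N(μ₀, (κ₀ τ)⁻¹)
τ ~ Gamma(a₀, b₀)

with q_μ a Gaussian and q_τ a Gamma distribution.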
One can see that such a formulation requires massive analytical calculations for every model. We will discuss this in the section on BBVI below.
VI received a strong boost in 2003 when Blei published "Latent Dirichlet Allocation", where he used VI for topic extraction.
There are several tools available for VI, such as Edward, Stan, PyMC3, and Pyro; a minimal sketch with one of them follows.
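As a hypothetical illustration of such tools, here is a minimal Pyro sketch of the Beta-Gamma setup studied later in this post (the parameter names a1, a2, b1, b2 and all numeric settings are my own assumptions, not from the original work):

```python
import torch
from torch.distributions import constraints
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

def model(data):
    # Gamma priors over the Beta likelihood's parameters
    alpha = pyro.sample("alpha", dist.Gamma(1.0, 1.0))
    beta = pyro.sample("beta", dist.Gamma(1.0, 1.0))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Beta(alpha, beta), obs=data)

def guide(data):
    # Gamma-shaped q with learnable, positivity-constrained parameters
    a1 = pyro.param("a1", torch.tensor(1.0), constraint=constraints.positive)
    a2 = pyro.param("a2", torch.tensor(1.0), constraint=constraints.positive)
    b1 = pyro.param("b1", torch.tensor(1.0), constraint=constraints.positive)
    b2 = pyro.param("b2", torch.tensor(1.0), constraint=constraints.positive)
    pyro.sample("alpha", dist.Gamma(a1, a2))
    pyro.sample("beta", dist.Gamma(b1, b2))

data = dist.Beta(2.0, 5.0).sample((1000,))  # synthetic probability values
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
for _ in range(2000):
    svi.step(data)
```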
Using VI for Distribution Tail Inference
The problem I had to handle was identifying the tail of observed data. The data itself is a set of probability values, thus we assume that the likelihood has a Beta distribution. Before I discuss the modeling steps, I will give an example of a typical tail problem.
Let's assume that we have a model that predicts a disease with an accuracy of 99%, and that 0.1% of the population is sick. Each day 100,000 people are tested, so roughly 999 of the 99,900 healthy people will hear that they are sick. This mistake is costly, since they will undergo medical procedures. Such problems occur mainly because models are often trained to optimize accuracy under symmetric assumptions. If we rely on our models, tiny changes in the threshold may cause huge damage. We therefore need to understand the 1% tail as accurately as possible.
VI, as an analytical solution, seems an interesting tool for achieving this purpose. Surprisingly, I found that the world of DL hasn't massively embraced VI, which indeed encouraged me to walk this path.
Deep learning for Variational Inference
In this part I will describe my attempts to use DL to solve VI problems.
Recall that our data's likelihood has a Beta distribution, whose conjugate prior is extremely difficult to work with. We therefore take Gamma as the prior, since it has support on the positive real line. In VI terminology, Gamma serves as both the prior and the q distribution. Hence we train a Gamma to sample Z, which is itself the pair {α, β} of the Beta likelihood.
Mixture Density Networks (MDN)
The first framework to be considered was the MDN (Mixture Density Network). This class of networks, proposed by Bishop (http://publications.aston.ac.uk/id/eprint/373/), aims to learn the parameters of a given distribution. We modify the loss: rather than using the likelihood, we use the ELBO, as in the sketch below.
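This is a minimal PyTorch sketch of the idea, under my own assumptions about the architecture and names; it is a reconstruction, not the original commercial code:

```python
import torch
import torch.nn as nn
from torch.distributions import Gamma, Beta

class GammaMDN(nn.Module):
    """MDN-style head: maps a context vector to the four Gamma
    parameters of q(alpha) and q(beta) for the Beta likelihood."""
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 4))

    def forward(self, ctx):
        # softplus keeps all Gamma parameters strictly positive
        return nn.functional.softplus(self.net(ctx)) + 1e-4

def neg_elbo(params, x, prior=Gamma(1.0, 1.0), n_samples=16):
    """Monte Carlo estimate of -ELBO; params holds the four positive
    values for a single context, x are observations in (0, 1)."""
    q_a = Gamma(params[0], params[1])   # q over Beta's alpha
    q_b = Gamma(params[2], params[3])   # q over Beta's beta
    a = q_a.rsample((n_samples,))       # reparameterized Gamma samples
    b = q_b.rsample((n_samples,))
    log_lik = Beta(a.unsqueeze(-1), b.unsqueeze(-1)).log_prob(x).sum(-1)
    log_prior = prior.log_prob(a) + prior.log_prob(b)
    log_q = q_a.log_prob(a) + q_b.log_prob(b)
    return -(log_lik + log_prior - log_q).mean()

# Usage sketch: params = GammaMDN(in_dim=8)(ctx); loss = neg_elbo(params, x)
```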
In all the plots, red represents our VI-based sample and blue is a synthetic sample created by a random engine.
Here is a typical outcome of our MDN trials:
The VI-based sample is relatively close to the synthetic sample. However, it is definitely improvable. Moreover, while the inner part and the left side (which is the tail of interest from my perspective) are fairly similar, in the upper tail we suffer from huge differences.
We therefore wished to try additional tools.
Reinforcement Learning – Actor Critic
Variational inference is often interpreted as a reinforcement learning problem (see http://www0.cs.ucl.ac.uk/staff/d.silver/web/Publications_files/viral.pdf). We consider the q function as the policy function, Z as the action, and the ELBO as the reward. When we use RL tools for such a problem, we must note that two modifications are needed:
• We don't have an episode - we have IID samples
• The actions are not discrete but Gamma distributed
To resolve the first item, we have to modify the Q-learning update formula:
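In its standard tabular form (reconstructed here, since the slide shows it as an image), the update is:

Q(s_t, a_t) ← Q(s_t, a_t) + η [ r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]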
As there is no real episode, one has to remove the second term inside the brackets, γ max_a Q(s_{t+1}, a), since it represents the future expected reward.
Regarding the second item, we have to train a Gamma-form policy function (a nice motivation exists here: http://proceedings.mlr.press/v70/chou17a/chou17a.pdf).
To train the policy function we use actor-critic with experience replay (https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f) and add our modifications, sketched below.
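A minimal sketch of the actor side under these modifications: a score-function update for a Gamma policy, with a critic-adjusted ELBO reward (the names are mine, not the original code):

```python
import torch
from torch.distributions import Gamma

def actor_loss(conc, rate, z, advantage):
    # z: the sampled positive "action"; advantage: critic-adjusted ELBO reward.
    # REINFORCE-style: raise the log-density of actions with positive advantage.
    log_pi = Gamma(conc, rate).log_prob(z)
    return -(advantage.detach() * log_pi).mean()
```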
These are the plots that we got:
The outcomes are fairly good on an interval [0, 1-a) for some small positive a. We totally fail to model the upper tail.
Black-Box Variational Inference
The appearance of VI provided new directions for studying posterior distributions. Nevertheless, it imposed massive analytical computation of expectations (see the Gaussian example above). There was a clear objective to find a more generic scheme that reduces the amount of analysis and better handles huge amounts of data. The slide below is Blei's description of this need:
The offered solution used stochastic optimization, a class of algorithms introduced by Robbins and Monro ("A Stochastic Approximation Method", 1951).
Robbins-Monro Algorithm
The Robbins-Monro algorithm aims to solve a root-finding problem: let F be a function and α a constant, and assume that the equation

F(x) = α

has a unique solution x*, which we wish to find. Assume that F is not observable, but that there exists a random variable T such that

E[T(x)] = F(x)

Robbins and Monro showed that, under certain regularity conditions, the following algorithm converges in L²:
x_{t+1} = x_t − a_t (T(x_t) − α)

where a_1, a_2, … are positive numbers satisfying

∑_i a_i = ∞  and  ∑_i a_i² < ∞

This sequence is called a Robbins-Monro sequence.
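A minimal Python sketch of the iteration with a toy noisy oracle (the step sizes a_n = 1/n satisfy both conditions):

```python
import numpy as np

def robbins_monro(T, alpha, x0=0.0, steps=5000):
    """Find x* with F(x*) = alpha, observing only the noisy
    oracle T with E[T(x)] = F(x)."""
    x = x0
    for n in range(1, steps + 1):
        x -= (1.0 / n) * (T(x) - alpha)  # a_n = 1/n
    return x

# Toy usage: F(x) = 2x observed with Gaussian noise; the root of F(x) = 1 is 0.5.
rng = np.random.default_rng(0)
print(robbins_monro(lambda x: 2.0 * x + rng.normal(0.0, 0.1), alpha=1.0))
```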
Back to BBVI
In a paper from 2013 (https://arxiv.org/pdf/1401.0118.pdf), Ranganath, Gerrish, and Blei suggested a two-step improvement for VI using a Robbins-Monro sequence. The algorithm is called "Black Box Variational Inference" (BBVI). The first step is presented below:
The main idea is to construct a Monte Carlo estimator of the ELBO gradient and feed it into a Robbins-Monro sequence as described in the section above. This estimator (like many other Monte Carlo estimators) suffers from high variance. Blei suggested two methods to reduce this variance (a toy sketch follows the two items):
• Control variates – for the estimator f̂ one searches for a function ĝ such that
E_q[f̂] = E_q[ĝ] and Var_q[ĝ] < Var_q[f̂]
• Rao-Blackwell – the Rao-Blackwell theorem states that if X is an estimator of Z and T is a sufficient statistic, then Y = E(X|T) satisfies
E[(Y − Z)²] ≤ E[(X − Z)²]
In the paper, Blei uses Rao-Blackwellization to create an estimator for each component λ_i, using this property of conditional expectations.
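For concreteness, a toy sketch of one BBVI step without variance reduction, using PyTorch autograd for the score function (the model's log-joint is left abstract; this is my illustration, not the paper's code):

```python
import torch
from torch.distributions import Gamma

def bbvi_step(lam, log_joint, step_size, n_samples=64):
    # One noisy ascent step on lambda = (concentration, rate) of a Gamma q,
    # using the score-function estimator of the ELBO gradient.
    lam = lam.detach().requires_grad_(True)
    q = Gamma(lam[0], lam[1])
    z = q.sample((n_samples,))                # plain sampling, no reparameterization
    log_q = q.log_prob(z)
    weight = (log_joint(z) - log_q).detach()  # log p(x,z) - log q(z|lambda)
    (weight * log_q).mean().backward()        # gradient equals the score estimator
    return (lam + step_size * lam.grad).detach()
```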
Now Plots
I used BBVI without the variance-reduction techniques.
We can see that we reduced the tail issues in comparison to the actor-critic, but we still suffer from the problem. In the last plot we did not have tail issues, but the approximation became weaker.
Results Summary
We saw that using deep learning tools for VI works with partial success. We achieved a fairly good approximation within the interval, but failed to approximate the upper tail. Moreover, in all the frameworks I suffered a certain level of instability.
There are several plausible reasons:
DL issues
• The architecture is not sophisticated enough
• More epochs are needed
Gamma-Beta issues
• Traditionally, DL problems are studied on common distributions such as Gaussian or uniform. These distributions don't have parameter-dependent skewness and kurtosis. This is not the case for Gamma, where the parameters trained by the engine fully determine the skewness and kurtosis. It may require different techniques.
• We use the ELBO as the loss function. This function relies strongly on the prior's shape, and Gamma is not the true prior of Beta, which may suggest that loss modifications are required.
• Beta has compact support. The fact that we have issues only near the tail (a sort of "singular point" of the distribution) may raise obstacles. These "singularity" issues are yet to be studied.
We can summarize that these experiments have shown that VI can be solved using DL machinery. However, it requires further work. Nevertheless, we got some clarity about the enigmatic uncle.
A Torch code that mimics my trials (the real code was used for some commercial work I did, sorry) is available here: https://github.com/natank1/DL-_VI
Acknowledgements
I wish to thank Uri Itai for fruitful discussions before and during the work and for his comprehensive review; Shlomo Kashani and Yuval Shachaf for their ideas and comments; and Leonard Newnham for clarifying issues in the reinforcement learning world.
References
• http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
• https://people.eecs.berkeley.edu/~jordan/papers/variational-intro.pdf
• https://towardsdatascience.com/a-hitchhikers-guide-to-mixture-density-networks-76b435826cca
• https://publications.aston.ac.uk/id/eprint/373/1/NCRG_94_004.pdf
• https://www.ri.cmu.edu/wp-content/uploads/2017/06/thesis-Chou.pdf
• http://www.michaeltsmith.org.uk/#
• https://github.com/trevorcampbell/ubvi/tree/master/examples
• https://arxiv.org/pdf/1811.01132.pdf
• http://proceedings.mlr.press/v70/chou17a/chou17a.pdf
• https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
• https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/
• http://incompleteideas.net/book/the-book-2nd.html
• https://arxiv.org/pdf/1401.0118.pdf
• https://towardsdatascience.com/variational-inference-in-bayesian-multivariate-gaussian-mixture-model-41c8cc4d82d7
• https://projecteuclid.org/download/pdf_1/euclid.aoms/1177729586
• http://edwardlib.org/tutorials/klqp
• https://mc-stan.org/users/interfaces/
• https://github.com/stan-dev/pystan/tree/develop/pystan
• https://docs.pymc.io/notebooks/getting_started.html
• https://projecteuclid.org/download/pdf_1/euclid.aoms/1177729694
• https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf
• http://publications.aston.ac.uk/id/eprint/373/