This chapter discusses approximate inference methods for probabilistic models where exact inference is intractable. It introduces variational inference as a deterministic approximation approach. Variational inference works by restricting the distribution of latent variables to a simpler family that makes computation and optimization easier. The chapter provides examples of using variational inference for Gaussian mixtures and univariate Gaussian models. It explains how to derive a variational lower bound and optimize it using an iterative procedure similar to EM.
1. Chapter 10
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
2. Chapter 10. Approximate Inference
What are we doing?
In this chapter we carry out a fair amount of mathematical work.
Throughout, we have to keep in mind what we are actually doing; otherwise it is easy to lose track of the goal.
Recall the latent-variable models covered in the previous chapter.
We need to compute the expectation of the complete-data log likelihood.
For the Gaussian mixture model this was tractable.
However, for many other models these computations are very difficult:
1. For continuous latent variables, we must perform integrations.
2. For discrete latent variables, we must perform summations over all configurations.
The 1st case is intractable when no closed-form expression can be computed.
The 2nd case is intractable when the number of possible configurations is effectively infinite, or when the cost is simply prohibitive (e.g. big data with quadratic complexity).
Under such conditions, there are two possible families of treatment:
1. Sampling methods (Chapter 11)
2. Deterministic approximation (In this chapter)
In this chapter we study deterministic approximation methods, which work by restricting the functional form of the distributions involved.
3. Chapter 10.1. Variational Inference
Terminology and idea
Before moving on, let us define some basic terminology for this chapter.
1. Function : a mapping that takes the value of a variable as input and returns the value of the function as output.
2. Derivative of a function : how the output value varies as we make infinitesimal (really small) changes to the input value.
3. Functional : a mapping that takes a function as input and returns a scalar value as output.
4. An example of a functional is the entropy, H[p] = −∫ p(x) ln p(x) dx.
5. Functional derivative : as in case 2, but it describes how the value of the functional changes under infinitesimal changes to the input function.
Now, let’s return to the EM algorithm.
Here, L(q) is maximized by driving the KL term to zero, which happens exactly when the variational distribution q(Z) equals the posterior p(Z|X).
Then what should we do when the posterior is too hard to compute?
To make the problem feasible, we restrict q to a simpler family of distributions, chosen so that the computations become tractable while q remains flexible enough to approximate the posterior.
(Figure: yellow – original distribution, red – Laplace approximation, green – variational approximation.)
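For reference, the decomposition this slide relies on (the same one used in the EM chapter, PRML Eqs. 10.2–10.4) is:

\[
\ln p(X) = \mathcal{L}(q) + \mathrm{KL}(q\,\|\,p),
\qquad
\mathcal{L}(q) = \int q(Z)\,\ln\frac{p(X,Z)}{q(Z)}\,dZ,
\qquad
\mathrm{KL}(q\,\|\,p) = -\int q(Z)\,\ln\frac{p(Z\mid X)}{q(Z)}\,dZ .
\]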
4. Chapter 10.1. Variational Inference
Factorized distribution
First, let’s begin with factorization method.
In this example, we restrict the family of prior distribution 𝑞(𝑍).
Here, main idea is ‘partitioning elements of variable 𝒁 to 𝑴 disjoint sets.’
Note that we are not restricting specific functional form of 𝑞𝑖(𝑍𝑖). We only assume they can be partitioned into disjoint sets.
Then here, we can re-write 𝐿(𝑞) by following.
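The factorization and the rewritten bound referred to here are, in the book's notation (PRML Eqs. 10.5–10.7):

\[
q(Z) = \prod_{i=1}^{M} q_i(Z_i),
\qquad
\mathcal{L}(q) = \int q_j \ln \tilde{p}(X, Z_j)\, dZ_j - \int q_j \ln q_j \, dZ_j + \text{const},
\]
\[
\text{where}\quad \ln \tilde{p}(X, Z_j) = \mathbb{E}_{i \neq j}\!\left[\ln p(X, Z)\right] + \text{const}.
\]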
5. Chapter 10.1. Variational Inference
Factorized distribution
Under this assumption, suppose we maximize L(q) with respect to a single factor q_j while keeping the other factors q_{i≠j} fixed.
The solution is easy, since L(q) is then a negative KL divergence between q_j and the normalized form of p̃(X, Z_j), up to an additive constant.
So maximizing L(q) → minimizing KL(q_j||p̃) → q_j = p̃. Finally, we can get the optimal factor below.
Note that the additive constant is exactly what turns this functional form into a probability density function:
we can simply normalize it.
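Written out, the optimal factor and its normalized form (PRML Eqs. 10.9–10.10) are:

\[
\ln q_j^{\star}(Z_j) = \mathbb{E}_{i \neq j}\!\left[\ln p(X, Z)\right] + \text{const},
\qquad
q_j^{\star}(Z_j) = \frac{\exp\!\left(\mathbb{E}_{i \neq j}[\ln p(X, Z)]\right)}{\int \exp\!\left(\mathbb{E}_{i \neq j}[\ln p(X, Z)]\right) dZ_j}.
\]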
Let’s revise once.
Original EM has equation of 𝑞 𝑍 = 𝑝(𝑍|𝑋). This was clear pretty good, but if we computing posterior is intractable, we cannot move further.
However, by changing equation to 𝒍𝒏 𝒒𝒋(𝒁𝒋) = 𝑬𝒊≠𝒋 𝒍𝒏 𝒑 𝑿, 𝒁 , we can update prior without computing posterior!
Here, note the equation depends on other 𝑞𝑖 𝑍𝑖 . Thus, it cannot be done in one optimization.
This should be done iteratively, updating every 𝑞𝑖 sequentially.
Now, let’s see an example!
6. Chapter 10.1. Variational Inference
Properties of factorized approximations
We can see the character of this method by studying the example of a multivariate Gaussian.
Here we approximate a correlated two-dimensional Gaussian with the factorized form q(z) = q_1(z_1) q_2(z_2).
By using the following equation,
you can see that the right-hand side is a quadratic function of z_1,
which means the optimal factor q_1(z_1) is a Gaussian distribution.
By symmetry, the same holds for q_2(z_2).
Fortunately, this example works out in closed form.
Furthermore, as you can see, E[z_1] = μ_1 and E[z_2] = μ_2.
Thus, the mean of the estimated distribution matches the mean of the original distribution.
This is shown in the following figure.
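A minimal NumPy sketch of this two-dimensional example (the mean vector and precision matrix below are made-up illustrative values, not from the slides):

```python
import numpy as np

# Mean-field approximation of a correlated 2-D Gaussian (cf. PRML Sec. 10.1.2).
# p(z) = N(mu, Lambda^{-1});  q(z) = q1(z1) q2(z2), each factor Gaussian.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],          # precision matrix (positive definite)
                [1.2, 2.0]])

m = np.zeros(2)                      # current factor means E[z1], E[z2]
for _ in range(50):                  # coordinate-ascent updates
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

var_q = 1.0 / np.diag(Lam)           # factor variances: Lambda_ii^{-1}
var_p = np.diag(np.linalg.inv(Lam))  # true marginal variances
print("factor means:", m)            # converge to mu, matching the true means
print("factor variances:", var_q)    # smaller than the true marginals -> too compact
print("true marginal variances:", var_p)
```

Notice that the means are recovered exactly, while the factor variances are controlled by the precision diagonal and therefore underestimate the true marginal variances, as the figure illustrates.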
7. Chapter 10.1. Variational Inference
Properties of factorized approximations
Secondly, there is an alternative approach that reverses the direction of the KL divergence, from KL(q||p) to KL(p||q).
Note that the KL divergence is not a symmetric function.
The equation can be written as
KL(p||q) = −∫ p(Z) ln q(Z) dZ + ∫ p(Z) ln p(Z) dZ.
Minimizing this over a factorized q can be done in closed form:
each optimal factor q_j(Z_j) is simply the corresponding marginal of p itself,
q_j(Z_j) = p(Z_j).
(Figure: approximations obtained by minimizing KL(q||p) versus KL(p||q) on the same target distribution.)
Since the expectations match, the approximation still captures the local mean well.
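To make the asymmetry concrete, here is a small sketch comparing the two directions of the KL divergence between two univariate Gaussians (closed-form expression; the parameter values are arbitrary):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# A broad distribution p and a narrower approximation q.
m_p, s_p = 0.0, 2.0
m_q, s_q = 0.5, 1.0

print("KL(q||p) =", kl_gauss(m_q, s_q, m_p, s_p))  # direction minimized in variational inference
print("KL(p||q) =", kl_gauss(m_p, s_p, m_q, s_q))  # reversed direction; a different value
```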
8. Chapter 10.1. Variational Inference
Univariate Gaussian Example
Now let's consider a univariate Gaussian example with mean μ and precision τ, given a dataset D = {x_1, x_2, …, x_N}.
Here the likelihood function and the conjugate prior can be written as
The goal of this task is to find the posterior distribution of these parameters.
Using the factorization assumption, we express the approximate posterior as q(μ, τ) = q_μ(μ) q_τ(τ).
Note that this factorized expression is not the exact posterior.
Major framework
Now, by using p(D, μ, τ) = p(D|μ, τ) p(μ|τ) p(τ), we can derive q(μ) and q(τ) in turn.
When deriving q(μ), the p(τ) factor does not involve μ, so we can simply throw it into the additive constant.
9. Chapter 10.1. Variational Inference
Univariate Gaussian Example
Major framework
The interesting part is that we never assumed any functional form for q(·), yet the optimal factors turn out to match the conjugate families (Gaussian for μ, Gamma for τ).
Here the parameters of q(μ) and q(τ) depend on each other through expectations such as E[μ] and E[τ].
Thus the two factors must be estimated iteratively, each update plugging in the current moments of the other factor.
Let’s see how this iterative optimization occurs!
10. Chapter 10.1. Variational Inference
Univariate Gaussian Example
See how q_μ(μ) q_τ(τ) fits the desired true posterior p(μ, τ|D).
The update formulas involve certain moments of the parameters; they can be computed as…
which, for broad (non-informative) priors, corresponds to the unbiased (UMVU) estimator of the Gaussian variance.
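A minimal sketch of these coupled updates on synthetic data (broad priors; the data, variable names and prior values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)   # synthetic data (illustrative)
N, xbar = x.size, x.mean()

# Priors: p(mu|tau) = N(mu | mu0, (lam0*tau)^-1),  p(tau) = Gamma(tau | a0, b0)
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3

E_tau = 1.0                                    # initial guess for E[tau]
for _ in range(100):
    # q(mu) = N(mu | muN, lamN^-1), using the current E[tau]
    muN = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lamN = (lam0 + N) * E_tau
    E_mu, E_mu2 = muN, muN**2 + 1.0 / lamN
    # q(tau) = Gamma(tau | aN, bN), using the current moments of mu
    aN = a0 + (N + 1) / 2.0
    bN = b0 + 0.5 * (np.sum(x**2) - 2 * N * xbar * E_mu + N * E_mu2
                     + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = aN / bN

print("E[mu] =", muN, "  E[tau] =", E_tau, "  1/E[tau] =", 1 / E_tau)
```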
11. Chapter 10.1. Variational Inference
Model comparison
This methodology can be applied not only to posterior approximation but also to model comparison.
Let m index each candidate model. Unfortunately, since each model has a different structure, we cannot simply factorize the variational distribution across Z and m.
Instead, we use q(Z, m) = q(Z|m) q(m).
(** For convenience we assume discrete latent variables, which is why summations appear.)
From this, we can get the following proportional relation.
Rather than optimizing everything jointly, we optimize the bound L with respect to each individual q(Z|m) separately.
Then we compute q(m) for every model and compare them.
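The proportionality relation referred to above is (cf. PRML Sec. 10.1.4):

\[
q(m) \propto p(m)\,\exp(\mathcal{L}_m),
\qquad
\mathcal{L}_m = \sum_{Z} q(Z\mid m)\,\ln\frac{p(Z, X\mid m)}{q(Z\mid m)} .
\]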
12. Chapter 10.2. Variational Mixture of Gaussians
Variational Mixture of Gaussians
Let’s re-visit gaussian mixture example.
Here, variational mixture approach can give great help to fully Bayesian treatment.
Here, what does ‘fully Bayesian treatment’ mean?
Consider parameters of 𝜋, 𝜇𝑘, Σ𝑘 in gaussian mixture model.
We estimated them by using EM, but that was actually parametric approach, which means they are fixed constant.
Here, we are talking about fully Bayesian approach, assuming specific distribution for each parameter. Let’s see how formula forms.
Latent distribution
Likelihood
Now, we have to approximate their distribution!
Prior distributions!
13. Chapter 10.2. Variational Mixture of Gaussians
Variational distribution
There are several difficulties with a direct implementation.
To make things tractable, we again use the variational method with the factorization trick.
The key variational distribution is
Note that these factors are approximations, not the exact posterior.
We find these approximations in two stages:
1. Finding q(Z)
2. Finding q(π, μ, Λ)
1. Finding q*(Z): we keep only the terms of the joint distribution that involve the latent variable Z.
This form can be obtained simply by using these equations.
Here, D is the dimensionality of the data matrix X.
14. Chapter 10.2. Variational Mixture of Gaussians
Variational distribution
Here, please note that the variational distribution of Z has the same functional form as its prior.
Furthermore, we can compute the responsibilities as
Furthermore, for simplicity, let's define some summary statistics.
15. Chapter 10.2. Variational Mixture of Gaussians
Variational distribution
Here we can observe some interesting facts:
1. π interacts only with Z.
2. The other parameters (μ_k, Λ_k) are linked together.
That is, we can decompose this variational distribution into
2. Finding q*(π, μ, Λ)
16. Chapter 10.2. Variational Mixture of Gaussians
Variational distribution
The M-step-like update requires the expectations of several random variables (the quantities denoted by E[·]).
They can be written as
Here ψ(·) denotes the digamma function, i.e. the derivative of the log of the gamma function.
Finally, we can compute the responsibilities as
17. Chapter 10.2. Variational Mixture of Gaussians
Variational distribution
Now, let’s revise overall procedure!
1. We compute moments which we covered in previous slide and find estimated moment. (𝐸 − 𝑆𝑡𝑒𝑝)
2. Then, we update each posterior with newly calculated variables. (𝑀 − 𝑆𝑡𝑒𝑝)
3. Iteratively continue this process until converge.
This variational Bayesian GMM has great strength that it automatically
searches optimal number of components.
This means a model sequentially abandons relatively weak component.
Here, weak component indicates a component that contains relatively less
number of data point!
Note that as number of data (𝑁) increases, overall equation converges to that of
basic gaussian mixture.(MLE solution).
Still, this model has strength in overfitting and data collapsing.
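As a quick way to see this component-pruning behaviour in practice, here is a sketch using scikit-learn's BayesianGaussianMixture, an off-the-shelf variational Bayesian GMM (the data and prior settings below are illustrative assumptions, not from the slides):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: 2 true clusters, but we deliberately allow up to 10 components.
X = np.vstack([rng.normal(-2, 0.5, size=(200, 2)),
               rng.normal(+2, 0.5, size=(200, 2))])

vb_gmm = BayesianGaussianMixture(
    n_components=10,                                           # upper bound on components
    weight_concentration_prior_type="dirichlet_distribution",  # finite Dirichlet prior, as in PRML
    weight_concentration_prior=1e-3,                           # small alpha_0 encourages pruning
    max_iter=500, random_state=0)
vb_gmm.fit(X)

# Weak components have their mixing weights pushed toward zero
# ("automatic relevance determination" in action).
print(np.round(vb_gmm.weights_, 3))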
18. Chapter 10.2. Variational Mixture of Gaussians
Variational lower bound
Consider how, when training a deep-learning model, we print the loss every few iterations.
Likewise, we can print the variational lower bound to check whether the model is being optimized properly,
because the lower bound L(q) should never decrease from one update to the next.
Here, H denotes the entropy of the Wishart distribution.
19. Chapter 10.2. Variational Mixture of Gaussians
Predictive distribution
Since this is a clustering algorithm, it should be able to predict the cluster label of a newly observed data point.
Just like other Bayesian models, it marginalizes out the parameters to obtain the desired predictive distribution.
I'll skip the derivation; the resulting predictive density is a mixture of Student's t-distributions.
The component k that yields the greatest probability can be chosen as the clustering result.
20. Chapter 10.2. Variational Mixture of Gaussians
Determining the number of components
Note that the clustering result is not fixed:
it depends on the initialization and on the prior values.
Thus, we can choose the number of components by comparing the (bound on the) likelihood obtained for each candidate number of components.
This would be inappropriate for the basic GMM, since the likelihood of a GMM increases monotonically as the number of components increases.
In the Bayesian approach, however, there is an intrinsic trade-off between model complexity and likelihood, which acts as automatic regularization.
Thus, comparing L(q) across models is reasonable for the Bayesian GMM.
Furthermore, as mentioned, the Bayesian GMM has the intrinsic ability to select a suitable number of components by letting some π_k converge to zero. This notion is called automatic relevance determination.
21. Chapter 10.3. Variational Linear Regression
Fully Bayesian linear regression
Now we study the variational approach to linear regression.
In the earlier Bayesian linear models we did not place a distribution over the hyperparameter: α in
p(w|α) = N(w|0, α⁻¹I) was a constant, optimized by the evidence approximation.
Here, however, we treat α as a random variable with its own distribution.
We give α a Gamma distribution.
The joint distribution of all the variables can then be written as
Now let's find the variational distributions of w and α. Using the basic factorization idea,
the approximation is obtained as
Note that M denotes the number of basis functions.
22. Chapter 10.3. Variational Linear Regression
Fully Bayesian linear regression
Since the result has the form of a Gamma distribution, this variational factor is itself a Gamma distribution.
In this way we find q(α); similarly, we can obtain the distribution of w.
Note that w appears in a quadratic form, which indicates a Gaussian distribution.
Note that the resulting equations differ from those of Chapter 3 only in that α is replaced by E[α].
This expression becomes essentially the same as the EM/evidence-approximation result when we set
a₀ = b₀ = 0, which means we have no prior knowledge about the parameter α.
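A minimal sketch of these iterative updates, with the noise precision β treated as known and broad Gamma hyperparameters (all specific values and names below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic polynomial-regression data; beta (noise precision) assumed known.
N, M, beta = 30, 4, 25.0
x = rng.uniform(-1, 1, size=N)
t = x**3 + rng.normal(scale=1/np.sqrt(beta), size=N)
Phi = np.vander(x, M, increasing=True)           # design matrix: 1, x, x^2, x^3

a0 = b0 = 1e-6                                   # broad Gamma prior on alpha
E_alpha = 1.0
for _ in range(50):
    # q(w) = N(w | mN, SN), given the current E[alpha]
    SN = np.linalg.inv(E_alpha * np.eye(M) + beta * Phi.T @ Phi)
    mN = beta * SN @ Phi.T @ t
    # q(alpha) = Gamma(alpha | aN, bN), given the current moments of w
    aN = a0 + M / 2.0
    bN = b0 + 0.5 * (mN @ mN + np.trace(SN))     # 0.5 * E[w^T w]
    E_alpha = aN / bN

print("posterior mean weights:", np.round(mN, 3))
print("E[alpha] =", E_alpha)
```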
23. Chapter 10.3. Variational Linear Regression
Predictive distribution & Lower bound of variational linear regression
The big idea is the same; only the details differ.
Here we use E[α] in place of a fixed α.
Similarly, we can evaluate the lower bound to compare different models.
(Figure: L(q) plotted against the order of the polynomial model, for data generated from a cubic (x³) ground truth; the bound is maximized at M = 3.)
24. Chapter 10.4. Exponential Family Distributions
Exponential Family Distributions
We covered the exponential family in Mathematical Statistics II and in PRML Chapter 2.
Let's review it briefly.
Note that g(η) is the normalizing constant that makes the expression a proper probability distribution.
Why is it important?
1. Most of the distributions we commonly use belong to the exponential family.
2. Distributions in this family have conjugate priors.
3. They admit sufficient statistics (and UMVU estimators).
4. The posterior can be expressed in closed form.
In fact, many of the models we have covered already belong to the exponential family,
but they were treated as individual special cases. So let's now study variational inference under the general exponential-family framework.
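For reference, the general exponential-family form and its conjugate prior being quoted here (cf. PRML Chapter 2) are:

\[
p(x \mid \eta) = h(x)\, g(\eta)\, \exp\!\left\{\eta^{\mathsf T} u(x)\right\},
\qquad
p(\eta \mid \chi, \nu) = f(\chi, \nu)\, g(\eta)^{\nu}\, \exp\!\left\{\nu\, \eta^{\mathsf T} \chi\right\}.
\]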
25. Chapter 10.4. Exponential Family Distributions
Variational distribution
Now we have to define the variational distributions of Z and η.
Here q(Z, η) = q(Z) q(η). Applying the general update formula, we can rewrite each factor as follows.
The variational distribution over Z decomposes over the individual data points.
E-step: compute E_η[u(x_n, z_n)], i.e. fill in the unseen latent variables.
M-step: update each of the distributions.
Furthermore, we can express this scheme as variational message passing.
26. Chapter 10.5. Local Variational Methods
Local approach of variational methods
Note that all the previous methods were "global" approaches: we approximated the posterior over all variables at once.
Now we turn to the local approach, which bounds or approximates only part of the overall expression.
The reason for using such a method is to simplify the resulting distribution (for example, to make an intractable integral tractable).
First, we discuss convex functions, for which every chord lies on or above the function itself.
Here we make use of max/min (duality) relations. Let's work through an example.
Let’s consider function 𝑓 𝑥 = exp(−𝑥) which was drawn in red line.
We are trying to approximate 𝑓(𝑥) with a simpler function.
By using taylor series, we can re-write 𝑓(𝑥) by 𝑓 𝑥 ≈ 𝒚 𝒙 = 𝒇 𝝃 + 𝒇′
𝝃 𝒙 − 𝝃
This is a tangent line of 𝑓(𝑥) at the point of (𝜉, 𝑓(𝜉)).
As you can see in the left-hand side figure, every tangent line lies beneath the original function.
Thus, such inequality satisfies.
𝑦 𝑥 ≤ 𝑓 𝑥 , 𝑤ℎ𝑒𝑟𝑒 𝑒𝑞𝑢𝑎𝑙𝑖𝑡𝑦 𝑜𝑐𝑐𝑢𝑟𝑠 𝑖𝑛 𝑥 = 𝜉
Now, above tangent line can be re-written as
𝒚 𝒙 = 𝐞𝐱𝐩 −𝝃 − 𝐞𝐱𝐩(−𝝃)(𝒙 − 𝝃)
27. Chapter 10.5. Local Variational Methods
Local approach of variational methods
For generality, let’s assume 𝝀 = −𝒆𝒙𝒑(−𝝃). Equation can be re-written as
𝑦 𝑥, 𝜆 = 𝜆𝑥 − 𝜆 + 𝜆 ln(−𝜆)
Now, let’s fix 𝑥. Then there might be various tangent line.
Like right figure, tangent line at that point gives the maximum value of approximation, which is equal
to the original functional value.
Thus, this can be 𝒇 𝒙 = 𝒎𝒂𝒙
𝝀
(𝝀𝒙 − 𝝀 + 𝝀 𝒍𝒏 𝝀)
Now, let’s move on to more general case of function! We fix 𝝀 and change 𝒙!
We do not set any functional shape, we only guarantee the characteristic of convexity!
Just like left figure, arbitrary line which passes the
origin can be approximated to original function by
moving a bit up(or down).
That intercept value can be denoted as 𝑔(𝜆).
It is a minimum vertical distance between original
𝑓(𝑥) and line 𝑦(𝑥). (By moving 𝑔(𝜆) it kisses the 𝑓(𝑥))
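A quick numerical sketch of this dual representation for f(x) = exp(−x) (the test point and the grid of slopes are arbitrary):

```python
import numpy as np

f = lambda x: np.exp(-x)
# Tangent-line family: y(x, lam) = lam*x - lam + lam*ln(-lam), valid for lam < 0
y = lambda x, lam: lam * x - lam + lam * np.log(-lam)

x = 1.3                                    # arbitrary test point
lams = -np.exp(-np.linspace(-3, 5, 2001))  # a dense grid of negative slopes
vals = y(x, lams)

print("f(x)             =", f(x))
print("max_lam y(x,lam) =", vals.max())    # approaches f(x): the tangent family recovers f
print("all tangents <= f(x):", np.all(vals <= f(x) + 1e-12))
```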
28. Chapter 10.5. Local Variational Methods
Local approach of variational methods
Thus, finding the intercept g(λ) can be written as
Now let's fix x and vary λ, just as we did in the first example. The relation can be written as
For a concave function, which looks like the right-hand figure,
(** an easy way to remember which is which: a concave function is shaped like the opening of a cave)
we can apply the same equations, only exchanging the direction of the bounds and the signs:
that is, max becomes min and min becomes max.
Actually, we don't need to memorize this; just draw a simple function and think it through.
29. Chapter 10.5. Local Variational Methods
Logistic sigmoid example
Upper bound of the logistic sigmoid
Consider the logistic sigmoid function, defined as σ(x) = 1 / (1 + e^(−x)).
Note that it is neither concave nor convex.
However, if we take its logarithm, the resulting function is concave. Therefore we can derive a bound.
(This log-sigmoid is exactly the quantity that appears in the binary cross-entropy of a binary classification model.)
We then obtain the upper bound
ln σ(x) ≤ λx − g(λ), i.e. σ(x) ≤ exp(λx − g(λ)).
That is, we have found an upper bound on the logistic function. (The reason we want upper/lower bounds will become clear soon.)
Lower bound of the logistic sigmoid
Deriving the lower bound is a bit more involved. Let's rewrite ln σ(x) as…
30. Chapter 10.5. Local Variational Methods
Logistic sigmoid example
Note that ln σ(x) = x/2 − ln(e^(x/2) + e^(−x/2)). The x/2 term is a simple monotonically increasing (linear) function, and the remaining part,
f(x) = −ln(e^(x/2) + e^(−x/2)), is a convex function with respect to x². (Let's skip the proof.)
By using the general convexity idea and the max/min relation above, we can compute the bound.
Denote the point of tangency by ξ; setting the derivative to zero lets us rewrite the equation in terms of ξ.
Finally, we can derive the lower bound of the logistic sigmoid as follows,
and exponentiating turns the bound on ln σ(x) into a bound on σ(x) itself.
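A short numerical check of the resulting lower bound σ(x) ≥ σ(ξ) exp{(x − ξ)/2 − λ(ξ)(x² − ξ²)} with λ(ξ) = tanh(ξ/2)/(4ξ) (the value of ξ below is arbitrary):

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
lam = lambda xi: np.tanh(xi / 2.0) / (4.0 * xi)   # equivalently (sigma(xi) - 0.5) / (2 * xi)

def lower_bound(x, xi):
    """Sigmoid lower bound: sigma(xi)*exp((x-xi)/2 - lam(xi)*(x**2 - xi**2))."""
    return sigma(xi) * np.exp((x - xi) / 2.0 - lam(xi) * (x**2 - xi**2))

x = np.linspace(-6, 6, 1001)
xi = 2.5                                           # any nonzero variational parameter
print("bound holds everywhere:", np.all(lower_bound(x, xi) <= sigma(x) + 1e-12))
print("tight at x = +/- xi:", np.isclose(lower_bound(xi, xi), sigma(xi)),
      np.isclose(lower_bound(-xi, xi), sigma(-xi)))
```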
31. Chapter 10.5. Local Variational Methods
Logistic sigmoid example
Actually, this is the main point of this sub-section.
Why are we going through all of this complicated work?
Because it can make an otherwise intractable integration possible.
Consider the Bayesian predictive distribution, an integral of the form ∫ σ(a) p(a) da.
Note that p(a) is a Gaussian distribution.
This integral is very hard to compute exactly. So what if we replace σ(a) with its lower bound,
σ(a) ≥ f(a, ξ)?
Here we have to choose the ξ that makes the resulting bound F(ξ) as large (i.e. as tight) as possible.
Note that the optimal choice of ξ depends on the value of a.
Let's see how this idea is used in the following section.
32. Chapter 10.6. Variational Logistic Regression
Variational approximation of logistic
Recall that it was essentially impossible to find a closed form for the posterior or the predictive distribution in logistic regression.
For that reason we previously used the Laplace approximation to compute these quantities.
Now we are going to approximate them with the variational method instead.
As in the previous section, we work with a lower bound; more precisely, we maximize the lower bound.
By the i.i.d. assumption, we can express
where a = wᵀφ.
Here, recall how we derived the lower bound of the logistic sigmoid.
33. Chapter 10.6. Variational Logistic Regression
Variational approximation of logistic
1. Note that the input to the sigmoid is a = wᵀφ.
2. In the variational setting we assume the data points are independent and apply a different ξ_n to every data point. Thus,
for computational convenience, we take the logarithm of each term.
Here, we are trying to obtain q(w). From this, we can derive…
34. Chapter 10.6. Variational Logistic Regression
Variational approximation of logistic
Here we treat ξ as a fixed constant, so we can collect the right-hand side as a function of w.
Note that w appears in a quadratic form, so q(w) is a Gaussian distribution.
There is one interesting aspect of this equation:
we can compute it in a sequential manner, one data point at a time. Why? Because the posterior precision is a sum of per-data-point contributions.
Likewise, m_N can also be accumulated in the same way.
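Written out, the Gaussian factor referred to above is (cf. PRML Sec. 10.6.1):

\[
q(w) = \mathcal{N}(w \mid m_N, S_N),
\qquad
S_N^{-1} = S_0^{-1} + 2\sum_{n=1}^{N} \lambda(\xi_n)\,\phi_n \phi_n^{\mathsf T},
\qquad
m_N = S_N\Bigl(S_0^{-1} m_0 + \sum_{n=1}^{N}\bigl(t_n - \tfrac{1}{2}\bigr)\phi_n\Bigr).
\]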
35. Chapter 10.6. Variational Logistic Regression
Optimizing the variational parameters
We use an EM-style procedure to optimize m_N, S_N and ξ.
It is helpful to think of w as a latent variable. Let's briefly review EM.
Note that this expectation is taken with respect to w.
(One term does not depend on ξ and can be dropped; another cancels against the following 0.5·wᵀφ term.)
To find the new ξ, we have to take the derivative of this Q function with respect to ξ.
36. Chapter 10.6. Variational Logistic Regression
New variational parameter 𝝃𝒏
Let’s revise EM in variational logistic for short.
1. Initialize 𝜉𝑛
𝑜𝑙𝑑
2. Evaluate posterior distribution of q(𝑤)
3. Compute 𝜉𝑛
𝑛𝑒𝑤
Otherwise, we can simply(?) calculate 𝐿(𝑄) and set derivative of 𝜉
For the predictive distribution(which is most crucial in logistic regression),
Then we can use same techniques after this, as we did in chapter 4.
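A minimal end-to-end sketch of this EM-style loop on synthetic data (the prior, feature construction and all specific values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic 2-class data with a bias feature (illustrative).
N = 200
X = rng.normal(size=(N, 2))
w_true = np.array([1.0, 2.0, -1.0])
Phi = np.hstack([np.ones((N, 1)), X])                  # features phi_n = [1, x1, x2]
t = (rng.random(N) < 1 / (1 + np.exp(-Phi @ w_true))).astype(float)

lam = lambda xi: np.tanh(xi / 2.0) / (4.0 * xi)

S0_inv = np.eye(3)                                     # prior precision; prior mean m0 = 0
xi = np.ones(N)                                        # initialize xi_n
for _ in range(50):
    # Update q(w) = N(mN, SN) given the current xi
    SN_inv = S0_inv + 2.0 * (Phi * lam(xi)[:, None]).T @ Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (Phi.T @ (t - 0.5))
    # Re-estimate xi_n^2 = phi_n^T (SN + mN mN^T) phi_n
    A = SN + np.outer(mN, mN)
    xi = np.sqrt(np.einsum("nd,de,ne->n", Phi, A, Phi))

print("posterior mean of w:", np.round(mN, 2), " (true:", w_true, ")")
```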
37. Chapter 10.6. Variational Logistic Regression
Inference of hyper parameters
There was a prior covariance S_0. So far we assumed its value was known, but now we treat the prior precision as a random variable.
This is again a fully Bayesian treatment. Consider the prior distribution
p(w|α) = N(w | 0, α⁻¹I).
As usual, we give α a conjugate Gamma distribution,
p(α) = Gamma(α | a₀, b₀).
The marginal likelihood and the joint distribution are then given as
As we have done many times, we introduce a variational distribution q(w, α).
To set up the optimization, we again write down the bound on the marginal likelihood,
but note that L(q) is intractable because of the p(t|w) term.
Thus, we additionally apply the local (sigmoid) lower bound here.
38. Chapter 10.6. Variational Logistic Regression
Inference of hyper parameters
By using the fact that
we can now compute q(w, α) = q(w) q(α), each factor individually.
The lower bound can then be computed as follows.
Why a Gaussian? Because w appears in a quadratic form inside the logarithm.
Why a Gamma? Because, inside the logarithm, α appears only to the first order (together with a ln α term).
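For reference, the resulting factors have the following form (cf. PRML Sec. 10.6.3; this is a reconstruction of the updates the slide refers to):

\[
q(w) = \mathcal{N}(w \mid \mu_N, \Sigma_N),
\quad
\Sigma_N^{-1} = \mathbb{E}[\alpha]\, I + 2\sum_{n} \lambda(\xi_n)\,\phi_n\phi_n^{\mathsf T},
\quad
\mu_N = \Sigma_N \sum_{n} \bigl(t_n - \tfrac12\bigr)\phi_n,
\]
\[
q(\alpha) = \mathrm{Gam}(\alpha \mid a_N, b_N),
\qquad
a_N = a_0 + \tfrac{M}{2},
\qquad
b_N = b_0 + \tfrac12\,\mathbb{E}\bigl[w^{\mathsf T} w\bigr]
     = b_0 + \tfrac12\bigl(\mu_N^{\mathsf T}\mu_N + \operatorname{Tr}\Sigma_N\bigr).
\]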
39. Chapter 10.6. Variational Logistic Regression
Inference of hyper parameters
Furthermore, we still have to compute ξ for the full model. The required expectation values can be computed as follows.
Let's recap the variational approach once more.
1. Sometimes it is hard to compute the posterior, or to integrate certain functions.
2. We assume a variational structure in which the distribution factorizes over the latent variables and parameters.
3. For example, it can be broken into parts such as q(w, α) = q(w) q(α).
4. The reason for making such an approximation is that the posterior or the required integrals are intractable under the full model.
5. This is especially helpful in fully Bayesian models, which also place distributions over the hyperparameters.
6. Note that optimization of such models can be carried out with an EM-style iterative algorithm.
Let’s skip expectation propagation. We’ll cover it if we have time… (End of summer break is coming… Last semester is about to begin…!)