A 3-hour introductory lecture on Approximate Bayesian Computation (ABC), given as part of a PhD course at Lund University, February 2016. For sample code see http://www.maths.lu.se/kurshemsida/phd-course-fms020f-nams002-statistical-inference-for-partially-observed-stochastic-processes/
ABC with data cloning for MLE in state space models - Umberto Picchini
An application of the "data cloning" method for parameter estimation via MLE aided by Approximate Bayesian Computation. The relevant paper is http://arxiv.org/abs/1505.06318
My data are incomplete and noisy: Information-reduction statistical methods f... - Umberto Picchini
We review parameter inference for stochastic modelling in complex scenarios, such as poor parameter initialization and near-chaotic dynamics. We show how state-of-the-art methods for state-space models can fail while, in some situations, reducing data to summary statistics (information reduction) enables robust estimation. Wood's synthetic likelihood method is reviewed, and the lecture closes with an example of approximate Bayesian computation methodology.
Accompanying code is available at https://github.com/umbertopicchini/pomp-ricker and https://github.com/umbertopicchini/abc_g-and-k
Readership lecture given at Lund University on 7 June 2016. The lecture is of a popular-science nature, hence mathematical detail is kept to a minimum. However, numerous links and references are offered for further reading.
Numerical Fourier transform based on hyperfunction theory - Hidenori Ogata
These are the slides of a contributed talk at the conference "ECMI2018 (The 20th European Conference on Mathematics for Industry)", 18-20 June 2018, Budapest, Hungary. In this talk, we propose a numerical method for Fourier transforms based on hyperfunction theory.
Inference for stochastic differential equations via approximate Bayesian comp... - Umberto Picchini
Despite the title the methods are appropriate for more general dynamical models (including state-space models). Presentation given at Nordstat 2012, Umeå. Relevant research paper at http://arxiv.org/abs/1204.5459 and software code at https://sourceforge.net/projects/abc-sde/
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection - Yu Liu
The paper "Combinatorial Model and Bounds for Target Set Selection" by Eyal Ackerman, Oren Ben-Zwi, and Guy Wolfovitz presents:
1. a combinatorial model for the dynamic activation process of influential networks;
2. representations of the Perfect Target Set Selection Problem and its variants by linear integer programs;
3. combinatorial lower and upper bounds on the size of the minimum Perfect Target Set.
We apply the tensor train (TT) data format to solve an elliptic PDE with uncertain coefficients, reducing complexity and storage from exponential to linear. Post-processing in the TT format is also provided.
Conditional mixture model for modeling attributed dyadic data - Loc Nguyen
Dyadic data contains co-occurrences of objects and is often modeled by a finite mixture model, which in turn is learned by the expectation-maximization (EM) algorithm. Objects in traditional dyadic data are identified only by names, with the drawback that it is impossible to extract implicit valuable knowledge about the objects. In this research, I propose so-called attributed dyadic data (ADD), in which each object has an informative attribute and each co-occurrence of two objects is associated with a value. ADD is flexible and covers most structures/forms of dyadic data. A conditional mixture model (CMM), a variant of the finite mixture model, is applied to learning ADD. Moreover, a significant feature of CMM is that any co-occurrence of two objects is based on some conditional variable. As a result, CMM can predict or estimate co-occurrence values based on a regression model, which extends the applications of ADD and CMM.
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
Runtime Analysis of Population-based Evolutionary Algorithms - Per Kristian Lehre
Populations are at the heart of evolutionary algorithms (EAs). They provide the genetic variation which selection acts upon. A complete picture of EAs can only be obtained if we understand their population dynamics. A rich theory on runtime analysis (also called time-complexity analysis) of EAs has been developed over the last 20 years. The goal of this theory is to show, via rigorous mathematical means, how the performance of EAs depends on their parameter settings and the characteristics of the underlying fitness landscapes. Initially, runtime analysis of EAs was mostly restricted to simplified EAs that do not employ large populations, such as the (1+1) EA. This tutorial introduces more recent techniques that enable runtime analysis of EAs with realistic population sizes.
The tutorial begins with a brief overview of the population‐based EAs that are covered by the techniques. We recall the common stochastic selection mechanisms and how to measure the selection pressure they induce. The main part of the tutorial covers in detail widely applicable techniques tailored to the analysis of populations. We discuss random family trees and branching processes, drift and concentration of measure in populations, and level‐based analyses.
To illustrate how these techniques can be applied, we consider several fundamental questions: When are populations necessary for efficient optimisation with EAs? What is the appropriate balance between exploration and exploitation and how does this depend on relationships between mutation and selection rates? What determines an EA's tolerance for uncertainty, e.g. in form of noisy or partially available fitness?
This tutorial was presented at the 2015 IEEE Congress on Evolutionary Computation at Sendai, Japan, May 25th 2015.
The expectation maximization (EM) algorithm is a popular and powerful mathematical method for parameter estimation when both observed data and hidden data exist. Therefore, EM is appropriate for applications which aim to exploit latent aspects of heterogeneous data. This report focuses on the probabilistic finite mixture model, a popular and successful application of EM, which is fully explained in my book (Nguyen, Tutorial on EM algorithm, 2020, pp. 78-88). I also propose a special regression model associated with the mixture model in which missing values are acceptable.
Handling missing data with expectation maximization algorithm - Loc Nguyen
The expectation maximization (EM) algorithm is a powerful mathematical tool for estimating the parameters of statistical models in the case of incomplete or hidden data. EM assumes that there is a relationship between hidden data and observed data, which can be a joint distribution or a mapping function. This in turn implies another, implicit relationship between parameter estimation and data imputation. If missing data, which contains missing values, is considered as hidden data, it is very natural to handle missing data with the EM algorithm. Handling missing data is not new research, but this report focuses on the theoretical basis, with detailed mathematical proofs, for filling in missing values with EM. Besides, the multivariate normal distribution and the multinomial distribution are the two sample statistical models considered for holding missing values.
On the Family of Concept Forming Operators in Polyadic FCA - Dmitrii Ignatov
Triadic Formal Concept Analysis (3FCA) was introduced by Lehmann and Wille almost two decades ago, and many researchers in Data Mining and Formal Concept Analysis work with the notions of closed sets, Galois and closure operators, and closure systems. However, to date, even though different researchers actively work on mining triadic and n-ary relations, a proper closure operator for the enumeration of triconcepts, i.e. maximal triadic cliques of tripartite hypergraphs, has not been introduced. In this talk we show that the previously introduced operators for obtaining triconcepts are not always consistent, describe their family, and study their properties. We also introduce the notion of a maximal switching generator to explain why such concept-forming operators are not closure operators, due to violation of the monotonicity property.
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms - Sean Meyn
A tutorial, and very new algorithms -- more details on arXiv and at NIPS 2017 https://arxiv.org/abs/1707.03770
Part of the Data Science Summer School at École Polytechnique: http://www.ds3-datascience-polytechnique.fr/program/
---------
2018 Updates:
See Zap slides from ISMP 2018 for new inverse-free optimal algorithms
Simons tutorial, March 2018 [one month before most discoveries announced at ISMP]
Part I (Basics, with focus on variance of algorithms)
https://www.youtube.com/watch?v=dhEF5pfYmvc
Part II (Zap Q-learning)
https://www.youtube.com/watch?v=Y3w8f1xIb6s
Big 2017 survey on variance in SA:
Fastest convergence for Q-learning
https://arxiv.org/abs/1707.03770
You will find the infinite-variance Q result there.
Our NIPS 2017 paper is distilled from this.
Opendatabay - Open Data Marketplace.pptx - Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Notes on primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
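The "skip converged vertices" idea can be sketched in a few lines. The following is an illustrative Python example (not the STICD implementation from the report), assuming the graph is given as per-vertex out-link lists and has no dangling vertices; a vertex's rank is frozen once its change falls below the tolerance.

```python
import numpy as np

def pagerank_skip_converged(out_links, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank that skips vertices whose rank has converged.
    out_links[u] lists the vertices u points to; the graph is assumed to have
    no dangling vertices (every vertex has at least one out-link)."""
    n = len(out_links)
    rank = np.full(n, 1.0 / n)
    converged = np.zeros(n, dtype=bool)
    # Reverse adjacency: a vertex's new rank depends on its in-neighbours.
    in_links = [[] for _ in range(n)]
    for u, targets in enumerate(out_links):
        for v in targets:
            in_links[v].append(u)
    out_deg = np.array([len(t) for t in out_links], dtype=float)
    base = (1.0 - damping) / n
    for _ in range(max_iter):
        new_rank = rank.copy()
        for v in range(n):
            if converged[v]:
                continue  # rank already stable: skip the work for this vertex
            new_rank[v] = base + damping * sum(rank[u] / out_deg[u] for u in in_links[v])
            if abs(new_rank[v] - rank[v]) < tol:
                converged[v] = True
        rank = new_rank
        if converged.all():
            break
    return rank

# Example: a small 4-vertex graph given as out-link lists.
print(pagerank_skip_converged([[1], [2], [3, 0], [0]]))
```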
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group ("MCG") expects demand to grow and supply to evolve, facilitated through institutional investment rotating out of offices and into work from home ("WFH") arrangements, alongside the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
A Tutorial of the EM-algorithm and Its Application to Outlier Detection
1. A Tutorial of the EM-algorithm and its Application to Outlier Detection
Jaehyeong Ahn
Konkuk University
jayahn0104@gmail.com
September 9, 2020
2. Table of Contents
1 EM-algorithm: An Overview
2 Proof for EM-algorithm
Non-decreasing (Ascent Property)
Convergence
Local Maximum
3 An Example: Gaussian Mixture Model (GMM)
4 Application to Outlier Detection
Directly Used
Indirectly Used
5 Summary
6 References
4. EM-algorithm: An Overview
Introduction
The EM-algorithm (Expectation-Maximization algorithm) is an iterative procedure for computing the maximum likelihood estimator (MLE) when only a subset of the data is available (when the model depends on an unobserved latent variable)
The first proper theoretical study of the algorithm was done by Dempster, Laird, and Rubin (1977) [1]
The EM-algorithm is widely used in various research areas when unobserved latent variables are included in the model
5. EM-algorithm: An Overview
Data
$Y = (y_1, \dots, y_N)^T$: observed data
Model
Assume that $Y$ is dependent on some unobserved latent variable $Z$, where $Z = (z_1, \dots, z_N)^T$
When $Z$ is assumed to be a discrete random variable:
$r_{ik} = P(z_i = k \mid Y, \theta)$, $k = 1, \dots, K$
$z_i^* = \arg\max_k r_{ik}$
When $Z$ is assumed to be a continuous random variable:
$r_i = f_{Z|Y,\Theta}(z_i \mid Y, \theta)$
Log-likelihood
$\ell_{\mathrm{obs}}(\theta; Y) = \log f_{Y|\Theta}(Y \mid \theta) = \log \int f_{Y,Z|\Theta}(Y, Z \mid \theta)\, dz$
6. EM-algorithm: An Overview
Goal
By maximizing $\ell_{\mathrm{obs}}(\theta; Y)$ w.r.t. $\theta$:
Find $\hat\theta$ which satisfies $\partial_{\theta_j} \ell_{\mathrm{obs}}(\theta; Y)\big|_{\theta=\hat\theta} = 0$, for $j = 1, \dots, J$
Compute the estimated value $\hat r_{ik} = f_{Z|Y,\Theta}(z_i \mid Y, \hat\theta)$
Problem
The latent variable $Z$ is not observable
It is difficult to compute the integral in $\ell_{\mathrm{obs}}(\theta; Y)$
Thus the parameters cannot be estimated separately
Solution
Assume the latent variable $Z$ is observed
Define the complete data
Maximize the complete log-likelihood
7. EM-algorithm: An Overview
Data
$Y = (y_1, \dots, y_N)^T$: observed data
$Z = (z_1, \dots, z_N)^T$: unobserved (latent) variable, assumed to be observed
$X = (Y, Z)$: complete data
Model
$r_i = f_{Z|Y,\Theta}(z_i \mid Y, \theta)$
Complete log-likelihood
$\ell_C(\theta; X) = \log f_{X|\Theta}(X \mid \theta) = \log f_{X|\Theta}(Y, Z \mid \theta)$
Log-likelihood for observed data
$\ell_{\mathrm{obs}}(\theta; Y) = \log f_{Y|\Theta}(Y \mid \theta) = \log \int f_{X|\Theta}(Y, Z \mid \theta)\, dz$
8. EM-algorithm: An Overview
Estimation Idea
Maximize $\ell_C(\theta; X)$ instead of maximizing $\ell_{\mathrm{obs}}(\theta; Y)$, since the parameters in $\ell_C(\theta; X)$ can be decoupled
E-step
Compute the Expected Complete Log-Likelihood (ECLL) by taking the conditional expectation of $\ell_C(\theta; X)$ given $Y$ and the current value of the parameters $\theta^{(s)}$
This step estimates the realizations of $z$ (since the value of $z$ is not identified)
M-step
Maximize the computed ECLL w.r.t. $\theta$
Update the estimate of $\Theta$ to the $\theta^{(s+1)}$ which maximizes the current ECLL
Iterate this procedure until it converges
9. EM-algorithm: An Overview
Estimation
E-step
Take the conditional expectation of $\ell_C(\theta; X)$ given $Y$ and $\theta = \theta^{(s)}$
For $\theta^{(0)}$, set an initial guess
Compute $Q(\theta \mid \theta^{(s)}) = E_{\theta^{(s)}}\big[\ell_C(\theta; X) \mid Y\big]$
M-step
Maximize $Q(\theta \mid \theta^{(s)})$ w.r.t. $\theta$
Put $\theta^{(s+1)} = \arg\max_\theta Q(\theta \mid \theta^{(s)})$
Iterate until the following inequality is satisfied:
$\|\theta^{(s+1)} - \theta^{(s)}\| < \epsilon$, where $\epsilon$ denotes a sufficiently small value
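A minimal, generic Python sketch of this E-step/M-step loop (my own illustration, not code from the slides): e_step and m_step are user-supplied placeholders, where e_step returns whatever expected sufficient statistics are needed to build $Q(\theta \mid \theta^{(s)})$ and m_step returns its maximizer; the stopping rule follows the slide.

```python
import numpy as np

def em(theta0, e_step, m_step, eps=1e-8, max_iter=500):
    """Generic EM loop: E-step computes expectations given the current theta^(s),
    M-step maximizes the resulting Q(theta | theta^(s)),
    and the loop stops when ||theta^(s+1) - theta^(s)|| < eps."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)       # E-step: expected sufficient statistics
        theta_new = m_step(stats)   # M-step: argmax of Q(. | theta)
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta
```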
10. EM-algorithm: An Overview
Immediate Question
Does maximizing the sequence $Q(\theta \mid \theta^{(s)})$ lead to maximizing $\ell_{\mathrm{obs}}(\theta; Y)$?
This question will be answered in the following slides (Proof for EM-algorithm) in 3 parts:
Non-decreasing / Convergence / Local maximum
11. Proof for EM-algorithm
Proof for EM-algorithm
12. Proof for EM-algorithm: Non-decreasing
Non-decreasing (Ascent Property)
Proposition 1.
The sequence $\ell_{\mathrm{obs}}(\theta^{(s)}; Y)$ in the EM-algorithm is non-decreasing
Proof
We write $X = (Y, Z)$ for the complete data
Then
$f_{Z|Y,\Theta}(Z \mid Y, \theta) = \dfrac{f_{X|\Theta}(Y, Z \mid \theta)}{f_{Y|\Theta}(Y \mid \theta)}$
Hence,
$\ell_{\mathrm{obs}}(\theta; Y) = \log f_{Y|\Theta}(Y \mid \theta) = \log f_{X|\Theta}((Y, Z) \mid \theta) - \log f_{Z|Y,\Theta}(Z \mid Y, \theta)$
13. Proof for EM-algorithm: Non-decreasing
Taking the conditional expectation given $Y$ and $\Theta = \theta^{(s)}$ on both sides yields
$\ell_{\mathrm{obs}}(\theta; Y) = E_{\theta^{(s)}}\big[\ell_{\mathrm{obs}}(\theta; Y) \mid Y\big]$
$\quad = E_{\theta^{(s)}}\big[\log f_{X|\Theta}((Y, Z) \mid \theta) \mid Y\big] - E_{\theta^{(s)}}\big[\log f_{Z|Y,\Theta}(Z \mid Y, \theta) \mid Y\big]$
$\quad = Q(\theta \mid \theta^{(s)}) - H(\theta \mid \theta^{(s)})$
Where
$Q(\theta \mid \theta^{(s)}) = E_{\theta^{(s)}}\big[\log f_{X|\Theta}((Y, Z) \mid \theta) \mid Y\big]$
$H(\theta \mid \theta^{(s)}) = E_{\theta^{(s)}}\big[\log f_{Z|Y,\Theta}(Z \mid Y, \theta) \mid Y\big]$
Then we have
$\ell_{\mathrm{obs}}(\theta^{(s+1)}; Y) - \ell_{\mathrm{obs}}(\theta^{(s)}; Y) = \big[Q(\theta^{(s+1)} \mid \theta^{(s)}) - Q(\theta^{(s)} \mid \theta^{(s)})\big] - \big[H(\theta^{(s+1)} \mid \theta^{(s)}) - H(\theta^{(s)} \mid \theta^{(s)})\big]$
14. Proof for EM-algorithm: Non-decreasing
Recall that
$\ell_{\mathrm{obs}}(\theta^{(s+1)}; Y) - \ell_{\mathrm{obs}}(\theta^{(s)}; Y) = \underbrace{\big[Q(\theta^{(s+1)} \mid \theta^{(s)}) - Q(\theta^{(s)} \mid \theta^{(s)})\big]}_{(I)} - \underbrace{\big[H(\theta^{(s+1)} \mid \theta^{(s)}) - H(\theta^{(s)} \mid \theta^{(s)})\big]}_{(II)}$
(I) is non-negative:
$\theta^{(s+1)} = \arg\max_\theta Q(\theta \mid \theta^{(s)})$
Hence $Q(\theta^{(s+1)} \mid \theta^{(s)}) \ge Q(\theta^{(s)} \mid \theta^{(s)})$
Thus (I) $\ge 0$
15. Proof for EM-algorithm: Non-decreasing
(II) is non-positive:
Using Jensen's inequality for concave functions (log is concave)
Theorem 1. Jensen's inequality
Let $f$ be a concave function, and let $X$ be a random variable. Then $E[f(X)] \le f(E[X])$
$H(\theta^{(s+1)} \mid \theta^{(s)}) - H(\theta^{(s)} \mid \theta^{(s)}) = E_{\theta^{(s)}}\left[\log \dfrac{f_{Z|Y,\Theta}(Z \mid Y, \theta^{(s+1)})}{f_{Z|Y,\Theta}(Z \mid Y, \theta^{(s)})} \,\Big|\, Y\right]$
$\quad \le \log E_{\theta^{(s)}}\left[\dfrac{f_{Z|Y,\Theta}(Z \mid Y, \theta^{(s+1)})}{f_{Z|Y,\Theta}(Z \mid Y, \theta^{(s)})} \,\Big|\, Y\right]$
$\quad = \log \int \dfrac{f_{Z|Y,\Theta}(z \mid Y, \theta^{(s+1)})}{f_{Z|Y,\Theta}(z \mid Y, \theta^{(s)})}\, f_{Z|Y,\Theta}(z \mid Y, \theta^{(s)})\, dz$
$\quad = \log 1 = 0$
Hence $H(\theta^{(s+1)} \mid \theta^{(s)}) \le H(\theta^{(s)} \mid \theta^{(s)})$
16. Proof for EM-algorithm: Non-decreasing
Theorem 1. Jensen's inequality
Let $f$ be a concave function, and let $X$ be a random variable. Then $E[f(X)] \le f(E[X])$
[Figure: Jensen's inequality for a concave function [3]]
17. Proof for EM-algorithm: Non-decreasing
Recall that
$\ell_{\mathrm{obs}}(\theta^{(s+1)}; Y) - \ell_{\mathrm{obs}}(\theta^{(s)}; Y) = \underbrace{\big[Q(\theta^{(s+1)} \mid \theta^{(s)}) - Q(\theta^{(s)} \mid \theta^{(s)})\big]}_{(I)} - \underbrace{\big[H(\theta^{(s+1)} \mid \theta^{(s)}) - H(\theta^{(s)} \mid \theta^{(s)})\big]}_{(II)}$
We've proven that (I) $\ge 0$ and (II) $\le 0$
This shows $\ell_{\mathrm{obs}}(\theta^{(s+1)}; Y) - \ell_{\mathrm{obs}}(\theta^{(s)}; Y) \ge 0$
Thus the sequence $\ell_{\mathrm{obs}}(\theta^{(s)}; Y)$ in the EM-algorithm is non-decreasing
18. Proof for EM-algorithm: Convergence
Convergence
We will show that the sequence $\theta^{(s)}$ converges to some $\theta^*$ with $\ell(\theta^*; y) = \ell^*$, the limit of $\ell(\theta^{(s)})$
Assumption
$\Omega$ is a subset of $\mathbb{R}^k$
$\Omega_{\theta_0} = \{\theta \in \Omega : \ell(\theta; y) \ge \ell(\theta_0; y)\}$ is compact for any $\ell(\theta_0; y) > -\infty$
$\ell(\theta; x)$ is continuous and differentiable in the interior of $\Omega$
19. Proof for EM-algorithm: Convergence
Theorem 2
Suppose that $Q(\theta \mid \phi)$ is continuous in both $\theta$ and $\phi$. Then all limit points of any instance $\{\theta^{(s)}\}$ of the EM algorithm are stationary points, i.e. $\theta^* = \arg\max_\theta Q(\theta \mid \theta^*)$, and $\ell(\theta^{(s)}; y)$ converges monotonically to some value $\ell^* = \ell(\theta^*; y)$ for some stationary point $\theta^*$
Theorem 3
Assume the hypothesis of Theorem 2. Suppose in addition that $\partial_\theta Q(\theta \mid \phi)$ is continuous in $\theta$ and $\phi$. Then $\theta^{(s)}$ converges to a stationary point $\theta^*$ with $\ell(\theta^*; y) = \ell^*$, the limit of $\ell(\theta^{(s)})$, if either
$\{\theta : \ell(\theta; y) = \ell^*\} = \{\theta^*\}$, or
$|\theta^{(s+1)} - \theta^{(s)}| \to 0$ and $\{\theta : \ell(\theta; y) = \ell^*\}$ is discrete
20. Proof for EM-algorithm: Local Maximum
Local Maximum
Recall that
$\ell_C(\theta; X) = \log f_{X|\Theta}(X \mid \theta) = \log f_{Z|Y,\Theta}(Z \mid Y, \theta) + \log f_{Y|\Theta}(Y \mid \theta)$
Then
$Q(\theta \mid \theta^{(s)}) = \int \log f_{Z|Y,\Theta}(z \mid Y, \theta)\, f_{Z|Y,\Theta}(z \mid Y, \theta^{(s)})\, dz + \ell_{\mathrm{obs}}(\theta; Y)$
Differentiating w.r.t. $\theta_j$ and setting the result equal to 0 in order to maximize $Q$ gives
$0 = \partial_{\theta_j} Q(\theta \mid \theta^{(s)}) = \int \dfrac{\partial_{\theta_j} f_{Z|Y,\Theta}(z \mid Y, \theta)}{f_{Z|Y,\Theta}(z \mid Y, \theta)}\, f_{Z|Y,\Theta}(z \mid Y, \theta^{(s)})\, dz + \partial_{\theta_j} \ell_{\mathrm{obs}}(\theta; Y)$
21. Proof for EM-algorithm: Local Maximum
Recall that
$0 = \partial_{\theta_j} Q(\theta \mid \theta^{(s)}) = \int \dfrac{\partial_{\theta_j} f_{Z|Y,\Theta}(z \mid Y, \theta)}{f_{Z|Y,\Theta}(z \mid Y, \theta)}\, f_{Z|Y,\Theta}(z \mid Y, \theta^{(s)})\, dz + \partial_{\theta_j} \ell_{\mathrm{obs}}(\theta; Y)$
If $\theta^{(s)} \to \theta^*$ then we have for $\theta^*$ that (with $j = 1, \dots, J$)
$0 = \partial_{\theta_j} Q(\theta^* \mid \theta^*)$
$\quad = \int \dfrac{\partial_{\theta_j} f_{Z|Y,\Theta}(z \mid Y, \theta^*)}{f_{Z|Y,\Theta}(z \mid Y, \theta^*)}\, f_{Z|Y,\Theta}(z \mid Y, \theta^*)\, dz + \partial_{\theta_j} \ell_{\mathrm{obs}}(\theta^*; Y)$
$\quad = \partial_{\theta_j} \int f_{Z|Y,\Theta}(z \mid Y, \theta^*)\, dz + \partial_{\theta_j} \ell_{\mathrm{obs}}(\theta^*; Y)$
$\quad = \partial_{\theta_j} \ell_{\mathrm{obs}}(\theta^*; Y)$
22. An Example: Gaussian Mixture Model (GMM)
An Example: Gaussian Mixture Model (GMM)
23. An Example: Gaussian Mixture Model (GMM)
Introduction
Mixture models make use of latent variables to model different parameters for different groups (or clusters) of data points
For a point $y_i$, let the cluster to which that point belongs be labeled $z_i$, where $z_i$ is latent, or unobserved
In this example, we will assume our observable features $y_i$ to be distributed as a Gaussian, chosen based on the cluster that point $y_i$ is associated with
24. An Example: Gaussian Mixture Model (GMM)
[Figure: Gaussian Mixture Model Example [5]]
25. An Example: Gaussian Mixture Model (GMM)
Data
$Y = (y_1, \dots, y_N)^T$: observed data, with $y_i \in \mathbb{R}^p$ for all $i$
$Z = (z_1, \dots, z_N)^T$: unobserved (latent) variable, assumed to be observed, with $z_i \in \{1, 2, \dots, K\}$ for all $i$
$X = (Y, Z)$: complete data
26. An Example: Gaussian Mixture Model (GMM)
Distribution Assumption
$z_i \sim \mathrm{Mult}(\pi)$, $\pi \in \mathbb{R}^K$
$y_i \mid z_i = k \sim N_p(\mu_k, \Sigma_k)$
Model
$r_{ik} \stackrel{\mathrm{def}}{=} p(z_i = k \mid y_i, \theta) = \dfrac{p(y_i \mid z_i = k, \theta)\, p(z_i = k \mid \theta)}{\sum_{k'=1}^{K} p(y_i \mid z_i = k', \theta)\, p(z_i = k' \mid \theta)}$
$z_i^* = \arg\max_k r_{ik}$
$\theta$ denotes the general parameter; $\theta = \{\pi, \mu, \Sigma\}$
27. An Example: Gaussian Mixture Model (GMM)
Notation Simplification
Write $z_i = k$ as
$z_i = (z_i^1, \dots, z_i^k, \dots, z_i^K)^T = (0, \dots, 1, \dots, 0)^T$
where $z_i^j = I(j = k) \in \{0, 1\}$ for $j = 1, \dots, K$
Using this:
$p(z_i \mid \pi) = \prod_{k=1}^{K} \pi_k^{z_i^k}$
$p(y_i \mid z_i, \theta) = \prod_{k=1}^{K} N(\mu_k, \Sigma_k)^{z_i^k} = \prod_{k=1}^{K} \phi(y_i \mid \mu_k, \Sigma_k)^{z_i^k}$
28. An Example: Gaussian Mixture Model (GMM)
Log-likelihood for observed data
$\ell_{\mathrm{obs}}(\theta; Y) = \log f_{Y|\Theta}(Y \mid \theta)$
$\quad = \sum_{i=1}^{N} \log p(y_i \mid \theta)$
$\quad = \sum_{i=1}^{N} \log \sum_{z_i} p(y_i, z_i \mid \theta)$
$\quad = \sum_{i=1}^{N} \log \sum_{z_i \in Z} \prod_{k=1}^{K} \pi_k^{z_i^k}\, N(\mu_k, \Sigma_k)^{z_i^k}$
This does not decouple the likelihood because the log cannot be 'pushed' inside the summation
29. An Example: Gaussian Mixture Model (GMM)
Complete log-likelihood
$\ell_C(\theta; X) = \log f_{X|\Theta}(Y, Z \mid \theta)$
$\quad = \log f_{Y|Z,\Theta}(Y \mid Z, \theta) + \log f_{Z|\Theta}(Z \mid \theta)$
$\quad = \sum_{i=1}^{N} \left[ \log \prod_{k=1}^{K} \phi(y_i \mid \mu_k, \Sigma_k)^{z_i^k} + \log \prod_{k=1}^{K} \pi_k^{z_i^k} \right]$
$\quad = \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ z_i^k \log \phi(y_i \mid \mu_k, \Sigma_k) + z_i^k \log \pi_k \right]$
The parameters are now decoupled, since we can estimate $\pi_k$ and $\mu_k, \Sigma_k$ separately
30. An Example: Gaussian Mixture Model (GMM)
Estimation
E-step
$Q(\theta \mid \theta^{(s)}) = E_{\theta^{(s)}}\left[ \sum_{i=1}^{N} \sum_{k=1}^{K} \big( z_i^k \log \phi(y_i \mid \mu_k, \Sigma_k) + z_i^k \log \pi_k \big) \,\Big|\, Y \right]$
$\quad = \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ E_{\theta^{(s)}}\big[z_i^k \mid Y\big] \log \phi(y_i \mid \mu_k, \Sigma_k) + E_{\theta^{(s)}}\big[z_i^k \mid Y\big] \log \pi_k \right]$
31. An Example: Gaussian Mixture Model (GMM)
E-step
Note that $z_i^k \mid Y \sim \mathrm{Bernoulli}\big(p(z_i^k = 1 \mid Y, \theta)\big)$
Hence
$r_{ik}^{(s)} \stackrel{\mathrm{def}}{=} E_{\theta^{(s)}}\big[z_i^k \mid Y\big] = p(z_i^k = 1 \mid Y, \theta^{(s)})$
$\quad = \dfrac{p(z_i^k = 1, y_i \mid \theta^{(s)})}{\sum_{k'=1}^{K} p(z_i^{k'} = 1, y_i \mid \theta^{(s)})}$
$\quad = \dfrac{p(y_i \mid z_i^k = 1, \theta^{(s)})\, p(z_i^k = 1 \mid \theta^{(s)})}{\sum_{k'=1}^{K} p(y_i \mid z_i^{k'} = 1, \theta^{(s)})\, p(z_i^{k'} = 1 \mid \theta^{(s)})}$
$\quad = \dfrac{\phi(y_i \mid \mu_k^{(s)}, \Sigma_k^{(s)})\, \pi_k^{(s)}}{\sum_{k'=1}^{K} \phi(y_i \mid \mu_{k'}^{(s)}, \Sigma_{k'}^{(s)})\, \pi_{k'}^{(s)}}$
32. An Example: Gaussian Mixture Model (GMM)
M-step
Recall that
$Q(\theta \mid \theta^{(s)}) = \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ r_{ik}^{(s)} \log \phi(y_i \mid \mu_k, \Sigma_k) + r_{ik}^{(s)} \log \pi_k \right]$
Set $\theta^{(s+1)} = \arg\max_\theta Q(\theta \mid \theta^{(s)})$:
$\pi_k^{(s+1)} = \dfrac{\sum_{i=1}^{N} r_{ik}^{(s)}}{N}$
$\mu_k^{(s+1)} = \dfrac{\sum_{i=1}^{N} r_{ik}^{(s)} y_i}{\sum_{i=1}^{N} r_{ik}^{(s)}}$
$\Sigma_k^{(s+1)} = \dfrac{\sum_{i=1}^{N} r_{ik}^{(s)} \big(y_i - \mu_k^{(s+1)}\big)\big(y_i - \mu_k^{(s+1)}\big)^T}{\sum_{i=1}^{N} r_{ik}^{(s)}}$
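The E-step responsibilities and M-step updates above translate almost line for line into code. Below is a compact NumPy/SciPy sketch of EM for a Gaussian mixture written for illustration (it is not the code behind the slides); it assumes the data matrix Y has shape (N, p) with p >= 2 and adds a small ridge to each covariance for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(Y, K, n_iter=200, eps=1e-6, seed=0):
    """EM for a K-component Gaussian mixture, following the slide's updates."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    pi = np.full(K, 1.0 / K)                       # mixing weights pi_k
    mu = Y[rng.choice(N, K, replace=False)]        # initial means drawn from the data
    Sigma = np.array([np.cov(Y.T) + 1e-6 * np.eye(p) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r_ik = pi_k * phi(y_i | mu_k, Sigma_k) / sum over k'
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(Y, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for pi, mu, Sigma
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ Y) / Nk[:, None]
        for k in range(K):
            d = Y - mu[k]
            Sigma[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(p)
        ll = np.log(dens.sum(axis=1)).sum()        # observed-data log-likelihood
        if ll - prev_ll < eps:                     # ascent property: ll never decreases
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```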
33. An Example: Gaussian Mixture Model (GMM)
Iterate until the following inequality is satisfied:
$\|\theta^{(s+1)} - \theta^{(s)}\| < \epsilon$
Let $\hat\theta = \theta^{(s)}$
What we get
$\hat\theta = (\hat\pi, \hat\mu, \hat\Sigma)$
$\hat r_{ik} = p(z_i = k \mid Y, \hat\theta) = \dfrac{\phi(y_i \mid \hat\mu_k, \hat\Sigma_k)\, \hat\pi_k}{\sum_{k'=1}^{K} \phi(y_i \mid \hat\mu_{k'}, \hat\Sigma_{k'})\, \hat\pi_{k'}}$
$\hat z_i = \arg\max_k \hat r_{ik}$
34. An Example: Gaussian Mixture Model (GMM)
[Figure: Gaussian Mixture Model Fitting Example [5]]
35. Application to Outlier Detection
Application to Outlier Detection
36. Application to Outlier Detection: Directly Used
Basic Idea
For cases in which the data may have many different clusters with different orientations
Assume a specific form of the generative model (e.g., a mixture of Gaussians)
Fit the model to the data (usually to normal behavior), estimating the parameters with the EM-algorithm
Fit this model to the unseen (test) data and get the estimated fit (joint) probabilities
Data points that fit the distribution will have high fit (joint) probabilities, whereas anomalies (outliers) will have very low fit (joint) probabilities
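To make the "directly used" recipe concrete, here is a small illustrative Python sketch (not from the slides) using scikit-learn's GaussianMixture, which fits the mixture by EM as above: the model is fitted to mostly-normal training data, test points are scored by their log density, and the lowest-density points are flagged. The quantile threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_outlier_flags(Y_train, Y_test, K=3, quantile=0.99, seed=0):
    """Fit a Gaussian mixture (EM under the hood) to mostly-normal training data,
    score test points by their log density, and flag the lowest-density points."""
    gmm = GaussianMixture(n_components=K, covariance_type='full', random_state=seed)
    gmm.fit(Y_train)                           # parameters estimated with EM
    log_dens = gmm.score_samples(Y_test)       # log fit (joint) probability per point
    scores = -log_dens                         # higher score = more anomalous
    threshold = np.quantile(scores, quantile)  # e.g. flag the top 1% as outliers
    return scores, scores > threshold
```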
37. Application to Outlier Detection: Directly Used
Simulation in R
https://rpubs.com/JayAhn/650433
38. Application to Outlier Detection: Indirectly Used
Introduction
Interestingly, EM-algorithms can also be used as a final step after many such outlier detection algorithms, for converting the scores into probabilities [7]
Converting the outlier scores into well-calibrated probability estimates is more favorable for several reasons:
1 The probability estimates allow us to select the appropriate threshold for declaring outliers using a Bayesian risk model
2 The probability estimates obtained from individual models can be aggregated to build an ensemble outlier detection framework
39. Application to Outlier Detection: Indirectly Used
Motivation
Since the outlier detection problem is mainly set in an unsupervised learning environment, it is hard to select an appropriate threshold for declaring outliers
Every outlier detection model outputs a different outlier score on a different scale, which makes constructing an outlier ensemble model difficult
40. Application to Outlier Detection: Indirectly Used
Outlier Score Distributions
[Figure: Outlier Score Distributions [7]; (a) outlier score distribution for normal examples, (b) outlier score distribution for outliers]
41. Application to Outlier Detection: Indirectly Used
Basic Idea
Treat an outlier score as a univariate random variable
Assume the label of outlierness is an unobserved latent variable
Estimate the posterior probabilities for the latent variable with the EM-algorithm, in one of two ways:
1 Model the posterior probability for outlier scores using a sigmoid function
2 Model the score distribution as a mixture model (a mixture of an exponential and a Gaussian) and calculate the posterior probabilities via Bayes' rule (a sketch of this option is given after this slide)
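As an illustration of the second option, here is a small univariate Python sketch of the mixture idea in [7]: outlier scores of normal points are modelled with an exponential distribution and those of outliers with a Gaussian, the mixture is fitted by EM, and the posterior outlier probability follows from Bayes' rule. This is my own simplified rendering (assuming non-negative scores), not the authors' code.

```python
import numpy as np
from scipy.stats import expon, norm

def calibrate_scores(s, n_iter=200, tol=1e-8):
    """EM for a two-component mixture of outlier scores s >= 0:
    normal points ~ Exponential(scale) (low scores),
    outliers      ~ Gaussian(mu, sigma) (high scores).
    Returns P(outlier | s_i) for every score."""
    s = np.asarray(s, dtype=float)
    alpha = 0.05                                   # initial P(outlier)
    scale = s.mean()                               # exponential scale parameter
    mu, sigma = s.max(), s.std() + 1e-6            # Gaussian for the upper tail
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior that each score comes from the outlier component
        p_out = alpha * norm.pdf(s, mu, sigma)
        p_norm = (1 - alpha) * expon.pdf(s, scale=scale)
        r = p_out / (p_out + p_norm + 1e-300)
        # M-step: update the mixing weight and the component parameters
        alpha = r.mean()
        scale = ((1 - r) * s).sum() / ((1 - r).sum() + 1e-300)
        mu = (r * s).sum() / (r.sum() + 1e-300)
        sigma = np.sqrt((r * (s - mu) ** 2).sum() / (r.sum() + 1e-300)) + 1e-6
        ll = np.log(p_out + p_norm + 1e-300).sum()
        if ll - prev < tol:
            break
        prev = ll
    return r    # calibrated P(outlier | score)
```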
42. Application to Outlier Detection: Indirectly Used
Bayesian Risk Model
The Bayesian risk model minimizes the overall risk associated with some cost function
For example, in the case of a two-class problem:
The Bayes decision rule for a given observation $x$ is to decide $\omega_1$ if
$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid x)$
where $\omega_1, \omega_2$ are the two classes and $\lambda_{ij}$ is the cost of misclassifying $\omega_j$ as $\omega_i$
Since $P(\omega_2 \mid x) = 1 - P(\omega_1 \mid x)$, the preceding inequality suggests that the appropriate outlier threshold is automatically determined once the cost functions are known
In the case of a zero-one loss function, the threshold which minimizes the overall risk is 0.5, where
$\lambda_{ij} = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \ne j \end{cases}$
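A tiny worked example of this decision rule (illustrative only, not from the slides): given the calibrated posterior $P(\omega_1 \mid x)$ of being an outlier and a cost matrix $\lambda$, declare an outlier when $(\lambda_{21} - \lambda_{11})P(\omega_1 \mid x) > (\lambda_{12} - \lambda_{22})(1 - P(\omega_1 \mid x))$; with zero-one loss this reduces to the 0.5 threshold.

```python
import numpy as np

def declare_outlier(p_outlier, lam=((0.0, 1.0), (1.0, 0.0))):
    """Two-class Bayes decision rule: omega_1 = outlier, omega_2 = normal.
    lam[i][j] is the cost of deciding omega_(i+1) when the truth is omega_(j+1)."""
    lam = np.asarray(lam, dtype=float)
    lhs = (lam[1, 0] - lam[0, 0]) * p_outlier          # left-hand side of the rule
    rhs = (lam[0, 1] - lam[1, 1]) * (1.0 - p_outlier)  # right-hand side of the rule
    return lhs > rhs   # with zero-one loss this is simply p_outlier > 0.5

print(declare_outlier(0.7))   # True: 0.7 > 0.5 under zero-one loss
```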
44. Summary
These slides gave an overview of the EM-algorithm and its application to outlier detection
We've reviewed the basic procedure of the EM-algorithm for estimating the parameters of a model which has an unobserved latent variable
We have also shown that the log-likelihood for the observed data is maximized by the EM-algorithm, through 3 parts: Non-decreasing / Convergence / Local Maximum
Furthermore, we've seen that the EM-algorithm can be applied to outlier detection not only directly but also indirectly
45. References
[1] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society: Series B (Methodological) 39, no. 1 (1977): 1-22.
[2] https://www.math.kth.se/matstat/gru/Statistical%20inference/Lecture8.pdf
[3] https://www.cs.cmu.edu/~epxing/Class/10708-17/notes-17/10708-scribe-lecture8.pdf
[4] http://www2.stat.duke.edu/~sayan/Sta613/2018/lec/emnotes.pdf
[5] Contributions to collaborative clustering and its potential applications on very high resolution satellite images
46. References
[6] Kriegel, Hans-Peter, Peer Kroger, Erich Schubert, and Arthur Zimek. "Interpreting and unifying outlier scores." In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 13-24. Society for Industrial and Applied Mathematics, 2011.
[7] Gao, Jing, and Pang-Ning Tan. "Converting output scores from outlier detection algorithms into probability estimates." In Sixth International Conference on Data Mining (ICDM'06), pp. 212-221. IEEE, 2006.