This document provides an introduction to Dirichlet processes. It begins by motivating the need for nonparametric clustering when the number of clusters in the data is unknown. It then discusses Dirichlet processes from multiple perspectives: samples from a Dirichlet process, the Chinese restaurant process representation, the stick-breaking construction, and the formal definition. It also covers Dirichlet process mixtures and common inference techniques such as Markov chain Monte Carlo and variational inference.
Gentle Introduction to Dirichlet Processes
1. Dirichlet Processes
A gentle tutorial
SELECT Lab Meeting
October 14, 2008
Khalid El-Arini
2. Motivation
We are given a data set, and are told that it was generated from a mixture of Gaussian distributions. Unfortunately, no one has any idea how many Gaussians produced the data.
5. What to do?
We can guess the number of clusters, run Expectation Maximization (EM) for Gaussian Mixture Models, look at the results, and then try again…
We can run hierarchical agglomerative clustering, and cut the tree at a visually appealing level…
We want to cluster the data in a statistically principled manner, without resorting to hacks.
6. Other motivating examples
Brain Imaging: Model an unknown number of spatial activation patterns in fMRI images [Kim and Smyth, NIPS 2006]
Topic Modeling: Model an unknown number of topics across several corpora of documents [Teh et al. 2006]
…
7. Overview
Dirichlet distribution, and Dirichlet Process introduction
Dirichlet Processes from different perspectives
Samples from a Dirichlet Process
Chinese Restaurant Process representation
Stick Breaking
Formal Definition
Dirichlet Process Mixtures
Inference
8. The Dirichlet Distribution
Let θ = (θ1, θ2, …, θm) with θk ≥ 0 and Σk θk = 1. We write θ ~ Dirichlet(α1, …, αm), with density
P(\theta_1, \theta_2, \dots, \theta_m) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{m} \theta_k^{\alpha_k - 1}
Samples from the distribution lie in the (m − 1)-dimensional probability simplex
9. The Dirichlet Distribution
The Dirichlet is a distribution over possible parameter vectors for a multinomial distribution, and is the conjugate prior for the multinomial.
The Beta distribution is the special case of a Dirichlet for 2 dimensions.
Thus, it is in fact a "distribution over distributions."
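As a quick sanity check of the simplex property, Dirichlet samples can be drawn by normalizing independent Gamma draws (a standard construction; this sketch uses only the Python standard library, and the α values chosen are arbitrary):

```python
import random

def sample_dirichlet(alphas):
    """Draw one sample from Dirichlet(alphas) by normalizing
    independent Gamma(alpha_k, 1) draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Each sample lies on the probability simplex: non-negative entries, summing to 1.
theta = sample_dirichlet([1.0, 2.0, 3.0])
```

Averaging many such draws recovers the mean α_k / Σ_j α_j, which is one way to check the construction empirically.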
10. Dirichlet Process
A Dirichlet Process is also a distribution over distributions. Let G be Dirichlet Process distributed:
G ~ DP(α, G0)
G0 is a base distribution
α is a positive scaling parameter
G is a random probability measure that has the same support as G0
12. Dirichlet Process
G ~ DP(α, G0)
G0 is continuous, so the probability that any two samples are equal is precisely zero.
However, G is a discrete distribution, made up of a countably infinite number of point masses [Blackwell].
Therefore, there is always a non-zero probability of two samples colliding.
13. Overview
Dirichlet distribution, and Dirichlet Process introduction
Dirichlet Processes from different perspectives
Samples from a Dirichlet Process
Chinese Restaurant Process representation
Stick Breaking
Formal Definition
Dirichlet Process Mixtures
Inference
14. Samples from a Dirichlet Process
G ~ DP(α, G0)
Xn | G ~ G for n = {1, …, N} (iid given G)
Marginalizing out G introduces dependencies between the Xn variables.
[Graphical model: G → Xn, with Xn inside a plate over n = 1…N]
15. Samples from a Dirichlet Process
Assume we view these variables in a specific order, and are interested in the behavior of Xn given the previous n − 1 observations.
Let there be K unique values X*1, …, X*K among the previous variables, with nk the number of previous draws equal to X*k. Then
X_n \mid X_1, \dots, X_{n-1} \sim \frac{1}{\alpha + n - 1}\left(\alpha G_0 + \sum_{k=1}^{K} n_k\, \delta_{X^*_k}\right)
16. Samples from a Dirichlet Process
By the chain rule, the joint distribution factors into P(partition) · P(draws).
Notice that this formulation of the joint distribution does not depend on the order in which we consider the variables.
17. Samples from a Dirichlet Process
Letting there be K unique values for the variables, the joint distribution can be rewritten in terms of the counts n1, …, nK of each unique value.
18. Blackwell-MacQueen Urn Scheme
G ~ DP(α, G0)
Xn | G ~ G
Assume that G0 is a distribution over colors, and that each Xn represents the color of a single ball placed in the urn.
Start with an empty urn. On step n:
With probability proportional to α, draw Xn ~ G0, and add a ball of that color to the urn.
With probability proportional to n − 1 (i.e., the number of balls currently in the urn), pick a ball at random from the urn. Record its color as Xn, and return the ball to the urn, along with a new one of the same color.
[Blackwell and MacQueen, 1973]
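The urn scheme above can be simulated directly. A minimal sketch (the function name and the choice of G0 below are illustrative):

```python
import random

def blackwell_macqueen(n, alpha, base_draw):
    """Simulate n draws X_1..X_n from the Blackwell-MacQueen urn:
    with probability alpha/(alpha + i) draw a fresh color from G0,
    otherwise copy the color of a uniformly chosen earlier ball."""
    draws = []
    for i in range(n):
        if random.random() < alpha / (alpha + i):
            draws.append(base_draw())           # new color from G0
        else:
            draws.append(random.choice(draws))  # reuse an existing ball's color
    return draws

# Example: G0 = standard normal; repeated values mark balls sharing a color.
random.seed(0)
xs = blackwell_macqueen(100, alpha=1.0, base_draw=lambda: random.gauss(0, 1))
```

Note the first draw always comes from G0 (the urn starts empty), and small α produces few distinct colors while large α produces many.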
19. Chinese Restaurant Process
Consider a restaurant with infinitely many tables, where the Xn's represent the patrons of the restaurant. From the above conditional probability distribution, we can see that a customer is more likely to sit at a table if there are already many people sitting there. However, with probability proportional to α, the customer will sit at a new table.
Also known as the "clustering effect," and can be seen in the setting of social clubs. [Aldous]
21. Stick Breaking
So far, we've just mentioned properties of a distribution G drawn from a Dirichlet Process.
In 1994, Sethuraman developed a constructive way of forming G, known as "stick breaking."
22. Stick Breaking
1. Draw X1* from G0
2. Draw v1 from Beta(1, α)
3. π1 = v1
4. Draw X2* from G0
5. Draw v2 from Beta(1, α)
6. π2 = v2(1 − v1)
…
In general, πk = vk ∏_{j<k} (1 − vj), and G = Σk πk δ_{Xk*}.
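The construction above, truncated at a finite number of sticks, can be sketched as follows (the truncation level and the handling of the leftover stick are implementation choices for the sketch, not part of Sethuraman's construction):

```python
import random

def stick_breaking(alpha, truncation):
    """Truncated stick-breaking weights: pi_k = v_k * prod_{j<k} (1 - v_j),
    with v_k ~ Beta(1, alpha). The final weight absorbs the remaining stick
    so the truncated weights still sum to 1."""
    remaining = 1.0
    weights = []
    for _ in range(truncation - 1):
        v = random.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= (1.0 - v)
    weights.append(remaining)  # leftover stick mass
    return weights

random.seed(0)
w = stick_breaking(alpha=1.0, truncation=20)
```

Smaller α breaks off larger pieces early (mass concentrates on a few atoms); larger α spreads mass over many atoms.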
23. Formal Definition (not constructive)
Let α be a positive, real-valued scalar, and let G0 be a non-atomic probability distribution over support set A.
G ~ DP(α, G0) if, for every finite partition A1, …, AK of A:
(G(A_1), \dots, G(A_K)) \sim \text{Dirichlet}(\alpha G_0(A_1), \dots, \alpha G_0(A_K))
24. Overview
Dirichlet distribution, and Dirichlet Process introduction
Dirichlet Processes from different perspectives
Samples from a Dirichlet Process
Chinese Restaurant Process representation
Stick Breaking
Formal Definition
Dirichlet Process Mixtures
Inference
25. Finite Mixture Models
A finite mixture model assumes that the data come from a mixture of a finite number of distributions.
Generative process (with hyperparameters α and G0):
π ~ Dirichlet(α/K, …, α/K)
cn ~ Multinomial(π)   (component labels)
ηk ~ G0, for k = 1…K
yn | cn, η1, …, ηK ~ F( · | ηcn), for n = 1…N
26. Infinite Mixture Models
An infinite mixture model assumes that the data come from a mixture of an infinite number of distributions.
Same generative process as the finite model:
π ~ Dirichlet(α/K, …, α/K)
cn ~ Multinomial(π)
ηk ~ G0, for k = 1…K
yn | cn, η1, …, ηK ~ F( · | ηcn), for n = 1…N
but take the limit as K goes to ∞.
Note: the N data points still come from at most N different components [Rasmussen 2000].
27. Dirichlet Process Mixture
G ~ DP(α, G0): a countably infinite number of point masses.
Draw N times from G to get parameters ηn for the different mixture components, then yn ~ F( · | ηn) for n = 1…N.
If ηn were drawn from, e.g., a Gaussian, no two values would be the same; but since they are drawn from a Dirichlet Process-distributed distribution, we expect a clustering of the ηn.
# unique values for ηn = # mixture components
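Drawing data from such a model is straightforward via the Chinese restaurant process representation. A small sketch, assuming G0 is a wide Gaussian over component means and F is Gaussian with fixed observation noise (both choices are illustrative, not part of the definition):

```python
import random

def sample_dp_mixture(n, alpha, base_draw, obs_sd=1.0):
    """Generate n observations from a DP Gaussian mixture: seat each point
    at a table via the Chinese restaurant process, give each new table a
    parameter eta drawn from G0 (base_draw), then emit y ~ N(eta, obs_sd^2)."""
    tables = []   # customer counts per table
    params = []   # eta_k for each table, drawn from G0
    ys, labels = [], []
    for i in range(n):
        # existing table k w.p. n_k/(i + alpha), new table w.p. alpha/(i + alpha)
        r = random.uniform(0, i + alpha)
        acc, k = 0.0, None
        for j, cnt in enumerate(tables):
            acc += cnt
            if r < acc:
                k = j
                break
        if k is None:                  # open a new table
            tables.append(0)
            params.append(base_draw())
            k = len(tables) - 1
        tables[k] += 1
        labels.append(k)
        ys.append(random.gauss(params[k], obs_sd))
    return ys, labels

random.seed(3)
ys, labels = sample_dp_mixture(200, alpha=2.0, base_draw=lambda: random.gauss(0, 10))
```

The number of distinct labels grows roughly like α log N, so the model lets the data determine the effective number of clusters.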
29. Overview
Dirichlet distribution, and Dirichlet Process introduction
Dirichlet Processes from different perspectives
Samples from a Dirichlet Process
Chinese Restaurant Process representation
Stick Breaking
Formal Definition
Dirichlet Process Mixtures
Inference
30. Inference for Dirichlet Process Mixtures
Expectation Maximization (EM) is generally used for inference in a mixture model, but G is nonparametric, making EM difficult. Instead:
Markov Chain Monte Carlo techniques [Neal 2000]
Variational Inference [Blei and Jordan 2006]
31. Aside: Monte Carlo Methods [Basic Integration]
We want to compute the integral
I = \int h(x) f(x)\, dx
where f(x) is a probability density function. In other words, we want E_f[h(X)].
We can approximate this as
\hat{I} = \frac{1}{N} \sum_{i=1}^{N} h(X_i)
where X1, X2, …, XN are sampled from f. By the law of large numbers, \hat{I} \xrightarrow{p} I.
[Lafferty and Wasserman]
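The estimator above takes only a few lines (a minimal illustration; the choices of h and f below are arbitrary):

```python
import random

def mc_expectation(h, sample_f, n):
    """Estimate E_f[h(X)] by averaging h over n i.i.d. draws from f."""
    return sum(h(sample_f()) for _ in range(n)) / n

# Example: E[X^2] for X ~ N(0, 1) is exactly 1, so the estimate should be close.
random.seed(0)
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0, 1), 100_000)
```

The error shrinks like 1/sqrt(N) regardless of dimension, which is the main appeal of Monte Carlo integration.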
32. Aside: Monte Carlo Methods [What if we don't know how to sample from f?]
Importance Sampling
Markov Chain Monte Carlo (MCMC)
Goal is to generate a Markov chain X1, X2, …, whose stationary distribution is f. If so, then
\frac{1}{N} \sum_{i=1}^{N} h(X_i) \xrightarrow{p} I
(under certain conditions)
33. Aside: Monte Carlo Methods [MCMC I: Metropolis-Hastings Algorithm]
Goal: Generate a Markov chain with stationary distribution f(x).
Initialization:
Let q(y | x) be an arbitrary distribution that we know how to sample from. We call q the proposal distribution. A common choice is q(y | x) = N(x, b²) for b > 0.
Arbitrarily choose X0.
Assume we have generated X0, X1, …, Xi. To generate Xi+1:
Generate a proposal value Y ~ q(y | Xi)
Evaluate r ≡ r(Xi, Y), where
r(x, y) = \min\left\{ \frac{f(y)\, q(x \mid y)}{f(x)\, q(y \mid x)},\; 1 \right\}
If q is symmetric, this simplifies to r = \min\{ f(y)/f(x), 1 \}.
Set Xi+1 = Y with probability r, and Xi+1 = Xi with probability 1 − r.
[Lafferty and Wasserman]
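The steps above can be sketched as a random-walk Metropolis sampler, i.e. the symmetric-proposal special case where r reduces to f(Y)/f(Xi). The target and the tuning constant b below are arbitrary choices for the sketch:

```python
import math
import random

def metropolis_hastings(log_f, x0, b, n_steps):
    """Random-walk Metropolis: propose Y ~ N(x, b^2); since this proposal is
    symmetric, accept with probability min(1, f(Y)/f(x)), computed in log space."""
    chain = [x0]
    x = x0
    for _ in range(n_steps):
        y = random.gauss(x, b)
        log_ratio = log_f(y) - log_f(x)
        if log_ratio >= 0 or random.random() < math.exp(log_ratio):
            x = y  # accept the proposal
        chain.append(x)  # on rejection the chain repeats the current state
    return chain

# Example target: standard normal, via its log density up to a constant.
random.seed(1)
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, b=1.0, n_steps=50_000)
```

Only ratios of f appear, so the normalizing constant of the target is never needed, which is exactly why MH is useful for intractable posteriors.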
34. Aside: Monte Carlo Methods [MCMC II: Gibbs Sampling]
Goal: Generate a Markov chain with stationary distribution f(x, y). [Easily extendable to higher dimensions.]
Assumption: We know how to sample from the conditional distributions fX|Y(x | y) and fY|X(y | x). (If not, then we run one iteration of Metropolis-Hastings each time we need to sample from a conditional.)
Initialization: Arbitrarily choose X0, Y0.
Assume we have generated (X0, Y0), …, (Xi, Yi). To generate (Xi+1, Yi+1):
Xi+1 ~ fX|Y(x | Yi)
Yi+1 ~ fY|X(y | Xi+1)
[Lafferty and Wasserman]
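A classic illustration of this two-step update is a standard bivariate normal with correlation ρ, whose conditionals are available in closed form; the parameter values below are arbitrary:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampler for a standard bivariate normal with correlation rho:
    X | Y = y ~ N(rho * y, 1 - rho^2), and symmetrically for Y | X."""
    sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)  # draw from f_{X|Y}(x | y)
        y = random.gauss(rho * x, sd)  # draw from f_{Y|X}(y | x)
        samples.append((x, y))
    return samples

random.seed(2)
samples = gibbs_bivariate_normal(rho=0.8, n_steps=50_000)
```

The empirical correlation of the samples should approach ρ; note that the stronger the correlation, the slower the chain mixes, since each coordinate can move only a little given the other.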
35. MCMC for Dirichlet Process Mixtures [Overview]
We would like to sample from the posterior distribution:
P(η1, …, ηN | y1, …, yN)
If we could, we would be able to determine:
how many distinct components are likely contributing to our data.
what the parameters are for each component.
[Neal 2000] is an excellent resource describing several MCMC algorithms for solving this problem. We will briefly take a look at two of them.
36. MCMC for Dirichlet Process Mixtures [Infinite Mixture Model representation]
MCMC algorithms that are based on the infinite mixture model representation of Dirichlet Process Mixtures are found to be simpler to implement and converge faster than those based on the direct representation.
Thus, rather than sampling for η1, …, ηN directly, we will instead sample for the component indicators c1, …, cN, as well as the component parameters η*c, for all c in {c1, …, cN}.
[Neal 2000]
37. MCMC for Dirichlet Process Mixtures [Gibbs Sampling with Conjugate Priors]
Assume the current state of the Markov chain consists of c1, …, cN, as well as the component parameters η*c, for all c in {c1, …, cN}.
To generate the next sample:
1. For i = 1, …, N:
If ci is currently a singleton, remove η*ci from the state.
Draw a new value for ci from the conditional distribution:
P(c_i = c \mid c_{-i}, y_i, \eta^*) \propto n_{-i,c}\, F(y_i \mid \eta^*_c) for existing c
P(c_i = c \mid c_{-i}, y_i) \propto \alpha \int F(y_i \mid \eta)\, dG_0(\eta) for new c
If the new ci is not associated with any other observation, draw a value for η*ci from the posterior based on G0 and yi.
[Neal 2000, Algorithm 2]
38. MCMC for Dirichlet Process Mixtures [Gibbs Sampling with Conjugate Priors]
2. For all c in {c1, …, cN}:
Draw a new value for η*c from the posterior distribution based on the prior G0 and all the data points currently associated with component c.
This algorithm breaks down when G0 is not a conjugate prior.
[Neal 2000, Algorithm 2]
39. MCMC for Dirichlet Process Mixtures [Gibbs Sampling with Auxiliary Parameters]
Recall from the Gibbs sampling overview: if we do not know how to sample from the conditional distributions, we can interleave one or more Metropolis-Hastings steps. We can apply this technique when G0 is not a conjugate prior, but it can lead to convergence issues [Neal 2000, Algorithms 5-7].
Instead, we will use auxiliary parameters.
Previously, the state of our Markov chain consisted of c1, …, cN, as well as component parameters η*c, for all c in {c1, …, cN}.
When updating ci, we either:
choose an existing component c from c−i (i.e., all cj such that j ≠ i), or
choose a brand new component.
In the previous algorithm, this involved integrating with respect to G0, which is difficult in the non-conjugate case.
[Neal 2000, Algorithm 8]
40. MCMC for Dirichlet Process Mixtures [Gibbs Sampling with Auxiliary Parameters]
When updating ci, we either:
choose an existing component c from c−i (i.e., all cj such that j ≠ i), or
choose a brand new component.
Let K−i be the number of distinct components c in c−i. WLOG, let these components have values in {1, …, K−i}.
Instead of integrating over G0, we will add m auxiliary parameters, each corresponding to a new component independently drawn from G0:
η*_{K−i+1}, …, η*_{K−i+m}
Recall that the probability of selecting a new component is proportional to α. Here, we divide α equally among the m auxiliary components.
[Neal 2000, Algorithm 8]
41. MCMC for Dirichlet Process Mixtures [Gibbs Sampling with Auxiliary Parameters]
Example with m = 3 auxiliary components (η*5, η*6, η*7 are each a fresh draw from G0):

Components:        η*1   η*2   η*3   η*4   η*5   η*6   η*7
Probability (∝):    2     1     2     1    α/3   α/3   α/3

Data:  y1  y2  y3  y4  y5  y6  y7
cj:     1   2   1   3   4   3   ?

This takes care of sampling for ci in the non-conjugate case. A Metropolis-Hastings step can be used to sample η*c. See Neal's paper for more details.
[Neal 2000, Algorithm 8]
42. Conclusion
We now have a statistically principled mechanism for solving our original problem.
This was intended as a general and fairly high-level overview of Dirichlet Processes. Topics left out include Hierarchical Dirichlet Processes, variational inference for Dirichlet Processes, and many more.
Teh's MLSS '07 tutorial provides a much deeper and more detailed take on DPs; highly recommended!
43. Acknowledgments
Many thanks go to David Blei.
Some material for this presentation was inspired by slides from Teg Grenager, Zoubin Ghahramani, and Yee Whye Teh.
44. References
David Blackwell and James B. MacQueen. "Ferguson Distributions via Polya Urn Schemes." Annals of Statistics 1(2), 1973, 353-355.
David M. Blei and Michael I. Jordan. "Variational inference for Dirichlet process mixtures." Bayesian Analysis 1(1), 2006.
Thomas S. Ferguson. "A Bayesian Analysis of Some Nonparametric Problems." Annals of Statistics 1(2), 1973, 209-230.
Zoubin Ghahramani. "Non-parametric Bayesian Methods." UAI Tutorial, July 2005.
Teg Grenager. "Chinese Restaurants and Stick Breaking: An Introduction to the Dirichlet Process."
R. M. Neal. "Markov chain sampling methods for Dirichlet process mixture models." Journal of Computational and Graphical Statistics, 9:249-265, 2000.
C. E. Rasmussen. "The Infinite Gaussian Mixture Model." Advances in Neural Information Processing Systems 12, 554-560. (Eds.) Solla, S. A., T. K. Leen and K. R. Müller, MIT Press, 2000.
Y. W. Teh. "Dirichlet Processes." Machine Learning Summer School 2007 Tutorial and Practical Course.
Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei. "Hierarchical Dirichlet Processes." Journal of the American Statistical Association 101(476):1566-1581, 2006.