http://prml.yonsei.ac.kr/

I talked about the Dirichlet distribution, the Dirichlet process, and HDP.


- 1. Bayesian Nonparametric Topic Modeling: Hierarchical Dirichlet Processes. JinYeong Bak, Department of Computer Science, KAIST, Daejeon, South Korea. jy.bak@kaist.ac.kr. August 22, 2013. Part of these slides is adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk). JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 1 / 121
- 2. Outline: 1 Introduction (Motivation, Topic Modeling) 2 Background (Dirichlet Distribution, Dirichlet Processes) 3 Hierarchical Dirichlet Processes (Dirichlet Process Mixture Models, Hierarchical Dirichlet Processes) 4 Inference (Gibbs Sampling, Variational Inference, Online Learning, Distributed Online Learning) 5 Practical Tips 6 Summary
- 4. Introduction Bayesian topic models: Latent Dirichlet Allocation (LDA) [BNJ03] and Hierarchical Dirichlet Processes (HDP) [TJBB06]. In this talk: the Dirichlet distribution and the Dirichlet process, the concept of Hierarchical Dirichlet Processes (HDP), and how to infer the latent variables in HDP.
- 5. Motivation
- 10. Motivation What are the topics discussed in the article? How can we describe the topics?
- 12. Topic Modeling
- 16. Topic Modeling Each topic has a word distribution
- 17. Topic Modeling Each document has a topic proportion; each word has its own topic index
- 20. Topic Modeling
- 25. Latent Dirichlet Allocation Generative process of LDA: For each topic k ∈ {1, ..., K}: draw a word distribution β_k ∼ Dir(η). For each document d ∈ {1, ..., D}: draw topic proportions θ_d ∼ Dir(α). For each word n ∈ {1, ..., N} in document d: draw a topic index z_dn ∼ Mult(θ_d), then generate the word from the chosen topic, w_dn ∼ Mult(β_{z_dn}).
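The generative process above can be sketched directly with numpy. This is a minimal toy simulation, not an inference implementation; the corpus sizes and symmetric hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N, V = 3, 5, 20, 10        # topics, documents, words per doc, vocabulary size
eta, alpha = 0.1, 0.5            # symmetric hyperparameters (toy values)

# For each topic k: draw a word distribution beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)            # shape (K, V)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))           # topic proportions for doc d
    z = rng.choice(K, size=N, p=theta_d)                 # a topic index per word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word from the chosen topic
    docs.append(w)
```

Running this forward gives a synthetic corpus `docs`; topic modeling is the inverse problem of recovering `beta`, `theta_d`, and `z` from `w` alone.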
- 29. Latent Dirichlet Allocation Our interests: What are the topics discussed in the article? How can we describe the topics?
- 30. Latent Dirichlet Allocation What we can see: words in documents
- 31. Latent Dirichlet Allocation What we want to see
- 32. Latent Dirichlet Allocation Our interests: What are the topics discussed in the article? => Topic proportion of each document. How can we describe the topics? => Word distribution of each topic.
- 33. Latent Dirichlet Allocation What we can see: w. What we want to see: θ, z, β. ∴ Compute p(θ, z, β | w, α, η) = p(θ, z, β, w | α, η) / p(w | α, η). But this posterior is intractable to compute (the normalization term p(w | α, η)), so we use approximate methods: Gibbs sampling or variational inference.
- 35. Limitation of Latent Dirichlet Allocation Latent Dirichlet Allocation is a parametric model: people must assign the number of topics in a corpus, and must find the best number of topics. Q) Can we get it from the data automatically?
- 36. Limitation of Latent Dirichlet Allocation Latent Dirichlet Allocation is a parametric model: people must assign the number of topics in a corpus, and must find the best number of topics. Q) Can we get it from the data automatically? A) Hierarchical Dirichlet Processes
- 38. Dice modeling Think about the probability of a number rolled from dice. Each die has its own pmf. According to the textbook, it is widely assumed to be uniform => 1/6 for a six-sided die. Is it true?
- 40. Dice modeling Think about the probability of a number rolled from dice. According to the textbook, it is widely assumed to be uniform => 1/6 for a six-sided die. Is it true? Ans) No!
- 41. Dice modeling We should model the randomness of the pmf of each die. How can we do that? Let's imagine a bag which holds many dice. We cannot see inside the bag, but we can draw one die from it. OK, but what is the formal description?
- 43. Standard Simplex A generalization of the notion of a triangle or tetrahedron: all points are non-negative and sum to 1. [1] A pmf can be thought of as a point in the standard simplex. Ex) A point p = (x, y, z), where x ≥ 0, y ≥ 0, z ≥ 0 and x + y + z = 1. [1] http://en.wikipedia.org/wiki/Simplex
- 45. Dirichlet distribution Definition [BN06]: a probability distribution over the (K − 1)-dimensional standard simplex; equivalently, a distribution over pmfs of length K. Notation: θ ∼ Dir(α), where θ = [θ_1, ..., θ_K] is a random pmf and α = [α_1, ..., α_K]. Probability density function: p(θ; α) = (Γ(∑_{k=1}^K α_k) / ∏_{k=1}^K Γ(α_k)) ∏_{k=1}^K θ_k^(α_k − 1)
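A quick numerical sanity check of the definition, using numpy's built-in Dirichlet sampler (the concrete α vector is an assumption for illustration): every draw lies on the simplex, and the sample mean approaches α / ∑_k α_k.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 5.0])         # K = 3 (toy choice)
theta = rng.dirichlet(alpha, size=10000)  # 10,000 pmfs on the 2-simplex

# Every draw is a valid pmf: non-negative and summing to 1
assert np.all(theta >= 0)
assert np.allclose(theta.sum(axis=1), 1.0)

# The mean of Dir(alpha) is alpha / sum(alpha) = [0.2, 0.3, 0.5]
print(theta.mean(axis=0))
```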
- 48. Latent Dirichlet Allocation
- 49. Property of Dirichlet distribution Density plots [BAFG10]
- 50. Property of Dirichlet distribution Sample pmfs from the Dirichlet distribution [BAFG10]
- 51. Property of Dirichlet distribution When K = 2, it is the Beta distribution. It is the conjugate prior for the multinomial distribution: with likelihood X ∼ Mult(n, θ) giving counts x = (x_1, ..., x_K) and prior θ ∼ Dir(α), the posterior is (θ | X) ∼ Dir(α + x). Proof) p(θ | X) = p(X | θ) p(θ) / p(X) ∝ p(X | θ) p(θ) = (n! / (x_1! ··· x_K!)) ∏_{k=1}^K θ_k^(x_k) · (Γ(∑_{k=1}^K α_k) / ∏_{k=1}^K Γ(α_k)) ∏_{k=1}^K θ_k^(α_k − 1) = C ∏_{k=1}^K θ_k^(α_k + x_k − 1) ∝ Dir(α + x)
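Conjugacy means the posterior is available in closed form, with no integration: just add the observed counts to the prior parameters. A small sketch (the true pmf and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 1.0, 1.0])       # prior Dir(alpha), K = 3
theta_true = np.array([0.2, 0.3, 0.5])  # hidden pmf we pretend generated the data

# Observe multinomial counts x from theta_true
counts = rng.multinomial(1000, theta_true)

# Conjugacy: the posterior is Dir(alpha + x), in closed form
posterior_alpha = alpha + counts
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean)   # concentrates near theta_true as data grows
```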
- 53. Property of Dirichlet distribution Aggregation property: let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K); then (θ_1 + θ_2, θ_3, ..., θ_K) ∼ Dir(α_1 + α_2, α_3, ..., α_K). In general, if {A_1, ..., A_R} is any partition of {1, ..., K}, then (∑_{k∈A_1} θ_k, ..., ∑_{k∈A_R} θ_k) ∼ Dir(∑_{k∈A_1} α_k, ..., ∑_{k∈A_R} α_k). Decimative property: let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K) and (τ_1, τ_2) ∼ Dir(α_1 β_1, α_1 β_2) where β_1 + β_2 = 1; then (θ_1 τ_1, θ_1 τ_2, θ_2, ..., θ_K) ∼ Dir(α_1 β_1, α_1 β_2, α_2, ..., α_K). Neutrality property: let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K); then θ_k is independent of the vector (1 / (1 − θ_k)) (θ_1, ..., θ_{k−1}, θ_{k+1}, ..., θ_K).
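The aggregation property can be checked by Monte Carlo: summing the first two coordinates of Dir(1, 2, 3) draws should match direct draws from Dir(3, 3) in distribution. This sketch compares the first two sample moments (the α values and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([1.0, 2.0, 3.0])
n = 200000

# Draws from Dir(1, 2, 3) with the first two coordinates summed...
theta = rng.dirichlet(alpha, size=n)
aggregated = np.column_stack([theta[:, 0] + theta[:, 1], theta[:, 2]])

# ...should be distributed as Dir(1 + 2, 3) = Dir(3, 3)
direct = rng.dirichlet(np.array([3.0, 3.0]), size=n)

# Compare the first two moments of the first coordinate
assert abs(aggregated[:, 0].mean() - direct[:, 0].mean()) < 0.005
assert abs(aggregated[:, 0].var() - direct[:, 0].var()) < 0.005
```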
- 58. Dice modeling Think about the probability of a number rolled from dice. Each die has its own pmf; we draw a die from a bag. Problem) We do not know the number of faces of the dice in the bag. Solution) The Dirichlet process.
- 60. Dirichlet Process Definition [BAFG10]: a distribution over probability measures; a distribution whose realizations are themselves distributions over a sample space. Formal definition: (Ω, B) is a measurable space, G_0 is a distribution over the sample space Ω, and α_0 is a positive real number. A random probability measure G over (Ω, B) satisfies G ∼ DP(α_0, G_0) if, for any finite measurable partition (A_1, ..., A_R) of Ω, (G(A_1), ..., G(A_R)) ∼ Dir(α_0 G_0(A_1), ..., α_0 G_0(A_R)).
- 62. Posterior Dirichlet Processes G ∼ DP(α_0, G_0) can be treated as a random distribution over Ω, so we can draw a sample θ_1 from G. For any finite partition (A_1, ..., A_R) of Ω: p(θ_1 ∈ A_r | G) = G(A_r), p(θ_1 ∈ A_r) = G_0(A_r), and (G(A_1), ..., G(A_R)) ∼ Dir(α_0 G_0(A_1), ..., α_0 G_0(A_R)). Using Dirichlet–multinomial conjugacy, the posterior is (G(A_1), ..., G(A_R)) | θ_1 ∼ Dir(α_0 G_0(A_1) + δ_{θ_1}(A_1), ..., α_0 G_0(A_R) + δ_{θ_1}(A_R)), where δ_{θ_1}(A_r) = 1 if θ_1 ∈ A_r and 0 otherwise. This holds for every finite partition of Ω.
- 66. Posterior Dirichlet Processes For every finite partition of Ω, (G(A_1), ..., G(A_R)) | θ_1 ∼ Dir(α_0 G_0(A_1) + δ_{θ_1}(A_1), ..., α_0 G_0(A_R) + δ_{θ_1}(A_R)), where δ_{θ_1}(A_r) = 1 if θ_1 ∈ A_r and 0 otherwise. Hence the posterior process is also a Dirichlet process: G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)). Summary) θ_1 | G ∼ G with G ∼ DP(α_0, G_0) ⟺ θ_1 ∼ G_0 with G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)).
- 69. Blackwell-MacQueen Urn Scheme Now we draw samples θ_1, ..., θ_N. First sample: θ_1 | G ∼ G with G ∼ DP(α_0, G_0) ⟺ θ_1 ∼ G_0 with G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)). Second sample: θ_2 | θ_1, G ∼ G with G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)) ⟺ θ_2 | θ_1 ∼ (α_0 G_0 + δ_{θ_1}) / (α_0 + 1) with G | θ_1, θ_2 ∼ DP(α_0 + 2, (α_0 G_0 + δ_{θ_1} + δ_{θ_2}) / (α_0 + 2)).
- 72. Blackwell-MacQueen Urn Scheme Nth sample: θ_N | θ_{1,...,N−1}, G ∼ G with G | θ_{1,...,N−1} ∼ DP(α_0 + N − 1, (α_0 G_0 + ∑_{n=1}^{N−1} δ_{θ_n}) / (α_0 + N − 1)) ⟺ θ_N | θ_{1,...,N−1} ∼ (α_0 G_0 + ∑_{n=1}^{N−1} δ_{θ_n}) / (α_0 + N − 1) with G | θ_{1,...,N} ∼ DP(α_0 + N, (α_0 G_0 + ∑_{n=1}^N δ_{θ_n}) / (α_0 + N)).
- 73. Blackwell-MacQueen Urn Scheme The Blackwell-MacQueen urn scheme produces a sequence θ_1, θ_2, ... with the conditionals θ_N | θ_{1,...,N−1} ∼ (α_0 G_0 + ∑_{n=1}^{N−1} δ_{θ_n}) / (α_0 + N − 1). Polya urn analogy: an infinite set of ball colors and an initially empty urn. Filling the urn (for n = 1, 2, ...): with probability α_0 / (α_0 + n − 1), pick a new color from the infinite set of ball colors according to G_0, paint a new ball that color, and add it to the urn; with probability (n − 1) / (α_0 + n − 1), pick a ball from the urn, record its color, and put it back together with another ball of the same color.
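The urn scheme above is easy to simulate. This is a minimal sketch in which the base measure G_0 is assumed to be a standard normal (any continuous distribution would do); `polya_urn` is a name chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def polya_urn(n_draws, alpha0, base_draw):
    """Sample theta_1..theta_N via the Blackwell-MacQueen urn scheme.

    base_draw() samples a new 'color' from the base measure G0."""
    draws = []
    for n in range(1, n_draws + 1):
        # New color w.p. alpha0/(alpha0+n-1); reuse an old ball w.p. (n-1)/(alpha0+n-1)
        if rng.random() < alpha0 / (alpha0 + n - 1):
            draws.append(base_draw())
        else:
            draws.append(draws[rng.integers(len(draws))])
    return draws

theta = polya_urn(1000, alpha0=1.0, base_draw=lambda: rng.normal())
print(len(set(theta)))   # distinct values grow only logarithmically with N
```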
- 74. Chinese Restaurant Process Draw θ_1, θ_2, ..., θ_N from the Blackwell-MacQueen urn scheme. The θs can take the same value, θ_i = θ_j, so there are K ≤ N distinct values φ_1, ..., φ_K; this works as a partition of Ω: θ_1, θ_2, ..., θ_N induce φ_1, ..., φ_K. The distribution over such partitions is called the Chinese Restaurant Process (CRP).
- 76. Chinese Restaurant Process θ_1, θ_2, ..., θ_N induce φ_1, ..., φ_K. Chinese Restaurant Process interpretation: there is a Chinese restaurant with infinitely many tables, and each customer sits at a table. Generating from the CRP: the first customer sits at the first table; the n-th customer sits at a new table with probability α_0 / (α_0 + n − 1), and at table k with probability n_k / (α_0 + n − 1), where n_k is the number of customers at table k.
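The seating rule above translates directly into code. A minimal sketch (the customer count and α_0 are assumed toy values; `crp` is a name chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(n_customers, alpha0):
    """Seat customers one by one; return each customer's table index z_n."""
    assignments, table_counts = [], []
    for n in range(1, n_customers + 1):
        # Existing tables w.p. n_k/(alpha0+n-1); a new table w.p. alpha0/(alpha0+n-1)
        probs = np.array(table_counts + [alpha0]) / (alpha0 + n - 1)
        table = rng.choice(len(probs), p=probs)
        if table == len(table_counts):       # the customer opened a new table
            table_counts.append(1)
        else:
            table_counts[table] += 1
        assignments.append(table)
    return assignments, table_counts

z, counts = crp(500, alpha0=2.0)
assert sum(counts) == 500                    # every customer is seated somewhere
```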
- 80. Chinese Restaurant Process The CRP exhibits the clustering property of the DP: tables are clusters, φ_k ∼ G_0, and customers are the actual realizations, θ_n = φ_{z_n} where z_n ∈ {1, ..., K}.
- 81. Stick Breaking Construction The Blackwell-MacQueen urn scheme / CRP generates θ ∼ G, but not G itself; to construct G, we use the stick breaking construction. Review) Posterior Dirichlet processes: θ_1 | G ∼ G with G ∼ DP(α_0, G_0) ⟺ θ_1 ∼ G_0 with G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)). Consider the partition ({θ_1}, Ω∖{θ_1}) of Ω. Then (G(θ_1), G(Ω∖{θ_1})) ∼ Dir((α_0 + 1) · ((α_0 G_0 + δ_{θ_1}) / (α_0 + 1))({θ_1}), (α_0 + 1) · ((α_0 G_0 + δ_{θ_1}) / (α_0 + 1))(Ω∖{θ_1})) = Dir(1, α_0) = Beta(1, α_0).
- 84. Stick Breaking Construction Consider the partition ({θ_1}, Ω∖{θ_1}) of Ω. Then (G(θ_1), G(Ω∖{θ_1})) = (β_1, 1 − β_1) with β_1 ∼ Beta(1, α_0), so G has a point mass located at θ_1: G = β_1 δ_{θ_1} + (1 − β_1) G′, β_1 ∼ Beta(1, α_0), where G′ is the probability measure with the point mass at θ_1 removed. What is G′?
- 87. Stick Breaking Construction Summary) Posterior Dirichlet processes: θ_1 | G ∼ G with G ∼ DP(α_0, G_0) ⟺ θ_1 ∼ G_0 with G | θ_1 ∼ DP(α_0 + 1, (α_0 G_0 + δ_{θ_1}) / (α_0 + 1)), and G = β_1 δ_{θ_1} + (1 − β_1) G′, β_1 ∼ Beta(1, α_0). Consider a further partition ({θ_1}, A_1, ..., A_R) of Ω: (G(θ_1), G(A_1), ..., G(A_R)) = (β_1, (1 − β_1) G′(A_1), ..., (1 − β_1) G′(A_R)) ∼ Dir(1, α_0 G_0(A_1), ..., α_0 G_0(A_R)). Using the decimative property of the Dirichlet distribution (proof): (G′(A_1), ..., G′(A_R)) ∼ Dir(α_0 G_0(A_1), ..., α_0 G_0(A_R)), so G′ ∼ DP(α_0, G_0).
- 90. Stick Breaking Construction Do this repeatedly with distinct values φ1,φ2,··· G ∼ DP(α0,G0) G = β1δφ1 + (1−β1)G1 G = β1δφ1 + (1−β1)(β2δφ2 + (1−β2)G2) ... G = ∑∞ k=1 πk δφk where πk = βk ∏k−1 i=1 (1 −βi ), ∑∞ k=1 πk = 1 βk ∼ Beta(1,α0) φk ∼ G0 Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 46 / 121
- 91. Stick Breaking Construction Summary) G = ∞ ∑ k=1 πk δφk πk = βk k−1 ∏ i=1 (1 −βi ), ∞ ∑ k=1 πk = 1 βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 47 / 121
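The finite truncation of this construction is easy to simulate. A minimal NumPy sketch (the truncation level, the value α0 = 2, and the Gaussian base measure G0 are illustrative assumptions, not fixed by the slides):

```python
import numpy as np

def stick_breaking(alpha0, num_atoms, base_draw, rng):
    """Approximate G ~ DP(alpha0, G0) by truncating the stick-breaking
    construction at num_atoms point masses."""
    betas = rng.beta(1.0, alpha0, size=num_atoms)          # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining                            # pi_k = beta_k * prod_{i<k} (1 - beta_i)
    atoms = base_draw(num_atoms)                           # phi_k ~ G0
    return weights, atoms

rng = np.random.default_rng(0)
weights, atoms = stick_breaking(alpha0=2.0, num_atoms=1000,
                                base_draw=lambda n: rng.normal(size=n), rng=rng)
# at this truncation depth the weights sum to 1 up to a vanishing remainder
```

With a small α0 a few atoms carry almost all the mass; a larger α0 spreads the mass over many atoms.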
- 92. Summary of DP Deﬁnition G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any ﬁnite measurable partition (A1,...,Ar ) of Ω (G(A1),...,G(Ar )) ∼ Dir(α0G0(A1),...,α0G0(Ar )) Chinese Restaurant Process Stick Breaking Construction JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 48 / 121
- 93. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 49 / 121
- 94. Dirichlet Process Mixture Models We model a data set x1,...,xN using the following model [Nea00] xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters modelled using a DP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 50 / 121
- 96. Dirichlet Process Mixture Models Since G is of the form G = ∞ ∑ k=1 πk δφk We have θn = φk with probability πk Let kn take on value k with probability πk . We can equivalently deﬁne θn = φkn An equivalent model xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
- 99. Dirichlet Process Mixture Models ⇐⇒ xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 52 / 121
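Data can be generated from this equivalent model by drawing stick weights and atoms lazily, only when a new cluster index is reached. Everything concrete below (unit-variance Gaussian F, base measure G0 = N(0, 5²), α0 = 1) is an assumption for illustration:

```python
import numpy as np

def sample_dp_mixture(n, alpha0, rng):
    """Generate x_1..x_n from a DP mixture of unit-variance Gaussians,
    extending the stick-breaking weights pi_k and atoms phi_k lazily
    as new clusters are reached (G0 = N(0, 5^2) is an assumed base measure)."""
    weights, atoms = [], []
    data, labels = [], []
    for _ in range(n):
        u, acc, k = rng.random(), 0.0, 0
        while True:
            if k == len(weights):                    # extend the stick: pi_k = beta_k * leftover
                beta_k = rng.beta(1.0, alpha0)
                leftover = 1.0 - sum(weights)
                weights.append(beta_k * leftover)
                atoms.append(rng.normal(0.0, 5.0))   # phi_k ~ G0
            acc += weights[k]
            if u < acc:
                break
            k += 1
        labels.append(k)
        data.append(rng.normal(atoms[k], 1.0))       # x_n ~ F(phi_{k_n})
    return np.array(data), np.array(labels)

rng = np.random.default_rng(1)
x, z = sample_dp_mixture(200, alpha0=1.0, rng=rng)
```

The number of distinct labels grows roughly logarithmically with n when α0 = 1, matching the clustering behaviour of the DP.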
- 100. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 53 / 121
- 101. Topic modeling with documents Each document consists of a bag of words Each word in a document has a latent topic index Latent topics for words in a document can be grouped Each document has a topic proportion Each topic has a word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 54 / 121
- 103. Problem of Naive Dirichlet Process Mixture Model Use a DP mixture for each document xdn ∼ F(θdn), θdn ∼ Gd , Gd ∼ DP(α0,G0) But there is no sharing of clusters across different groups, because G0 is smooth: G1 = ∑∞ k=1 π1k δφ1k , G2 = ∑∞ k=1 π2k δφ2k φ1k ,φ2k ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 55 / 121
- 105. Problem of Naive Dirichlet Process Mixture Model Solution Make the base distribution G0 discrete Put a DP prior on the common base distribution Hierarchical Dirichlet Process G0 ∼ DP(γ,H) G1,G2|G0 ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 56 / 121
- 107. Hierarchical Dirichlet Processes Making G0 discrete forces shared cluster between G1 and G2 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 57 / 121
- 108. Stick Breaking Construction A Hierarchical Dirichlet Process with 1,...,D documents G0 ∼ DP(γ,H) Gd |G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∑∞ k=1 βk δφk φk ∼ H βk = β′k ∏k−1 i=1 (1 −β′i ) β′k ∼ Beta(1,γ) Gd = ∑∞ k=1 πdk δφk πdk = π′dk ∏k−1 i=1 (1 −π′di ) π′dk ∼ Beta(α0βk ,α0(1 −∑k i=1 βi )) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 58 / 121
- 111. Chinese Restaurant Franchise G0 ∼ DP(γ,H), φk ∼ H Gd |G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen Urn Scheme θd1,θd2,... induces to φd1,φd2,... Draw θd 1,θd 2,... from a Blackwell-MacQueen Urn Scheme θd 1,θd 2,... induces to φd 1,φd 2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 61 / 121
- 112. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinitely many tables All restaurants share the same food menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant The first customer sits at the first table and chooses a new menu item The n-th customer sits at A new table with probability α0/(α0+n−1) Table t with probability ndt /(α0+n−1), where ndt is the number of customers at table t A customer starting a new table chooses A new menu item with probability γ/(γ+m−1) Existing menu item k with probability mk /(γ+m−1), where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
- 116. Chinese Restaurant Franchise JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 63 / 121
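The franchise process can be simulated directly from the seating and menu probabilities stated above. A sketch (the restaurant and customer counts and both concentration values are arbitrary choices for illustration):

```python
import numpy as np

def chinese_restaurant_franchise(num_restaurants, num_customers, alpha0, gamma, rng):
    """Simulate table and menu assignments in the CRF.
    Returns per-restaurant (table sizes, table dishes) and the global
    counts m_k of tables serving each dish."""
    dish_tables = []                     # m_k across all restaurants
    seatings = []
    for _ in range(num_restaurants):
        table_counts, table_dish = [], []
        for n in range(num_customers):
            # new table w.p. alpha0/(alpha0+n), existing table t w.p. n_dt/(alpha0+n)
            probs = np.array(table_counts + [alpha0], dtype=float)
            t = rng.choice(len(probs), p=probs / probs.sum())
            if t == len(table_counts):
                table_counts.append(1)
                # the new table picks dish k w.p. m_k/(gamma+m), a new dish w.p. gamma/(gamma+m)
                dprobs = np.array(dish_tables + [gamma], dtype=float)
                k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
                if k == len(dish_tables):
                    dish_tables.append(1)
                else:
                    dish_tables[k] += 1
                table_dish.append(k)
            else:
                table_counts[t] += 1
        seatings.append((table_counts, table_dish))
    return seatings, dish_tables

rng = np.random.default_rng(2)
seatings, dish_tables = chinese_restaurant_franchise(3, 50, alpha0=1.0, gamma=1.0, rng=rng)
```

Because all restaurants draw dishes from the shared counts m_k, popular dishes (topics) are reused across restaurants (documents), which is exactly the sharing the naive DP mixture lacked.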
- 117. HDP for Topic modeling Questions What can we assume about the topics in a document? What can we assume about the words in the topics? Solution Each document consists of a bag of words Each word in a document has a latent topic Latent topics for words in a document can be grouped Each document has a topic proportion Each topic has a word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 64 / 121
- 119. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 65 / 121
- 120. Gibbs Sampling Deﬁnition A special case of Markov-chain Monte Carlo (MCMC) method An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09] Algorithm Find full conditional distribution of latent variables of target distribution Initialize all latent variables Sampling until converged Sample one latent variable from full conditional distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 66 / 121
- 122. Collapsed Gibbs sampling Collapsed Gibbs sampling integrates out one or more variables when sampling for some other variable. Example) There are three latent variables A, B and C. We sample p(A|B,C), p(B|A,C) and p(C|A,B) sequentially. But when we integrate out B, we sample only p(A|C) and p(C|A) sequentially. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 67 / 121
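The sequential-conditional idea can be checked on a toy example. The joint table below is invented for illustration (think of it as p(A, C) after a third variable B has been integrated out); Gibbs alternates the two full conditionals, and the empirical state frequencies converge to the target joint:

```python
import numpy as np

# Invented joint p(A, C) over two binary variables; rows index A, columns index C.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])

def gibbs(num_iters, rng):
    """Alternate sampling the full conditionals p(A|C) and p(C|A)
    and record the visited states."""
    a, c = 0, 0
    counts = np.zeros_like(joint)
    for _ in range(num_iters):
        a = rng.choice(2, p=joint[:, c] / joint[:, c].sum())   # A | C = c
        c = rng.choice(2, p=joint[a, :] / joint[a, :].sum())   # C | A = a
        counts[a, c] += 1
    return counts / num_iters

rng = np.random.default_rng(3)
est = gibbs(50_000, rng)   # empirical frequencies approximate the joint
```

The chain never evaluates the normalized joint directly; each step only needs a conditional, which is what makes Gibbs sampling practical for models like the HDP.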
- 123. Review) Dirichlet Process Mixture Models ⇐⇒ xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 68 / 121
- 124. Review) Blackwell-MacQueen Urn Scheme for DP Nth sample θN|θ1,...,N−1,G ∼ G G|θ1,...,N−1 ∼ DP(α0 +N −1, α0G0 +∑N−1 n=1 δθn α0 +N −1 ) ⇐⇒ θN|θ1,...,N−1 ∼ α0G0 +∑N−1 n=1 δθn α0 +N −1 G|θ1,...,N ∼ DP(α0 +N, α0G0 +∑N n=1 δθn α0 +N ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 69 / 121
- 125. Review) Chinese Restaurant Franchise Generating from the Chinese Restaurant Franchise For each restaurant The first customer sits at the first table and chooses a new menu item The n-th customer sits at A new table with probability α0/(α0+n−1) Table t with probability ndt /(α0+n−1), where ndt is the number of customers at table t A customer starting a new table chooses A new menu item with probability γ/(γ+m−1) Existing menu item k with probability mk /(γ+m−1), where m is the number of tables in all restaurants and mk is the number of tables serving menu item k in all restaurants JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 70 / 121
- 126. Alternative form of HDP G0 ∼ DP(γ,H), φdt ∼ G0 ∴ G0|φdt ,... ∼ DP(γ +m, γH+∑K k=1 mk δφk γ+m ) Then G0 is given as G0 = K ∑ k=1 βk δφk +βuGu where Gu ∼ DP(γ,H) π = (π1,...,πK ,πu) ∼ Dir(m1,...,mK ,γ) p(φk |·) ∝ h(φk ) ∏ dn:zdn=k f(xdn|φk ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 71 / 121
- 128. Hierarchical Dirichlet Processes ⇐⇒ xdn ∼ F(θdn) θdn ∼ Gd Gd ∼ DP(α0,G0) G0 ∼ DP(γ,H) ⇐⇒ xdn ∼ Mult(φzdn ) zdn ∼ Mult(θd ) φk ∼ Dir(η) θd ∼ Dir(α0π) π ∼ Dir(m.1,...,m.K ,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 72 / 121
- 129. Gibbs Sampling for HDP Joint distribution p(θ,z,φ,x,π,m|α0,η,γ) = p(π|m,γ) K ∏ k=1 p(φk |η) D ∏ d=1 p(θd |α0,π) N ∏ n=1 p(zdn|θd ) p(xdn|zdn,φ) Integrate out θ,φ p(z,x,π,m|α0,η,γ) = Γ(∑K k=1 m.k +γ) ∏K k=1 Γ(m.k )Γ(γ) K ∏ k=1 πm.k −1 k π γ−1 K+1 K ∏ k=1 Γ(∑V v=1 ηv ) ∏V v=1 Γ(ηv ) ∏V v=1 Γ(ηv +nk (·),v ) Γ(∑V v=1 ηv +nk (·),v ) M ∏ d=1 Γ(∑K k=1 α0πk ) ∏K k=1 Γ(α0πk ) ∏K k=1 Γ(α0πk +nk d,(·)) Γ(∑K k=1 α0πk +nk d,(·)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 73 / 121
- 130. Gibbs Sampling for HDP Full conditional distribution of z p(z(d′,n′) = k′ | z−(d′,n′), m, π, x, ·) = p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·) / p(z−(d′,n′), m, π, x | ·) ∝ p(z(d′,n′) = k′, z−(d′,n′), m, π, x | ·) ∝ (α0πk′ + n_{d′,(·)}^{k′,−(d′,n′)}) · (ηv + n_{(·),v}^{k′,−(d′,n′)}) / ∑V v=1 (ηv + n_{(·),v}^{k′,−(d′,n′)}) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 74 / 121
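Sampling from this kind of unnormalized full conditional is a one-liner once the count arrays are maintained. A sketch (the array names n_dk, n_kv, n_k, the symmetric η, and all numbers are illustrative; the current word's counts are assumed already decremented):

```python
import numpy as np

def sample_topic(alpha0, pi, n_dk, n_kv, n_k, eta, V, rng):
    """Sample z_{dn} from p(z = k | ...) proportional to
    (alpha0 * pi_k + n_dk) * (eta + n_kv) / (V * eta + n_k)."""
    weights = (alpha0 * pi + n_dk) * (eta + n_kv) / (V * eta + n_k)
    weights /= weights.sum()               # normalize the unnormalized conditional
    return rng.choice(len(weights), p=weights)

rng = np.random.default_rng(4)
pi = np.array([0.5, 0.3, 0.2])        # global topic weights
n_dk = np.array([3.0, 1.0, 0.0])      # topic counts in document d (current word removed)
n_kv = np.array([2.0, 0.0, 1.0])      # counts of the current word v under each topic
n_k = np.array([10.0, 4.0, 2.0])      # total word counts under each topic
z = sample_topic(1.0, pi, n_dk, n_kv, n_k, eta=0.1, V=20, rng=rng)
```

In a full sampler the chosen topic's counts are then incremented again before moving to the next word.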
- 131. Gibbs Sampling for HDP Full conditional distribution of m The probability that word xd′n′ is assigned to some table t such that kdt = k p(θd′n′ = φdt | φdt = φk , θ−(d′,n′), π) ∝ n_{d,(·),t}^{−(d′,n′)} p(θd′n′ = new table | φdt_new = φk , θ−(d′,n′), π) ∝ α0πk These equations form a Dirichlet process with concentration parameter α0πk and assignment of n_{d,(·),(·)}^{k,−(d′,n′)} customers to components The corresponding distribution over the number of components is the desired conditional distribution of mdk Antoniak [Ant74] has shown that p(md′k′ = m | z, m−(d′,k′), π) = Γ(α0πk′) / Γ(α0πk′ + n_{d′,(·),(·)}^{k′}) · s(n_{d′,(·),(·)}^{k′}, m) (α0πk′)^m where s(n,m) is the unsigned Stirling number of the first kind JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
- 134. Gibbs Sampling for HDP Full conditional distribution of π (π1,π2,...,πK ,πu)|· ∼ Dir(m.1,m.2,...,m.K ,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 76 / 121
- 135. Gibbs Sampling for HDP Algorithm 1 Gibbs Sampling for HDP 1: Initialize all latent variables at random 2: repeat 3: for Each document d do 4: for Each word n in document d do 5: Sample z(d,n) ∼ Mult( (α0πk + n_{d,(·)}^{k,−(d,n)}) (ηv + n_{(·),v}^{k,−(d,n)}) / ∑V v=1 (ηv + n_{(·),v}^{k,−(d,n)}) ) 6: end for 7: Sample m from p(mdk = m | ·) = Γ(α0πk) / Γ(α0πk + n_{d,(·),(·)}^{k}) · s(n_{d,(·),(·)}^{k}, m) (α0πk)^m 8: Sample π ∼ Dir(m.1, m.2, ..., m.K , γ) 9: end for 10: until Converged JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 77 / 121
- 136. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 78 / 121
- 137. Stick Breaking Construction A Hierarchical Dirichlet Process with 1,...,D documents G0 ∼ DP(γ,H) Gd |G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∑∞ k=1 βk δφk φk ∼ H βk = β′k ∏k−1 i=1 (1 −β′i ) β′k ∼ Beta(1,γ) Gd = ∑∞ k=1 πdk δφk πdk = π′dk ∏k−1 i=1 (1 −π′di ) π′dk ∼ Beta(α0βk ,α0(1 −∑k i=1 βi )) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 79 / 121
- 138. Alternative Stick Breaking Construction Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated: βk = β′k ∏k−1 i=1 (1 −β′i ) β′k ∼ Beta(1,γ) πdk = π′dk ∏k−1 i=1 (1 −π′di ) π′dk ∼ Beta(α0βk ,α0(1 −∑k i=1 βi )) Alternative stick-breaking construction for each document [FSJW08] ψdt ∼ G0 πdt = π′dt ∏t−1 i=1 (1 −π′di ) π′dt ∼ Beta(1,α0) Gd = ∑∞ t=1 πdt δψdt JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 80 / 121
- 140. Alternative Stick Breaking Construction The stick-breaking construction for the HDP G0 = ∞ ∑ k=1 βk δφk φk ∼ H βk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,γ) Gd = ∞ ∑ t=1 πdt δψdt ψdt ∼ G0 πdt = πdt t−1 ∏ i=1 (1 −πdi ) πdt ∼ Beta(1,α0) To connect ψdt and φk We add auxiliary variable cdt ∼ Mult(β) Then ψdt = φcdt JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 82 / 121
- 141. Alternative Stick Breaking Construction Generative process 1 For each global-level topic k ∈ {1,...,∞}: 1 Draw topic word proportions, φk ∼ Dir(η) 2 Draw a corpus breaking proportion, β′k ∼ Beta(1,γ) 2 For each document d ∈ {1,...,D}: 1 For each document-level topic t ∈ {1,...,∞}: 1 Draw document-level topic indices, cdt ∼ Mult(σ(β′)) 2 Draw a document breaking proportion, π′dt ∼ Beta(1,α0) 2 For each word n ∈ {1,...,N}: 1 Draw a topic index zdn ∼ Mult(σ(π′d )) 2 Generate a word wdn ∼ Mult(φcd,zdn ), where σ(β′) ≡ {β1,β2,...}, βk = β′k ∏k−1 i=1 (1 −β′i ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 83 / 121
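A truncated version of this generative process can be written directly. The truncation levels K and T and all hyperparameter values below are assumptions for illustration; the last stick in each truncated stick-breaking takes the leftover mass so the weights sum to one:

```python
import numpy as np

def generate_hdp_corpus(D, N, V, gamma, alpha0, eta, K, T, rng):
    """Truncated sketch of the generative process above:
    K corpus-level topics, T document-level topics per document."""
    def stick(draws):
        pieces = np.append(draws, 1.0)                 # final stick takes the remainder
        return pieces * np.concatenate(([1.0], np.cumprod(1.0 - pieces[:-1])))
    phi = rng.dirichlet(np.full(V, eta), size=K)       # phi_k ~ Dir(eta)
    beta = stick(rng.beta(1.0, gamma, size=K - 1))     # sigma(beta')
    docs = []
    for _ in range(D):
        c = rng.choice(K, size=T, p=beta)              # c_dt ~ Mult(sigma(beta'))
        pi = stick(rng.beta(1.0, alpha0, size=T - 1))  # sigma(pi_d')
        z = rng.choice(T, size=N, p=pi)                # z_dn ~ Mult(sigma(pi_d))
        words = np.array([rng.choice(V, p=phi[c[t]]) for t in z])
        docs.append(words)
    return docs, phi, beta

docs, phi, beta = generate_hdp_corpus(D=5, N=30, V=50, gamma=1.0,
                                      alpha0=1.0, eta=0.5, K=10, T=8,
                                      rng=np.random.default_rng(5))
```

Because every c_dt indexes into the shared φ, all documents draw words from the same pool of corpus-level topics, while each document keeps its own weights.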
- 142. Variational Inference Main idea [JGJS98] Modify the original graphical model to a simpler model Minimize the divergence between the original and the modified one More Formally Observed data X, Latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the divergence between p and q (commonly the KL-divergence of p from q, DKL(q||p)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121
- 144. KL-divergence of p from q Find lower bound of log evidence logp(X) logp(X) = log ∑ {Z} p(Z,X) = log ∑ {Z} p(Z,X) q(Z|X) q(Z|X) = log ∑ {Z} q(Z|X) p(Z,X) q(Z|X) ≥ ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) 3 Gap between lower bound of logp(X) and logp(X) logp(X)− ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) = ∑ Z q(Z)log q(Z) p(Z|X) = DKL(q||p) 3 Use Jensen’s inequality JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121
- 146. KL-divergence of p from q logp(X) = ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) +DKL(q||p) Log evidence logp(X) is ﬁxed with respect to q Minimising DKL(q||p) ≡ Maximizing lower bound of logp(X) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 86 / 121
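This identity is easy to verify numerically on a tiny discrete model (the joint table below is invented for illustration): the ELBO never exceeds the log evidence, it is tight exactly when q equals the true posterior, and the gap equals DKL(q||p):

```python
import numpy as np

# Invented tiny model: Z ranges over {0, 1, 2}, X is a fixed observation,
# and p_joint[z] = p(Z = z, X = x_obs).
p_joint = np.array([0.05, 0.12, 0.03])
log_px = np.log(p_joint.sum())                  # exact log evidence log p(X)

def elbo(q):
    """Lower bound sum_z q(z) * log(p(z, x) / q(z)) for a distribution q over Z."""
    return float(np.sum(q * (np.log(p_joint) - np.log(q))))

q_bad = np.array([0.6, 0.2, 0.2])               # arbitrary variational distribution
q_exact = p_joint / p_joint.sum()               # the true posterior p(Z | X)

gap = log_px - elbo(q_bad)                      # equals KL(q_bad || p(Z|X)) >= 0
```

In real models the sum over Z is intractable, which is precisely why we maximize the ELBO instead of computing log p(X) directly.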
- 147. Variational Inference Main idea [JGJS98] Modify the original graphical model to a simpler model Minimize the divergence between the original and the modified one More Formally Observed data X, Latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the divergence between p and q (commonly the KL-divergence of p from q, DKL(q||p)) Find a lower bound of logp(X), then maximize it JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 87 / 121
- 148. Variational Inference for HDP q(β,φ,π,c,z) = K ∏ k=1 q(φk |λk ) K−1 ∏ k=1 q(βk |a1 k ,a2 k ) D ∏ d=1 T ∏ t=1 q(cdt |ζdt ) T−1 ∏ t=1 q(πdt |γ1 dt ,γ2 dt ) N ∏ n=1 q(zdn|ϕdn) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 88 / 121
- 149. Variational Inference for HDP Find lower bound of logp(w|α0,γ,η) lnp(w|α0,γ,η) = ln β φ π ∑ c ∑ z p(w,β,φ,π,c,z|α0,γ,η) dβ dφ dπ = ln β φ π ∑ c ∑ z p(w,β,φ,π,c,z|α0,γ,η)·q(β,φ,π,c,z) q(β,φ,π,c,z) dβ dφ dπ ≥ β φ π ∑ c ∑ z ln p(w,β,φ,π,c,z|α0,γ,η) q(β,φ,π,c,z) ·q(β,φ,π,c,z) dβ dφ dπ ( Jensen’s inequality) = β φ π ∑ c ∑ z lnp(w,β,φ,π,c,z|α0,γ,η)·q(β,φ,π,c,z) dβ dφ dπ − β φ π ∑ c ∑ z lnq(β,φ,π,c,z)·q(β,φ,π,c,z) dβ dφ dπ = Eq[lnp(w,β,φ,π,c,z|α0,γ,η)]−Eq[lnq(β,φ,π,c,z)] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 89 / 121
- 150. Variational Inference for HDP lnp(w|α0,γ,η) ≥ Eq[lnp(w,β,φ,π,c,z|α0,γ,η)]−Eq[lnq(β,φ,π,c,z)] = Eq[lnp(β|γ)p(φ|η) D ∏ d=1 p(πd |α0)p(cd |β) N ∏ n=1 p(wdn|cd ,zdn,φ)p(zdn|πd )] −Eq[ln K ∏ k=1 q(φk |λk ) K−1 ∏ k=1 q(βk |a1 k ,a2 k ) D ∏ d=1 T ∏ t=1 q(cdt |ζdt ) T−1 ∏ t=1 q(πdt |γ1 dt ,γ2 dt ) N ∏ n=1 q(zdn|ϕdn)] = D ∑ d=1 Eq[lnp(πd |α0)]+Eq[lnp(cd |β)]+Eq[lnp(wd |cd ,zd ,φ)]+Eq[lnp(zd |πd )] −Eq[lnq(cd |ζd )]−Eq[lnq(πd |γ1 d ,γ2 d )]−Eq[lnq(zd |ϕd )] +Eq[lnp(β|γ)]+Eq[lnp(φ|η)]−Eq[lnq(φ|λ)]−Eq[lnq(β|a1 ,a2 )] We can run Variational EM to maximize lower bound of logp(w|α0,γ,η) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 90 / 121
- 151. Variational Inference for HDP Maximize lower bound of logp(w|α0,γ,η) Derivative of it with respect to each variational parameter γ1 dt = 1 + N ∑ n=1 ϕdnt , γ2 dt = α0 + N ∑ n=1 T ∑ b=t+1 ϕdnb ζdtk = exp{ k−1 ∑ e=1 (Ψ(a2 e)−Ψ(a1 e +a2 e))+(Ψ(a1 k )−Ψ(a1 k +a2 k )) + N ∑ n=1 V ∑ v=1 wv dnϕdnt (Ψ(λkv )−Ψ( V ∑ l=1 λkl ))} ϕdnt = exp{ t−1 ∑ h=1 (Ψ(γ2 dh)−Ψ(γ1 dh +γ2 dh))+(Ψ(γ1 dt )−Ψ(γ1 dt +γ2 dt )) + K ∑ k=1 V ∑ v=1 wv dnζdtk (Ψ(λkv )−Ψ( V ∑ l=1 λkl ))} a1 k = 1 + D ∑ d=1 T ∑ t=1 ζdtk , a2 k = γ + D ∑ d=1 T ∑ t=1 K ∑ f=k+1 ζdtf λkv = ηv + D ∑ d=1 N ∑ n=1 T ∑ t=1 wv dnϕdnt ζdtk JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 91 / 121
- 152. Variational Inference for HDP Maximize lower bound of logp(w|α0,γ,η) Derivative of it with respect to each variational parameter Run Variational EM E step: compute document level parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt M step: compute corpus level parameters a1 k ,a2 k ,λkv Algorithm 2 Variational Inference for HDP 1: Initialize the variational parameters 2: repeat 3: for Each document d do 4: repeat 5: Compute document parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt 6: until Converged 7: end for 8: Compute topic parameters a1 k ,a2 k ,λkv 9: until Converged JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 92 / 121
- 153. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 93 / 121
- 154. Online Variational Inference Stochastic optimization to the variational objective [WPB11] Subsample the documents Compute approximation of the gradient based on subsample Follow that gradient with a decreasing step-size JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 94 / 121
- 155. Variational Inference for HDP Lower bound of logp(w|α0,γ,η) lnp(w|α0,γ,η) ≥ ∑D d=1 Eq[lnp(πd |α0)]+Eq[lnp(cd |β)]+Eq[lnp(wd |cd ,zd ,φ)]+Eq[lnp(zd |πd )] −Eq[lnq(cd |ζd )]−Eq[lnq(πd |γ1 d ,γ2 d )]−Eq[lnq(zd |ϕd )] +Eq[lnp(β|γ)]+Eq[lnp(φ|η)]−Eq[lnq(φ|λ)]−Eq[lnq(β|a1 ,a2 )] = ∑D d=1 Ld + Lk = Ed [D(Ld + (1/D)Lk )] (expectation over a uniformly sampled document d) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 95 / 121
- 156. Online Variational Inference for HDP Lower bound of logp(w|α0,γ,η) = Ed [D(Ld + (1/D)Lk )] Online learning algorithm for HDP Sample a document d Compute its optimal document-level parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt Take the (natural) gradient of the corpus level parameters a1 k ,a2 k ,λkv with noise Update corpus level parameters a1 k ,a2 k ,λkv with a decreasing learning rate a1 k = (1 −ρe)a1 k +ρe(1 +D ∑T t=1 ζdtk ) a2 k = (1 −ρe)a2 k +ρe(γ +D ∑T t=1 ∑K f=k+1 ζdtf ) λkv = (1 −ρe)λkv +ρe(ηv +D ∑N n=1 ∑T t=1 wv dnϕdnt ζdtk ) where ρe is the learning rate, which satisfies ∑∞ e=1 ρe = ∞, ∑∞ e=1 ρ2 e < ∞ The natural gradient is structurally equivalent to the variational inference update JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 96 / 121
- 157. Online Variational Inference for HDP Algorithm 3 Online Variational Inference for HDP 1: Initialize the variational parameters 2: e = 0 3: for Each document d ∈ {1,...,D} do 4: repeat 5: Compute document parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt 6: until Converged 7: e = e +1 8: Compute learning rate ρe = (τ0 +e)−κ where τ0 > 0,κ ∈ (0.5,1] 9: Update topic parameters a1 k ,a2 k ,λkv 10: end for JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 97 / 121
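The learning-rate schedule and the blended update in Algorithm 3 can be sketched as follows. The target values are stand-ins for the per-document sufficient statistics scaled by D, not real HDP quantities, and τ0, κ are the assumed defaults:

```python
import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    """rho_e = (tau0 + e)^(-kappa); any kappa in (0.5, 1] satisfies the
    conditions sum rho_e = infinity and sum rho_e^2 < infinity."""
    return (tau0 + e) ** (-kappa)

def online_update(param, noisy_target, rho):
    """Blend the current corpus-level parameter toward the estimate
    computed from a single sampled document, as in the slide's updates."""
    return (1.0 - rho) * param + rho * noisy_target

# toy run: with a constant target the iterates converge to it
lam = np.ones(5)
for e in range(1, 100):
    target = np.full(5, 3.0)        # stand-in for eta_v + D * sum(...)
    lam = online_update(lam, target, learning_rate(e))
```

The decreasing step size is what lets the noisy single-document gradients average out while still moving the parameters arbitrarily far over time.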
- 158. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 98 / 121
- 159. Motivation Problem 1: Inference for HDP takes a long time Problem 2: Continuously expanding corpus necessitates continuous updates of model parameters But updating of model parameters is not possible with plain HDP Must re-train with the entire updated corpus Our Approach: Combine distributed inference and online learning JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 99 / 121
- 160. Distributed Online HDP Based on variational inference Mini-batch updates via stochastic learning (variational EM) Distribute variational EM using MapReduce JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 100 / 121
- 161. Distributed Online HDP Algorithm 4 Distributed Online HDP - Driver 1: Initialize the variational parameters 2: e = 0 3: while Run forever do 4: Collect new documents s ∈ {1,...,S} 5: e = e +1 6: Compute learning rate ρe = (τ0 +e)−κ where τ0 > 0,κ ∈ (0.5,1] 7: Run MapReduce job 8: Get result of job and update topic parameters 9: end while JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 101 / 121
- 162. Distributed Online HDP Algorithm 5 Distributed Online HDP - Mapper 1: Mapper get one document s ∈ {1,...,S} 2: repeat 3: Compute document parameters γ1 dt ,γ2 dt ,ζdtk ,ϕdnt 4: until Converged 5: Output the sufﬁcient statistics for topic parameters Algorithm 6 Distributed Online HDP - Reducer 1: Reducer get sufﬁcient statistics for each topic parameter 2: Compute changes of topic parameter with sufﬁcient statistics 3: Output the changes of topic parameter JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 102 / 121
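The mapper/reducer split can be mimicked in plain Python: each mapper emits per-document sufficient statistics as key-value pairs and the reducer sums them per parameter. This is a stand-in sketch (documents here are just lists of topic indices, not the real statistics for a1 k, a2 k, λkv):

```python
from collections import defaultdict

def mapper(doc):
    """Per-document stand-in for the E-step: emit (key, value) pairs of
    sufficient statistics (here simple topic counts)."""
    stats = defaultdict(float)
    for topic in doc:
        stats[topic] += 1.0
    return list(stats.items())

def reducer(pairs):
    """Sum the emitted statistics per key, as the reduce phase does."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

docs = [[0, 1, 1], [1, 2], [0, 0, 2]]
mapped = [pair for doc in docs for pair in mapper(doc)]
totals = reducer(mapped)
```

The driver then applies the summed statistics with the current learning rate, which is why only the small reduced output, not the documents, has to flow back to it.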
- 163. Experimental Setup Data: 973,266 Twitter conversations, 7.54 tweets / conv Approximately 7,297,000 tweets 60 node Hadoop system Each node with 8 x 2.30GHz cores JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 103 / 121
- 164. Result Distributed Online HDP runs faster than online HDP Distributed Online HDP preserves the quality of the result (perplexity) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 104 / 121
- 165. Practical Tips Until now, I talked about Bayesian Nonparametric Topic Modeling: the concept of Hierarchical Dirichlet Processes and how to infer the latent variables in HDP These are theoretical interests Someone who attended the last machine learning winter school said: Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field. So I prepared some tips for him/her and you JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 105 / 121
- 168. Implementation https://github.com/NoSyu/Topic_Models JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121
- 169. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121
- 171. HDP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121
- 172. Property of Dirichlet distribution Sample pmfs from Dirichlet distribution [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121
- 173. Assign Dirichlet parameters Dirichlet parameters are less than 1 People usually use a few topics to write a document People usually do not use all topics Each topic usually uses a few words to represent its own topic Each topic does not use all words We can assign weights to individual topics/words Some topics are more general than others Some words are more general than others Words that have positive/negative meaning are shown in positive/negative sentiments [JO11] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121
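The sparsity claim above is easy to check numerically. A small NumPy sketch (the seed, dimension, and concentration values are arbitrary choices, not from the talk): with concentration parameters below 1 the Dirichlet puts most of its mass near the corners of the simplex, so sampled proportions concentrate on a few components.

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha < 1: sparse draws -- a document uses only a few topics,
# a topic puts weight on only a few words.
sparse = rng.dirichlet(np.full(10, 0.1))

# alpha > 1: dense draws -- mass spread almost evenly.
dense = rng.dirichlet(np.full(10, 10.0))

print("alpha=0.1:", np.round(sparse, 3))  # a few large entries dominate
print("alpha=10 :", np.round(dense, 3))   # entries all near 1/10
```

Both draws always sum to 1 (they live on the probability simplex); only how the mass is distributed changes with the concentration parameter.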
- 177. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121
- 178. Compute learning rate $\rho_e = (\tau_0 + e)^{-\kappa}$ where $\tau_0 > 0,\ \kappa \in (0.5, 1]$
$a_k^1 = (1-\rho_e)\,a_k^1 + \rho_e\Big(1 + D\sum_{t=1}^{T}\zeta_{dtk}\Big)$
$a_k^2 = (1-\rho_e)\,a_k^2 + \rho_e\Big(\gamma + D\sum_{t=1}^{T}\sum_{f=k+1}^{K}\zeta_{dtf}\Big)$
$\lambda_{kv} = (1-\rho_e)\,\lambda_{kv} + \rho_e\Big(\eta_v + D\sum_{n=1}^{N}\sum_{t=1}^{T} w_{dn}^{v}\,\varphi_{dnt}\,\zeta_{dtk}\Big)$
Meaning of each parameter: $\tau_0$ slows down the early iterations of the algorithm; $\kappa$ is the rate at which old values of the topic parameters are forgotten So it depends on the dataset Usually, we set $\tau_0 = 1.0,\ \kappa = 0.7$ JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
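All three blended updates above share one shape: new value = (1 − ρ_e)·old + ρ_e·(prior + D·mini-batch statistic). A minimal NumPy sketch of that pattern for the topic-word parameter λ (the function name and toy numbers are my own, not from the talk):

```python
import numpy as np

def online_update(lam, suff_stats, eta, D, rho):
    """Stochastic blend of old parameter and mini-batch estimate.

    lam        : current variational parameter, shape (V,)
    suff_stats : expected word counts from one mini-batch, shape (V,)
    eta        : symmetric Dirichlet prior on words
    D          : corpus size, scales the mini-batch up to the full corpus
    rho        : learning rate rho_e = (tau0 + e)^(-kappa)
    """
    return (1.0 - rho) * lam + rho * (eta + D * suff_stats)

lam = np.ones(5)
stats = np.array([0.2, 0.0, 0.1, 0.0, 0.0])
rho = (1.0 + 1) ** -0.7            # tau0 = 1.0, kappa = 0.7, epoch 1
lam = online_update(lam, stats, eta=0.01, D=1000, rho=rho)
```

With rho = 1 the old value is discarded entirely; with rho near 0 the mini-batch barely moves the parameter, which is why the decaying schedule stabilizes the later epochs.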
- 181. Mini-batch size When the mini-batch size is large, distributed online HDP runs faster Perplexity stays similar across mini-batch sizes JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121
- 182. Summary Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes Chinese Restaurant Franchise Stick Breaking Construction Posterior Inference for HDP Gibbs Sampling Variational Inference Online Learning Slides and other materials are uploaded in http://uilab.kaist.ac.kr/members/jinyeongbak Implementations are updated in http://github.com/NoSyu/Topic_Models JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121
- 183. Further Reading Dirichlet Process Dirichlet Process Dirichlet distribution and Dirichlet Process + Indian Buffet Process Bayesian Nonparametric model Machine Learning Summer School - Yee Whye Teh Machine Learning Summer School - Peter Orbanz Introductory article Inference MCMC Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121
- 184. Thank You! JinYeong Bak jy.bak@kaist.ac.kr, linkedin.com/in/jybak Users & Information Lab, KAIST JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121
- 185. References I Charles E Antoniak, Mixtures of dirichlet processes with applications to bayesian nonparametric problems, The annals of statistics (1974), 1152–1174. Amol Kapila Bela A. Frigyik and Maya R. Gupta, Introduction to the dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010. Christopher M Bishop and Nasser M Nasrabadi, Pattern recognition and machine learning, vol. 1, springer New York, 2006. David M Blei, Andrew Y Ng, and Michael I Jordan, Latent dirichlet allocation, the Journal of machine Learning research 3 (2003), 993–1022. Emily B Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky, An hdp-hmm for systems with state persistence, Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 312–319. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121
- 186. References II Peter D Hoff, A first course in bayesian statistical methods, Springer, 2009. Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul, An introduction to variational methods for graphical models, Springer, 1998. Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the fourth ACM international conference on Web search and data mining (New York, NY, USA), WSDM ’11, ACM, 2011, pp. 815–824. Radford M Neal, Markov chain sampling methods for dirichlet process mixture models, Journal of computational and graphical statistics 9 (2000), no. 2, 249–265. Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei, Hierarchical dirichlet processes, Journal of the american statistical association 101 (2006), no. 476. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121
- 187. References III Chong Wang, John W Paisley, and David M Blei, Online variational inference for the hierarchical dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121
- 188. Images source I http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm http://www.flickr.com/photos/autumn2may/3965964418/ http://www.flickr.com/photos/ppix/1802571058/ http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871 http://www.flickr.com/photos/jwight/2710392971/ http://www.flickr.com/photos/jasohill/2511594886/ http://en.wikipedia.org/wiki/Kim_Yuna http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29 http://en.wikipedia.org/wiki/Gangnam_Style JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 121 / 121
- 189. Measurable space (Ω,B) Def) A set considered together with the σ-algebra on the set⁶ Ω: the set of all outcomes, the sample space B: σ-algebra over Ω A special kind of collection of subsets of the sample space Ω Closed under complement: if A ∈ B, then Aᶜ ∈ B Closed under countable unions and intersections: if A, B ∈ B, then A∪B and A∩B are also in B A collection of events Property Smallest possible σ-algebra: {Ω, ∅} Largest possible σ-algebra: the power set ⁶ http://mathworld.wolfram.com/MeasurableSpace.html JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121
- 191. Proof 1 Decimative property: Let $(\theta_1,\theta_2,\ldots,\theta_K) \sim \mathrm{Dir}(\alpha_1,\alpha_2,\ldots,\alpha_K)$ and $(\tau_1,\tau_2) \sim \mathrm{Dir}(\alpha_1\beta_1,\alpha_1\beta_2)$ where $\beta_1+\beta_2=1$; then $(\theta_1\tau_1,\theta_1\tau_2,\theta_2,\ldots,\theta_K) \sim \mathrm{Dir}(\alpha_1\beta_1,\alpha_1\beta_2,\alpha_2,\ldots,\alpha_K)$. Then $\big(\beta_1,(1-\beta_1)G'(A_1),\ldots,(1-\beta_1)G'(A_R)\big) \sim \mathrm{Dir}\big(1,\alpha_0 G_0(A_1),\ldots,\alpha_0 G_0(A_R)\big)$ changes to $\big(G'(A_1),\ldots,G'(A_R)\big) \sim \mathrm{Dir}\big(\alpha_0 G_0(A_1),\ldots,\alpha_0 G_0(A_R)\big)$, i.e. $G' \sim \mathrm{DP}(\alpha_0, G_0)$, using the decimative property with $\alpha_1 = \alpha_0$, $\theta_1 = (1-\beta_1)$, $\beta_k = G_0(A_k)$, $\tau_k = G'(A_k)$ JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121
