Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

These are the presentation slides from the Machine Learning Summer School in Korea.
http://prml.yonsei.ac.kr/
I talked about the Dirichlet distribution, the Dirichlet process, and HDP.


Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes

  1. 1. Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes JinYeong Bak Department of Computer Science KAIST, Daejeon, South Korea jy.bak@kaist.ac.kr August 22, 2013 Part of these slides is adapted from a presentation by Yee Whye Teh (y.w.teh@stats.ox.ac.uk). JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 1 / 121
  2. 2. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 2 / 121
  3. 3. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 3 / 121
  4. 4. Introduction Bayesian topic models Latent Dirichlet Allocation (LDA) [BNJ03] Hierarchical Dirichlet Processes (HDP) [TJBB06] In this talk, Dirichlet distribution, Dirichlet process Concept of Hierarchical Dirichlet Processes (HDP) How to infer the latent variables in HDP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 4 / 121
  5. 5. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 5 / 121
  6. 6. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  7. 7. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  8. 8. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  9. 9. Motivation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 6 / 121
  10. 10. Motivation What are the topics discussed in the article? How can we describe the topics? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 7 / 121
  11. 11. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 8 / 121
  12. 12. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  13. 13. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  14. 14. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  15. 15. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 9 / 121
  16. 16. Topic Modeling Each topic has word distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 10 / 121
  17. 17. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  18. 18. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  19. 19. Topic Modeling Each document has topic proportion Each word has its own topic index JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 11 / 121
  20. 20. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  21. 21. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  22. 22. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  23. 23. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  24. 24. Topic Modeling JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 12 / 121
  25. 25. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  26. 26. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  27. 27. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
  28. 28. Latent Dirichlet Allocation Generative process of LDA For each topic k ∈ {1,...,K}: Draw word distributions βk ∼ Dir(η) For each document d ∈ {1,...,D}: Draw topic proportions θd ∼ Dir(α) For each word in a document n ∈ {1,...,N}: Draw a topic index zdn ∼ Mult(θ) Generate word from chosen topic wdn ∼ Mult(βzdn ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 13 / 121
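The generative process above can be simulated directly. Below is a minimal sketch in Python/NumPy; the corpus sizes, vocabulary size, and hyperparameter values are arbitrary choices for illustration, not part of the original slides.

```python
import numpy as np

# Illustrative sizes and hyperparameters (not from the slides)
K, D, N, V = 3, 5, 20, 10      # topics, documents, words per document, vocabulary size
alpha, eta = 0.5, 0.1          # Dirichlet hyperparameters
rng = np.random.default_rng(0)

# For each topic k: draw word distribution beta_k ~ Dir(eta)
beta = rng.dirichlet(np.full(V, eta), size=K)                # shape (K, V)

corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))               # topic proportions of document d
    z_d = rng.choice(K, size=N, p=theta_d)                   # topic index for each word
    w_d = np.array([rng.choice(V, p=beta[z]) for z in z_d])  # word drawn from the chosen topic
    corpus.append(w_d)

print(corpus[0])   # word ids of the first document
```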
  29. 29. Latent Dirichlet Allocation Our interests What are the topics discussed in the article? How can we describe the topics? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 14 / 121
  30. 30. Latent Dirichlet Allocation What we can see Words in documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 15 / 121
  31. 31. Latent Dirichlet Allocation What we want to see JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 16 / 121
  32. 32. Latent Dirichlet Allocation Our interests What are the topics discussed in the article? => Topic proportion of each document How can we describe the topics? => Word distribution of each topic JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 17 / 121
  33. 33. Latent Dirichlet Allocation What we can see: w What we want to see: θ,z,β ∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) / p(w|α,η) But this posterior is intractable to compute (because of the normalization term p(w|α,η)) So we use approximate inference methods Gibbs Sampling Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 18 / 121
  34. 34. Latent Dirichlet Allocation What we can see: w What we want to see: θ,z,β ∴ Compute p(θ,z,β|w,α,η) = p(θ,z,β,w|α,η) / p(w|α,η) But this posterior is intractable to compute (because of the normalization term p(w|α,η)) So we use approximate inference methods Gibbs Sampling Variational Inference JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 18 / 121
  35. 35. Limitation of Latent Dirichlet Allocation Latent Dirichlet Allocation is a parametric model People must specify the number of topics in a corpus People must search for the best number of topics Q) Can we get it from the data automatically? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 19 / 121
  36. 36. Limitation of Latent Dirichlet Allocation Latent Dirichlet Allocation is a parametric model People must specify the number of topics in a corpus People must search for the best number of topics Q) Can we get it from the data automatically? A) Hierarchical Dirichlet Processes JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 20 / 121
  37. 37. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 21 / 121
  38. 38. Dice modeling Think about the probability of each number from a die Each die has its own pmf According to the textbook, it is widely assumed to be uniform => 1/6 for a 6-sided die Is it true? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 22 / 121
  39. 39. Dice modeling Think about the probability of each number from a die Each die has its own pmf According to the textbook, it is widely assumed to be uniform => 1/6 for a 6-sided die Is it true? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 22 / 121
  40. 40. Dice modeling Think about the probability of each number from a die According to the textbook, it is widely assumed to be uniform. => 1/6 for a 6-sided die Is it true? Ans) No! JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 23 / 121
  41. 41. Dice modeling We should model the randomness of the pmf for each die How can we do that? Let's imagine a bag which holds many dice We cannot see inside the bag We can draw one die from the bag OK, but what is the formal description? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 24 / 121
  42. 42. Dice modeling We should model the randomness of the pmf for each die How can we do that? Let's imagine a bag which holds many dice We cannot see inside the bag We can draw one die from the bag OK, but what is the formal description? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 24 / 121
  43. 43. Standard Simplex A generalization of the notion of a triangle or tetrahedron All points are non-negative and sum to 1 1 A pmf can be thought of as a point in the standard simplex Ex) A point p = (x,y,z), where x ≥ 0,y ≥ 0,z ≥ 0 and x +y +z = 1 1 http://en.wikipedia.org/wiki/Simplex JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 25 / 121
  44. 44. Standard Simplex A generalization of the notion of a triangle or tetrahedron All points are non-negative and sum to 1 1 A pmf can be thought of as a point in the standard simplex Ex) A point p = (x,y,z), where x ≥ 0,y ≥ 0,z ≥ 0 and x +y +z = 1 1 http://en.wikipedia.org/wiki/Simplex JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 25 / 121
  45. 45. Dirichlet distribution Definition [BN06] A probability distribution over the (K −1) dimensional standard simplex A distribution over pmfs of length K Notation θ ∼ Dir(α) where θ = [θ1,...,θK ] is a random pmf, α = [α1,...,αK ] Probability density function p(θ;α) = [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 26 / 121
  46. 46. Dirichlet distribution Definition [BN06] A probability distribution over the (K −1) dimensional standard simplex A distribution over pmfs of length K Notation θ ∼ Dir(α) where θ = [θ1,...,θK ] is a random pmf, α = [α1,...,αK ] Probability density function p(θ;α) = [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 26 / 121
  47. 47. Dirichlet distribution Definition [BN06] A probability distribution over the (K −1) dimensional standard simplex A distribution over pmfs of length K Notation θ ∼ Dir(α) where θ = [θ1,...,θK ] is a random pmf, α = [α1,...,αK ] Probability density function p(θ;α) = [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 26 / 121
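A quick way to build intuition for Dir(α) is to draw a few pmfs for different α. The sketch below uses NumPy; the α values are chosen only to illustrate the sparse, uniform, and concentrated regimes and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Each draw is a pmf of length K = 3, i.e. a point on the 2-dimensional simplex
for alpha in ([0.1, 0.1, 0.1],      # mass near the corners: sparse pmfs
              [1.0, 1.0, 1.0],      # uniform over the simplex
              [10.0, 10.0, 10.0]):  # mass near the centre: near-uniform pmfs
    theta = rng.dirichlet(alpha, size=4)
    print(alpha, theta.round(3), theta.sum(axis=1))   # each row sums to 1
```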
  48. 48. Latent Dirichlet Allocation JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 27 / 121
  49. 49. Property of Dirichlet distribution Density plots [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 28 / 121
  50. 50. Property of Dirichlet distribution Sample pmfs from Dirichlet distribution [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 29 / 121
  51. 51. Property of Dirichlet distribution When K = 2, it is the Beta distribution Conjugate prior for the Multinomial distribution Likelihood X ∼ Mult(n,θ), Prior θ ∼ Dir(α) ∴ Posterior (θ|X) ∼ Dir(α + n) Proof) p(θ|X) = p(X|θ)p(θ) / p(X) ∝ p(X|θ)p(θ) = [n! / (x1!···xK!)] ∏_{k=1}^K θk^(xk) · [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1) = C ∏_{k=1}^K θk^(αk + xk − 1) ⇒ (θ|X) ∼ Dir(α + n) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 30 / 121
  52. 52. Property of Dirichlet distribution When K = 2, it is the Beta distribution Conjugate prior for the Multinomial distribution Likelihood X ∼ Mult(n,θ), Prior θ ∼ Dir(α) ∴ Posterior (θ|X) ∼ Dir(α + n) Proof) p(θ|X) = p(X|θ)p(θ) / p(X) ∝ p(X|θ)p(θ) = [n! / (x1!···xK!)] ∏_{k=1}^K θk^(xk) · [Γ(∑_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K θk^(αk − 1) = C ∏_{k=1}^K θk^(αk + xk − 1) ⇒ (θ|X) ∼ Dir(α + n) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 30 / 121
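The conjugacy result can be checked numerically: after observing multinomial counts n, the posterior is Dir(α + n). Below is a small importance-sampling check with NumPy; the prior parameters and the counts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([2.0, 1.0, 1.0])     # prior Dir(alpha)
counts = np.array([5, 0, 3])          # observed multinomial counts n

# Posterior is Dir(alpha + n); its mean is (alpha + n) / sum(alpha + n)
posterior_mean = (alpha + counts) / (alpha + counts).sum()

# Monte Carlo check: weight prior draws by the multinomial likelihood
theta = rng.dirichlet(alpha, size=200_000)
weights = np.prod(theta ** counts, axis=1)
mc_mean = (theta * weights[:, None]).sum(axis=0) / weights.sum()
print(posterior_mean, mc_mean)        # the two estimates should be close
```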
  53. 53. Property of Dirichlet distribution Aggregation property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then (θ1 +θ2,...,θK ) ∼ Dir(α1 +α2,...,αK ) In general, if {A1,...,AR} is any partition of {1,...,K}, then (∑k∈A1 θk ,...,∑k∈AR θk ) ∼ Dir(∑k∈A1 αk ,...,∑k∈AR αk ) Decimative property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 +β2 = 1, then (θ1τ1,θ1τ2,θ2,...,θK ) ∼ Dir(α1β1,α1β2,α2,...,αK ) Neutrality property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then θk is independent of the vector 1 1−θk (θ1,θ2,...,θk−1,θk+1,...,θK ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
  54. 54. Property of Dirichlet distribution Aggregation property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then (θ1 +θ2,...,θK ) ∼ Dir(α1 +α2,...,αK ) In general, if {A1,...,AR} is any partition of {1,...,K}, then (∑k∈A1 θk ,...,∑k∈AR θk ) ∼ Dir(∑k∈A1 αk ,...,∑k∈AR αk ) Decimative property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 +β2 = 1, then (θ1τ1,θ1τ2,θ2,...,θK ) ∼ Dir(α1β1,α1β2,α2,...,αK ) Neutrality property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then θk is independent of the vector 1 1−θk (θ1,θ2,...,θk−1,θk+1,...,θK ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
  55. 55. Property of Dirichlet distribution Aggregation property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then (θ1 +θ2,...,θK ) ∼ Dir(α1 +α2,...,αK ) In general, if {A1,...,AR} is any partition of {1,...,K}, then (∑k∈A1 θk ,...,∑k∈AR θk ) ∼ Dir(∑k∈A1 αk ,...,∑k∈AR αk ) Decimative property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 +β2 = 1, then (θ1τ1,θ1τ2,θ2,...,θK ) ∼ Dir(α1β1,α1β2,α2,...,αK ) Neutrality property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then θk is independent of the vector 1 1−θk (θ1,θ2,...,θk−1,θk+1,...,θK ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
  56. 56. Property of Dirichlet distribution Aggregation property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then (θ1 +θ2,...,θK ) ∼ Dir(α1 +α2,...,αK ) In general, if {A1,...,AR} is any partition of {1,...,K}, then (∑k∈A1 θk ,...,∑k∈AR θk ) ∼ Dir(∑k∈A1 αk ,...,∑k∈AR αk ) Decimative property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) and (τ1,τ2) ∼ Dir(α1β1,α1β2) where β1 +β2 = 1, then (θ1τ1,θ1τ2,θ2,...,θK ) ∼ Dir(α1β1,α1β2,α2,...,αK ) Neutrality property Let (θ1,θ2,...,θK ) ∼ Dir(α1,α2,...,αK ) then θk is independent of the vector 1 1−θk (θ1,θ2,...,θk−1,θk+1,...,θK ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 31 / 121
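The aggregation property is also easy to verify by simulation: summing two coordinates of a Dirichlet draw gives a draw from a Dirichlet whose corresponding α's are summed. A sketch with made-up α values, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
theta = rng.dirichlet(alpha, size=100_000)

# Aggregation: (theta_1 + theta_2, theta_3, theta_4) ~ Dir(alpha_1 + alpha_2, alpha_3, alpha_4)
merged = np.column_stack([theta[:, 0] + theta[:, 1], theta[:, 2], theta[:, 3]])
direct = rng.dirichlet([alpha[0] + alpha[1], alpha[2], alpha[3]], size=100_000)
print(merged.mean(axis=0), direct.mean(axis=0))  # empirical means should agree
print(merged.var(axis=0), direct.var(axis=0))    # and so should the variances
```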
  57. 57. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 32 / 121
  58. 58. Dice modeling Think about the probability of each number from a die Each die has its own pmf Draw a die from the bag Problem) We do not know the number of faces of the dice in the bag Solution) Dirichlet process JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 33 / 121
  59. 59. Dice modeling Think about the probability of each number from a die Each die has its own pmf Draw a die from the bag Problem) We do not know the number of faces of the dice in the bag Solution) Dirichlet process JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 33 / 121
  60. 60. Dirichlet Process Definition [BAFG10] A distribution over probability measures A distribution whose realizations are distribution over any sample space Formal definition (Ω,B) is a measurable space G0 is a distribution over sample space Ω α0 is a positive real number G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any finite measurable partition (A1,...,AR) of Ω (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 34 / 121
  61. 61. Dirichlet Process Definition [BAFG10] A distribution over probability measures A distribution whose realizations are distribution over any sample space Formal definition (Ω,B) is a measurable space G0 is a distribution over sample space Ω α0 is a positive real number G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any finite measurable partition (A1,...,AR) of Ω (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 34 / 121
  62. 62. Posterior Dirichlet Processes G ∼ DP(α0,G0) can be treated as a random distribution over Ω We can draw a sample θ1 from G We can also make a finite partition, (A1,...,AR) of Ω then p(θ1 ∈ Ar |G) = G(Ar ), p(θ1 ∈ Ar ) = G0(Ar ) (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) Using Dirichlet-multinomial conjugacy, the posterior is (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ (Ar ) = 1 if θ ∈ Ar and 0 otherwise This holds for every finite partition of Ω JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
  63. 63. Posterior Dirichlet Processes G ∼ DP(α0,G0) can be treated as a random distribution over Ω We can draw a sample θ1 from G We can also make a finite partition, (A1,...,AR) of Ω then p(θ1 ∈ Ar |G) = G(Ar ), p(θ1 ∈ Ar ) = G0(Ar ) (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) Using Dirichlet-multinomial conjugacy, the posterior is (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ (Ar ) = 1 if θ ∈ Ar and 0 otherwise This holds for every finite partition of Ω JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
  64. 64. Posterior Dirichlet Processes G ∼ DP(α0,G0) can be treated as a random distribution over Ω We can draw a sample θ1 from G We can also make a finite partition, (A1,...,AR) of Ω then p(θ1 ∈ Ar |G) = G(Ar ), p(θ1 ∈ Ar ) = G0(Ar ) (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) Using Dirichlet-multinomial conjugacy, the posterior is (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ (Ar ) = 1 if θ ∈ Ar and 0 otherwise This holds for every finite partition of Ω JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
  65. 65. Posterior Dirichlet Processes G ∼ DP(α0,G0) can be treated as a random distribution over Ω We can draw a sample θ1 from G We can also make a finite partition, (A1,...,AR) of Ω then p(θ1 ∈ Ar |G) = G(Ar ), p(θ1 ∈ Ar ) = G0(Ar ) (G(A1),...,G(AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) Using Dirichlet-multinomial conjugacy, the posterior is (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ (Ar ) = 1 if θ ∈ Ar and 0 otherwise This holds for every finite partition of Ω JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 35 / 121
  66. 66. Posterior Dirichlet Processes For every finite partition of Ω, (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ1 (Ar ) = 1 if θ1 ∈ Ar and 0 otherwise The posterior process is also a Dirichlet process G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Summary) θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 36 / 121
  67. 67. Posterior Dirichlet Processes For every finite partition of Ω, (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ1 (Ar ) = 1 if θ1 ∈ Ar and 0 otherwise The posterior process is also a Dirichlet process G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Summary) θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 36 / 121
  68. 68. Posterior Dirichlet Processes For every finite partition of Ω, (G(A1),...,G(AR))|θ1 ∼Dir(α0G0(A1)+δθ1 (A1),...,α0G0(AR)+δθ1 (AR)) where δθ1 (Ar ) = 1 if θ1 ∈ Ar and 0 otherwise The posterior process is also a Dirichlet process G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Summary) θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 36 / 121
  69. 69. Blackwell-MacQueen Urn Scheme Now we draw a sample θ1,...,θN First sample θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Second sample θ2|θ1,G ∼ G G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) ⇐⇒ θ2|θ1 ∼ α0G0 +δθ1 α0 +1 G|θ1,θ2 ∼ DP(α0 +2, α0G0 +δθ1 +δθ2 α0 +2 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 37 / 121
  70. 70. Blackwell-MacQueen Urn Scheme Now we draw a sample θ1,...,θN First sample θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Second sample θ2|θ1,G ∼ G G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) ⇐⇒ θ2|θ1 ∼ α0G0 +δθ1 α0 +1 G|θ1,θ2 ∼ DP(α0 +2, α0G0 +δθ1 +δθ2 α0 +2 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 37 / 121
  71. 71. Blackwell-MacQueen Urn Scheme Now we draw a sample θ1,...,θN First sample θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Second sample θ2|θ1,G ∼ G G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) ⇐⇒ θ2|θ1 ∼ α0G0 +δθ1 α0 +1 G|θ1,θ2 ∼ DP(α0 +2, α0G0 +δθ1 +δθ2 α0 +2 ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 37 / 121
  72. 72. Blackwell-MacQueen Urn Scheme Nth sample θN|θ1,...,N−1,G ∼ G G|θ1,...,N−1 ∼ DP(α0 +N −1, α0G0 +∑N−1 n=1 δθn α0 +N −1 ) ⇐⇒ θN|θ1,...,N−1 ∼ α0G0 +∑N−1 n=1 δθn α0 +N −1 G|θ1,...,N ∼ DP(α0 +N, α0G0 +∑N n=1 δθn α0 +N ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 38 / 121
  73. 73. Blackwell-MacQueen Urn Scheme Blackwell-MacQueen urn scheme produces a sequence θ1,θ2,... with the following conditionals θN|θ1,...,N−1 ∼ (α0G0 + ∑_{n=1}^{N−1} δθn) / (α0 + N − 1) As a Polya urn analogy Infinite number of ball colors Empty urn Filling the Polya urn (n starts at 1) With probability proportional to α0, pick a new color from the set of infinite ball colors G0, paint a new ball that color, and add it to the urn With probability proportional to n − 1, pick a ball from the urn, record its color, and put it back into the urn with another ball of the same color JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 39 / 121
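The Polya-urn description translates directly into a sampler: the N-th draw is a fresh value from G0 with probability α0/(α0 + N − 1), and otherwise repeats one of the earlier draws chosen uniformly, so each previous value recurs with probability proportional to its count. A sketch; the standard normal base measure and the α0 value are illustrative assumptions.

```python
import numpy as np

def blackwell_macqueen(n_samples, alpha0, base_draw, rng):
    """Draw theta_1..theta_N from the Blackwell-MacQueen urn scheme."""
    thetas = []
    for n in range(n_samples):
        if rng.random() < alpha0 / (alpha0 + n):     # n previous draws so far
            thetas.append(base_draw(rng))            # new value from G0
        else:
            thetas.append(thetas[rng.integers(n)])   # repeat a past value, chosen uniformly
    return thetas

rng = np.random.default_rng(4)
draws = blackwell_macqueen(50, alpha0=2.0, base_draw=lambda r: r.normal(), rng=rng)
print(len(set(draws)), "distinct values among", len(draws), "draws")
```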
  74. 74. Chinese Restaurant Process Draw θ1,θ2,...,θN from a Blackwell-MacQueen Urn Scheme With probability α0, pick a new color from the set of infinite ball colors G0, and paint a new ball that color and add it to urn With probability n −1, pick a ball from urn record its color, and put it back to urn with another ball of the same color θs can take same values, θi = θj There are K < N distinct values, φ1,...,φK It works as partition of Ω θ1,θ2,...,θN induces to φ1,...,φK The distribution over partitions is called the Chinese Restaurant Process (CRP) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 40 / 121
  75. 75. Chinese Restaurant Process Draw θ1,θ2,...,θN from a Blackwell-MacQueen Urn Scheme With probability α0, pick a new color from the set of infinite ball colors G0, and paint a new ball that color and add it to urn With probability n −1, pick a ball from urn record its color, and put it back to urn with another ball of the same color θs can take same values, θi = θj There are K < N distinct values, φ1,...,φK It works as partition of Ω θ1,θ2,...,θN induces to φ1,...,φK The distribution over partitions is called the Chinese Restaurant Process (CRP) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 40 / 121
  76. 76. Chinese Restaurant Process θ1,θ2,...,θN induces to φ1,...,φK Chinese Restaurant Process interpretation There is a Chinese Restaurant which has infinite tables Each customer sits at a table Generating from the Chinese Restaurant Process First customer sits at the first table n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability nk α0+n−1 , where nk is the number of customers at table k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
  77. 77. Chinese Restaurant Process θ1,θ2,...,θN induces to φ1,...,φK Chinese Restaurant Process interpretation There is a Chinese Restaurant which has infinite tables Each customer sits at a table Generating from the Chinese Restaurant Process First customer sits at the first table n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability nk α0+n−1 , where nk is the number of customers at table k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
  78. 78. Chinese Restaurant Process θ1,θ2,...,θN induces to φ1,...,φK Chinese Restaurant Process interpretation There is a Chinese Restaurant which has infinite tables Each customer sits at a table Generating from the Chinese Restaurant Process First customer sits at the first table n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability nk α0+n−1 , where nk is the number of customers at table k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
  79. 79. Chinese Restaurant Process θ1,θ2,...,θN induces to φ1,...,φK Chinese Restaurant Process interpretation There is a Chinese Restaurant which has infinite tables Each customer sits at a table Generating from the Chinese Restaurant Process First customer sits at the first table n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability nk α0+n−1 , where nk is the number of customers at table k JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 41 / 121
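The same process, written over table assignments rather than ball colors, gives a Chinese Restaurant Process sampler. A minimal sketch with an illustrative α0.

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha0, rng):
    """Return a table index for each customer under CRP(alpha0)."""
    tables = []        # tables[k] = number of customers at table k
    seating = []
    for n in range(n_customers):
        probs = np.array(tables + [alpha0], dtype=float)
        probs /= alpha0 + n                      # n customers already seated
        k = rng.choice(len(probs), p=probs)      # last index means "new table"
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
        seating.append(k)
    return seating, tables

rng = np.random.default_rng(5)
seating, tables = chinese_restaurant_process(100, alpha0=1.0, rng=rng)
print(len(tables), "tables, sizes", tables)
```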
  80. 80. Chinese Restaurant Process The CRP exhibits the clustering property of DP Tables are clusters, φk ∼ G0 Customers are the actual realizations, θn = φzn where zn ∈ {1,...,K} JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 42 / 121
  81. 81. Stick Breaking Construction Blackwell-MacQueen Urn Scheme / CRP generates θ ∼ G, not G itself To construct G, we use Stick Breaking Construction Review) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) ∼ Dir((α0 +1) α0G0 +δθ1 α0 +1 (θ1),(α0 +1) α0G0 +δθ1 α0 +1 (Ωθ1)) = Dir(1,α0) = Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 43 / 121
  82. 82. Stick Breaking Construction Blackwell-MacQueen Urn Scheme / CRP generates θ ∼ G, not G itself To construct G, we use Stick Breaking Construction Review) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) ∼ Dir((α0 +1) α0G0 +δθ1 α0 +1 (θ1),(α0 +1) α0G0 +δθ1 α0 +1 (Ωθ1)) = Dir(1,α0) = Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 43 / 121
  83. 83. Stick Breaking Construction Blackwell-MacQueen Urn Scheme / CRP generates θ ∼ G, not G itself To construct G, we use Stick Breaking Construction Review) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) ∼ Dir((α0 +1) α0G0 +δθ1 α0 +1 (θ1),(α0 +1) α0G0 +δθ1 α0 +1 (Ωθ1)) = Dir(1,α0) = Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 43 / 121
  84. 84. Stick Breaking Construction Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) = (β1,1 −β1) ∼ Beta(1,α0) G has a point mass located at θ1 G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) where G is the probability measure with the point mass θ1 removed What is G ? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 44 / 121
  85. 85. Stick Breaking Construction Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) = (β1,1 −β1) ∼ Beta(1,α0) G has a point mass located at θ1 G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) where G is the probability measure with the point mass θ1 removed What is G ? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 44 / 121
  86. 86. Stick Breaking Construction Consider a partition (θ1,Ωθ1) of Ω. Then (G(θ1),G(Ωθ1)) = (β1,1 −β1) ∼ Beta(1,α0) G has a point mass located at θ1 G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) where G is the probability measure with the point mass θ1 removed What is G ? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 44 / 121
  87. 87. Stick Breaking Construction Summary) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) Consider a further partition (θ1,A1,...,AR) of Ω (G(θ1),G(A1),...,G(AR)) = (β1,(1 −β1)G (A1),...,(1 −β1)G (AR)) ∼ Dir(1,α0G0(A1),...,α0G0(AR)) Using decimative property of Dirichlet distribution (proof) (G (A1),...,G (AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) G ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 45 / 121
  88. 88. Stick Breaking Construction Summary) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) Consider a further partition (θ1,A1,...,AR) of Ω (G(θ1),G(A1),...,G(AR)) = (β1,(1 −β1)G (A1),...,(1 −β1)G (AR)) ∼ Dir(1,α0G0(A1),...,α0G0(AR)) Using decimative property of Dirichlet distribution (proof) (G (A1),...,G (AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) G ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 45 / 121
  89. 89. Stick Breaking Construction Summary) Posterior Dirichlet Processes θ1|G ∼ G G ∼ DP(α0,G0) ⇐⇒ θ1 ∼ G0 G|θ1 ∼ DP(α0 +1, α0G0 +δθ1 α0 +1 ) G = β1δθ1 +(1 −β1)G β1 ∼ Beta(1,α0) Consider a further partition (θ1,A1,...,AR) of Ω (G(θ1),G(A1),...,G(AR)) = (β1,(1 −β1)G (A1),...,(1 −β1)G (AR)) ∼ Dir(1,α0G0(A1),...,α0G0(AR)) Using decimative property of Dirichlet distribution (proof) (G (A1),...,G (AR)) ∼ Dir(α0G0(A1),...,α0G0(AR)) G ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 45 / 121
  90. 90. Stick Breaking Construction Do this repeatedly with distinct values, φ1,φ2,··· G ∼ DP(α0,G0) G = β1δφ1 +(1 −β1)G1 G = β1δφ1 +(1 −β1)(β2δφ2 +(1 −β2)G2) ... G = ∞ ∑ k=1 πk δφk where πk = βk k−1 ∏ i=1 (1 −βi ), ∞ ∑ k=1 πk = 1 βk ∼ Beta(1,α0) φk ∼ G0 Draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction. JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 46 / 121
  91. 91. Stick Breaking Construction Summary) G = ∞ ∑ k=1 πk δφk πk = βk k−1 ∏ i=1 (1 −βi ), ∞ ∑ k=1 πk = 1 βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 47 / 121
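In practice, a draw G ~ DP(α0, G0) is often approximated by truncating the stick-breaking construction after T sticks. A sketch below; the base measure (a standard normal), α0, and the truncation level are placeholders chosen for illustration.

```python
import numpy as np

def stick_breaking(alpha0, base_draw, truncation, rng):
    """Truncated stick-breaking approximation to G ~ DP(alpha0, G0)."""
    betas = rng.beta(1.0, alpha0, size=truncation)               # beta_k ~ Beta(1, alpha0)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                                  # pi_k = beta_k * prod_{i<k}(1 - beta_i)
    atoms = np.array([base_draw(rng) for _ in range(truncation)])  # phi_k ~ G0
    return weights, atoms

rng = np.random.default_rng(6)
weights, atoms = stick_breaking(alpha0=2.0, base_draw=lambda r: r.normal(),
                                truncation=100, rng=rng)
print(weights.sum())   # close to 1 for a large enough truncation
```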
  92. 92. Summary of DP Definition G is a random probability measure over (Ω,B) G ∼ DP(α0,G0) if for any finite measurable partition (A1,...,Ar ) of Ω (G(A1),...,G(Ar )) ∼ Dir(α0G0(A1),...,α0G0(Ar )) Chinese Restaurant Process Stick Breaking Construction JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 48 / 121
  93. 93. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 49 / 121
  94. 94. Dirichlet Process Mixture Models We model a data set x1,...,xN using the following model [Nea00] xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters modelled using a DP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 50 / 121
  95. 95. Dirichlet Process Mixture Models We model a data set x1,...,xN using the following model [Nea00] xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) Each θn is a latent parameter modelling xn, while G is the unknown distribution over parameters modelled using a DP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 50 / 121
  96. 96. Dirichlet Process Mixture Models Since G is of the form G = ∞ ∑ k=1 πk δφk We have θn = φk with probability πk Let kn take on value k with probability πk . We can equivalently define θn = φkn An equivalent model xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
  97. 97. Dirichlet Process Mixture Models Since G is of the form G = ∞ ∑ k=1 πk δφk We have θn = φk with probability πk Let kn take on value k with probability πk . We can equivalently define θn = φkn An equivalent model xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
  98. 98. Dirichlet Process Mixture Models Since G is of the form G = ∞ ∑ k=1 πk δφk We have θn = φk with probability πk Let kn take on value k with probability πk . We can equivalently define θn = φkn An equivalent model xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 51 / 121
  99. 99. Dirichlet Process Mixture Models ⇐⇒ xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 52 / 121
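Putting the stick-breaking weights together with a likelihood F gives the generative side of a DP mixture. The sketch below uses a unit-variance Gaussian F and a wide Gaussian base measure; these choices, and the truncation level, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha0, truncation, n_points = 2.0, 50, 500

# G = sum_k pi_k * delta_{phi_k} via truncated stick breaking
betas = rng.beta(1.0, alpha0, size=truncation)
pi = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
pi /= pi.sum()                               # renormalise the truncated weights
phi = rng.normal(0.0, 5.0, size=truncation)  # phi_k ~ G0 (here a wide Gaussian)

# x_n ~ F(phi_{k_n}) with p(k_n = k) = pi_k; here F is N(phi_k, 1)
k = rng.choice(truncation, size=n_points, p=pi)
x = rng.normal(phi[k], 1.0)
print("data drawn from", len(np.unique(k)), "active components")
```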
  100. 100. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 53 / 121
  101. 101. Topic modeling with documents Each document consists of bags of words Each word in a document has latent topic index Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 54 / 121
  102. 102. Topic modeling with documents Each document consists of bags of words Each word in a document has latent topic index Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 54 / 121
  103. 103. Problem of Naive Dirichlet Process Mixture Model Use a DP mixture for each document xdn ∼ F(θdn), θdn ∼ Gd , Gd ∼ DP(α0,G0) But there is no sharing of clusters across different groups, because G0 is smooth: the atoms φ1k ,φ2k are distinct with probability one G1 = ∞ ∑ k=1 π1k δφ1k , G2 = ∞ ∑ k=1 π2k δφ2k φ1k ,φ2k ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 55 / 121
  104. 104. Problem of Naive Dirichlet Process Mixture Model Use a DP mixture for each document xdn ∼ F(θdn), θdn ∼ Gd , Gd ∼ DP(α0,G0) But there is no sharing of clusters across different groups, because G0 is smooth: the atoms φ1k ,φ2k are distinct with probability one G1 = ∞ ∑ k=1 π1k δφ1k , G2 = ∞ ∑ k=1 π2k δφ2k φ1k ,φ2k ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 55 / 121
  105. 105. Problem of Naive Dirichlet Process Mixture Model Solution Make the base distribution G0 discrete Put a DP prior on the common base distribution Hierarchical Dirichlet Process G0 ∼ DP(γ,H) G1,G2|G0 ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 56 / 121
  106. 106. Problem of Naive Dirichlet Process Mixture Model Solution Make the base distribution G0 discrete Put a DP prior on the common base distribution Hierarchical Dirichlet Process G0 ∼ DP(γ,H) G1,G2|G0 ∼ DP(α0,G0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 56 / 121
  107. 107. Hierarchical Dirichlet Processes Making G0 discrete forces shared cluster between G1 and G2 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 57 / 121
  108. 108. Stick Breaking Construction A Hierarchical Dirichlet Process with 1,...,D documents G0 ∼ DP(γ,H) Gd |G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∞ ∑ k=1 βk δφk φk ∼ H βk = βk k−1 ∏ i=1 (1 −βl ) βk ∼ Beta(1,γ) Gd = ∞ ∑ k=1 πdk δφk πdk = πdk k−1 ∏ i=1 (1 −πdl ) πdk ∼ Beta(α0βk ,α0(1 − k ∑ i=1 βi )) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 58 / 121
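A corresponding sketch of the two-level construction: break a global stick for G0, then break a document-level stick whose Beta parameters depend on the global weights, so every document reuses the same global atoms. The truncation level, γ, and α0 are illustrative; the small clamping constants only guard against numerically zero Beta parameters under truncation.

```python
import numpy as np

rng = np.random.default_rng(8)
gamma, alpha0, K, D = 1.0, 1.0, 30, 4       # global truncation K, D documents

# Global weights beta_k for G0 ~ DP(gamma, H)
b = rng.beta(1.0, gamma, size=K)
beta = b * np.concatenate([[1.0], np.cumprod(1.0 - b)[:-1]])

# Document weights: pi'_dk ~ Beta(alpha0*beta_k, alpha0*(1 - sum_{i<=k} beta_i)), then stick-broken
pi = np.zeros((D, K))
tail = 1.0 - np.cumsum(beta)
for d in range(D):
    v = rng.beta(np.maximum(alpha0 * beta, 1e-6),
                 np.maximum(alpha0 * np.maximum(tail, 0.0), 1e-6))
    pi[d] = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])

print(beta[:5].round(3))
print(pi[:, :5].round(3))   # every document places weight on the same global atoms phi_k
```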
  109. 109. Chinese Restaurant Franchise Gd |G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen Urn Scheme θd1,θd2,... induces to φd1,φd2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 59 / 121
  110. 110. Chinese Restaurant Franchise Gd |G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen Urn Scheme θd1,θd2,... induces to φd1,φd2,... Draw θd 1,θd 2,... from a Blackwell-MacQueen Urn Scheme θd 1,θd 2,... induces to φd 1,φd 2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 60 / 121
  111. 111. Chinese Restaurant Franchise G0 ∼ DP(γ,H), φk ∼ H Gd |G0 ∼ DP(α0,G0), θdn ∼ G0 Draw θd1,θd2,... from a Blackwell-MacQueen Urn Scheme θd1,θd2,... induces to φd1,φd2,... Draw θd 1,θd 2,... from a Blackwell-MacQueen Urn Scheme θd 1,θd 2,... induces to φd 1,φd 2,... JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 61 / 121
  112. 112. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinite tables All restaurant share food menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant First customer sits at the first table and choose a new menu n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability ndt α0+n−1 where ndt is the number of customers at table t n-th customer choose A new menu with probability γ γ+m−1 Existing menu with probability mk γ+m−1 where m is the number of tables in all restaurant, mk is the number of chosen menu k in all restaurant JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
  113. 113. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinite tables All restaurant share food menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant First customer sits at the first table and choose a new menu n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability ndt α0+n−1 where ndt is the number of customers at table t n-th customer choose A new menu with probability γ γ+m−1 Existing menu with probability mk γ+m−1 where m is the number of tables in all restaurant, mk is the number of chosen menu k in all restaurant JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
  114. 114. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinite tables All restaurant share food menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant First customer sits at the first table and choose a new menu n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability ndt α0+n−1 where ndt is the number of customers at table t n-th customer choose A new menu with probability γ γ+m−1 Existing menu with probability mk γ+m−1 where m is the number of tables in all restaurant, mk is the number of chosen menu k in all restaurant JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
  115. 115. Chinese Restaurant Franchise Chinese Restaurant Franchise interpretation Each restaurant has infinite tables All restaurant share food menu Each customer sits at a table Generating from the Chinese Restaurant Franchise For each restaurant First customer sits at the first table and choose a new menu n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability ndt α0+n−1 where ndt is the number of customers at table t n-th customer choose A new menu with probability γ γ+m−1 Existing menu with probability mk γ+m−1 where m is the number of tables in all restaurant, mk is the number of chosen menu k in all restaurant JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 62 / 121
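The franchise process can also be simulated directly: within each document, customers pick tables by a CRP with α0, and each newly opened table picks its dish (topic) from the shared menu by a CRP with γ over the per-dish table counts. A sketch with made-up document lengths and hyperparameters.

```python
import numpy as np

def chinese_restaurant_franchise(doc_lengths, alpha0, gamma, rng):
    """Sample table and dish assignments for each 'restaurant' (document)."""
    dish_counts = []                    # m_k: number of tables serving dish k, over all restaurants
    assignments = []
    for n_d in doc_lengths:
        table_counts, table_dish = [], []
        dishes = []
        for n in range(n_d):
            # customer picks an existing table (prob. prop. to its size) or a new one (prop. to alpha0)
            probs = np.array(table_counts + [alpha0], dtype=float)
            t = rng.choice(len(probs), p=probs / probs.sum())
            if t == len(table_counts):                       # new table: pick its dish from the menu
                dprobs = np.array(dish_counts + [gamma], dtype=float)
                k = rng.choice(len(dprobs), p=dprobs / dprobs.sum())
                if k == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            dishes.append(table_dish[t])
        assignments.append(dishes)
    return assignments, dish_counts

rng = np.random.default_rng(9)
assignments, dish_counts = chinese_restaurant_franchise([50, 50, 50], alpha0=1.0, gamma=1.0, rng=rng)
print(len(dish_counts), "shared dishes (topics) across 3 documents")
```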
  116. 116. Chinese Restaurant Franchise JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 63 / 121
  117. 117. HDP for Topic modeling Questions What can we assume about the topics in a document? What can we assume about the words in the topics? Solution Each document consists of bags of words Each word in a document has latent topic Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 64 / 121
  118. 118. HDP for Topic modeling Questions What can we assume about the topics in a document? What can we assume about the words in the topics? Solution Each document consists of bags of words Each word in a document has latent topic Latent topics for words in a document can be grouped Each document has topic proportion Each topic has word distribution Topics must be shared across documents JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 64 / 121
  119. 119. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 65 / 121
  120. 120. Gibbs Sampling Definition A special case of Markov-chain Monte Carlo (MCMC) method An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09] Algorithm Find full conditional distribution of latent variables of target distribution Initialize all latent variables Sampling until converged Sample one latent variable from full conditional distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 66 / 121
  121. 121. Gibbs Sampling Definition A special case of Markov-chain Monte Carlo (MCMC) method An iterative algorithm that constructs a dependent sequence of parameter values whose distribution converges to the target joint posterior distribution [Hof09] Algorithm Find full conditional distribution of latent variables of target distribution Initialize all latent variables Sampling until converged Sample one latent variable from full conditional distribution JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 66 / 121
  122. 122. Collapsed Gibbs sampling Collapsed Gibbs sampling integrates out one or more variables when sampling the remaining variables. Example) There are three latent variables A, B and C. Standard Gibbs samples p(A|B,C), p(B|A,C) and p(C|A,B) sequentially But when we integrate out B, we sample only p(A|C) and p(C|A) sequentially JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 67 / 121
  123. 123. Review) Dirichlet Process Mixture Models ⇐⇒ xn ∼ F(θn) θn ∼ G G ∼ DP(α0,G0) ⇐⇒ xn ∼ F(φkn ) p(kn = k) = πk πk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,α0) φk ∼ G0 JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 68 / 121
  124. 124. Review) Blackwell-MacQueen Urn Scheme for DP Nth sample θN|θ1,...,N−1,G ∼ G G|θ1,...,N−1 ∼ DP(α0 +N −1, α0G0 +∑N−1 n=1 δθn α0 +N −1 ) ⇐⇒ θN|θ1,...,N−1 ∼ α0G0 +∑N−1 n=1 δθn α0 +N −1 G|θ1,...,N ∼ DP(α0 +N, α0G0 +∑N n=1 δθn α0 +N ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 69 / 121
  125. 125. Review) Chinese Restaurant Franchise Generating from the Chinese Restaurant Franchise For each restaurant First customer sits at the first table and choose a new menu n-th customer sits at A new table with probability α0 α0+n−1 Table k with probability ndt α0+n−1 where ndt is the number of customers at table t n-th customer choose A new menu with probability γ γ+m−1 Existing menu with probability mk γ+m−1 where m is the number of tables in all restaurant, mk is the number of chosen menu k in all restaurant JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 70 / 121
  126. 126. Alternative form of HDP G0 ∼ DP(γ,H), φdt ∼ G0 ∴ G0|φdt ,... ∼ DP(γ +m, γH+∑K k=1 mk δφk γ+m ) Then G0 is given as G0 = K ∑ k=1 βk δφk +βuGu where Gu ∼ DP(γ,H) π = (π1,...,πK ,πu) ∼ Dir(m1,...,mK ,γ) p(φk |·) ∝ h(φk ) ∏ dn:zdn=k f(xdn|φk ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 71 / 121
  127. 127. Alternative form of HDP G0 ∼ DP(γ,H), φdt ∼ G0 ∴ G0|φdt ,... ∼ DP(γ +m, γH+∑K k=1 mk δφk γ+m ) Then G0 is given as G0 = K ∑ k=1 βk δφk +βuGu where Gu ∼ DP(γ,H) π = (π1,...,πK ,πu) ∼ Dir(m1,...,mK ,γ) p(φk |·) ∝ h(φk ) ∏ dn:zdn=k f(xdn|φk ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 71 / 121
  128. 128. Hierarchical Dirichlet Processes ⇐⇒ xdn ∼ F(θn) θn ∼ Gd Gd ∼ DP(α0,G0) G0 ∼ DP(γ,H) ⇐⇒ xn ∼ Mult(φzdn ) zdn ∼ Mult(θd ) φk ∼ Dir(η) θd ∼ Dir(α0π) π ∼ Dir(m.1,...,m.K ,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 72 / 121
  129. 129. Gibbs Sampling for HDP Joint distribution p(θ,z,φ,x,π,m|α0,η,γ) = p(π|m,γ) K ∏ k=1 p(φk |η) D ∏ d=1 p(θd |α0,π) N ∏ n=1 p(zdn|θd ) p(xdn|zdn,φ) Integrate out θ,φ p(z,x,π,m|α0,η,γ) = Γ(∑K k=1 m.k +γ) ∏K k=1 Γ(m.k )Γ(γ) K ∏ k=1 πm.k −1 k π γ−1 K+1 K ∏ k=1 Γ(∑V v=1 ηv ) ∏V v=1 Γ(ηv ) ∏V v=1 Γ(ηv +nk (·),v ) Γ(∑V v=1 ηv +nk (·),v ) M ∏ d=1 Γ(∑K k=1 α0πk ) ∏K k=1 Γ(α0πk ) ∏K k=1 Γ(α0πk +nk d,(·)) Γ(∑K k=1 α0πk +nk d,(·)) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 73 / 121
  130. 130. Gibbs Sampling for HDP Full conditional distribution of z p(z(d ,n ) = k |z−(d ,n ) ,m,π,x,·) = p(z(d ,n ) = k ,z−(d ,n ),m,π,x|·) p(z−(d ,n ),m,π,x|·) ∝ p(z(d ,n ) = k ,z−(d ,n ) ,m,π,x|·) ∝ α0πk +n k ,−(d ,n ) d ,(·) (ηv +n k ,−(d ,n ) (·),v ) (∑V v=1 ηv +n k ,−(d ,n ) (·),v ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 74 / 121
  131. 131. Gibbs Sampling for HDP Full conditional distribution of m The probability that word xd n is assigned to some table t such that kdt = k p(θd n = φt |φdt = φk ,θ−(d ,n ) ,π) ∝ n (·),−(d ,n ) d,(·),t p(θd n = new table|φdtnew = φk ,θ−(d ,n ) ,π) ∝ α0πk These equations form Dirichlet process with concentration parameter α0πk and assignment of n (·),−(d ,n ) d,(·),t to components The corresponding distribution over the number of components is desired conditional distribution of mdk Antoniak [Ant74] has shown that p(md k = m|z,md k ,π) = Γ(α0πk ) Γ(α0πk +nk d,(·),(·)) s(nk d,(·),(·),m)(α0πk )m where s(n,m) is unsigned Stirling number of the first kind JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
  132. 132. Gibbs Sampling for HDP Full conditional distribution of m The probability that word xd n is assigned to some table t such that kdt = k p(θd n = φt |φdt = φk ,θ−(d ,n ) ,π) ∝ n (·),−(d ,n ) d,(·),t p(θd n = new table|φdtnew = φk ,θ−(d ,n ) ,π) ∝ α0πk These equations form Dirichlet process with concentration parameter α0πk and assignment of n (·),−(d ,n ) d,(·),t to components The corresponding distribution over the number of components is desired conditional distribution of mdk Antoniak [Ant74] has shown that p(md k = m|z,md k ,π) = Γ(α0πk ) Γ(α0πk +nk d,(·),(·)) s(nk d,(·),(·),m)(α0πk )m where s(n,m) is unsigned Stirling number of the first kind JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
  133. 133. Gibbs Sampling for HDP Full conditional distribution of m The probability that word xd n is assigned to some table t such that kdt = k p(θd n = φt |φdt = φk ,θ−(d ,n ) ,π) ∝ n (·),−(d ,n ) d,(·),t p(θd n = new table|φdtnew = φk ,θ−(d ,n ) ,π) ∝ α0πk These equations form Dirichlet process with concentration parameter α0πk and assignment of n (·),−(d ,n ) d,(·),t to components The corresponding distribution over the number of components is desired conditional distribution of mdk Antoniak [Ant74] has shown that p(md k = m|z,md k ,π) = Γ(α0πk ) Γ(α0πk +nk d,(·),(·)) s(nk d,(·),(·),m)(α0πk )m where s(n,m) is unsigned Stirling number of the first kind JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 75 / 121
  134. 134. Gibbs Sampling for HDP Full conditional distribution of π (π1,π2,...,πK ,πu)|· ∼ Dir(m.1,m.2,...,m.K ,γ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 76 / 121
  135. 135. Gibbs Sampling for HDP Algorithm 1 Gibbs Sampling for HDP 1: Initialize all latent variables as random 2: repeat 3: for Each document d do 4: for Each word n in document d do 5: Sample z(d,n) ∼ Mult α0πk +n k ,−(d,n) d ,(·) (ηv +n k ,−(d,n) (·),v ) (∑V v=1 ηv +n k ,−(d,n) (·),v ) 6: end for 7: Sample m ∼ Mult Γ(α0πk ) Γ(α0πk +nk d,(·),(·) ) s(nk d,(·),(·),m)(α0πk )m 8: Sample β ∼ Dir(m.1,m.2,...,m.K ,γ) 9: end for 10: until Converged JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 77 / 121
  136. 136. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 78 / 121
  137. 137. Stick Breaking Construction A Hierarchical Dirichlet Process with 1,...,D documents G0 ∼ DP(γ,H) Gd |G0 ∼ DP(α0,G0) The stick-breaking construction for the HDP G0 = ∞ ∑ k=1 βk δφk φk ∼ H βk = βk k−1 ∏ i=1 (1 −βl ) βk ∼ Beta(1,γ) Gd = ∞ ∑ k=1 πdk δφk πdk = πdk k−1 ∏ i=1 (1 −πdl ) πdk ∼ Beta(α0βk ,α0(1 − k ∑ i=1 βi )) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 79 / 121
  138. 138. Alternative Stick Breaking Construction Problem) In the original stick-breaking construction, the weights βk and πdk are tightly correlated βk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,γ) πdk = πdk k−1 ∏ i=1 (1 −πdi ) πdk ∼ Beta(α0βk ,α0(1 − k ∑ i=1 βi )) Alternative Stick Breaking Construction for each document [FSJW08] ψdt ∼ G0 πdt = πdt t−1 ∏ i=1 (1 −πdi ) πdt ∼ Beta(1,α0) Gd = ∞ ∑ t=1 πdt δψdt JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 80 / 121
  139. 139. Alternative Stick Breaking Construction The stick-breaking construction for the HDP G0 = ∞ ∑ k=1 βk δφk φk ∼ H βk = βk k−1 ∏ i=1 (1 −βl ) βk ∼ Beta(1,γ) Gd = ∞ ∑ t=1 πdt δψdt ψdt ∼ G0 πdt = πdt t−1 ∏ i=1 (1 −πdi ) πdt ∼ Beta(1,α0) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 81 / 121
  140. 140. Alternative Stick Breaking Construction The stick-breaking construction for the HDP G0 = ∞ ∑ k=1 βk δφk φk ∼ H βk = βk k−1 ∏ i=1 (1 −βi ) βk ∼ Beta(1,γ) Gd = ∞ ∑ t=1 πdt δψdt ψdt ∼ G0 πdt = πdt t−1 ∏ i=1 (1 −πdi ) πdt ∼ Beta(1,α0) To connect ψdt and φk We add auxiliary variable cdt ∼ Mult(β) Then ψdt = φcdt JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 82 / 121
  141. 141. Alternative Stick Breaking Construction Generative process 1 For each global-level topic k ∈ {1,...,∞}: 1 Draw topic word proportions, φk ∼ Dir(η) 2 Draw a corpus breaking proportion, βk ∼ Beta(1,γ) 2 For each document d ∈ {1,...,D}: 1 For each document-level topic t ∈ {1,...,∞}: 1 Draw document-level topic indices, cdt ∼ Mult(σ(β )) 2 Draw a document breaking proportion, πdt ∼ Beta(1,α0) 2 For each word n ∈ {1,...,N}: 1 Draw a topic index zdn ∼ Mult(σ(πd )) 2 Generate a word wdn ∼ Mult(φcdzdn ), 3 where σ(β ) ≡ {β1,β2,...},βk = βk ∏k−1 i=1 (1 −βi ) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 83 / 121
  142. 142. Variational Inference Main idea [JGJS98] Approximate the original graphical model with a simpler model Minimize the dissimilarity between the original model and the modified one More formally Observed data X, latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the dissimilarity between p and q 2 2 Commonly measured by the KL-divergence of p from q, DKL(q||p) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121
  143. 143. Variational Inference Main idea [JGJS98] Approximate the original graphical model with a simpler model Minimize the dissimilarity between the original model and the modified one More formally Observed data X, latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the dissimilarity between p and q 2 2 Commonly measured by the KL-divergence of p from q, DKL(q||p) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 84 / 121
  144. 144. KL-divergence of p from q Find lower bound of log evidence logp(X) logp(X) = log ∑ {Z} p(Z,X) = log ∑ {Z} p(Z,X) q(Z|X) q(Z|X) = log ∑ {Z} q(Z|X) p(Z,X) q(Z|X) ≥ ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) 3 Gap between lower bound of logp(X) and logp(X) logp(X)− ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) = ∑ Z q(Z)log q(Z) p(Z|X) = DKL(q||p) 3 Use Jensen’s inequality JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121
  145. 145. KL-divergence of p from q Find lower bound of log evidence logp(X) logp(X) = log ∑ {Z} p(Z,X) = log ∑ {Z} p(Z,X) q(Z|X) q(Z|X) = log ∑ {Z} q(Z|X) p(Z,X) q(Z|X) ≥ ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) 3 Gap between lower bound of logp(X) and logp(X) logp(X)− ∑ {Z} q(Z|X)log p(Z,X) q(Z|X) = ∑ Z q(Z)log q(Z) p(Z|X) = DKL(q||p) 3 Use Jensen’s inequality JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 85 / 121
  146. 146. KL-divergence of p from q log p(X) = ∑_{Z} q(Z|X) log [p(Z,X) / q(Z|X)] + DKL(q||p) Log evidence log p(X) is fixed with respect to q Minimizing DKL(q||p) ≡ Maximizing the lower bound of log p(X) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 86 / 121
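A tiny discrete example of this identity: for any q, the lower bound plus DKL(q||p(Z|X)) equals log p(X), so tightening the bound is the same as pulling q toward the posterior. The joint probabilities below are made-up numbers for illustration.

```python
import numpy as np

# Toy joint over a binary latent Z and a fixed observation X
p_joint = np.array([0.3, 0.1])          # p(Z=0, X), p(Z=1, X)
log_px = np.log(p_joint.sum())          # log evidence log p(X)
p_post = p_joint / p_joint.sum()        # p(Z | X)

for q in (np.array([0.5, 0.5]), np.array([0.74, 0.26]), p_post):
    elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # lower bound on log p(X)
    kl = np.sum(q * (np.log(q) - np.log(p_post)))      # KL(q || p(Z|X))
    print(f"ELBO={elbo:.4f}  KL={kl:.4f}  ELBO+KL={elbo + kl:.4f}  log p(X)={log_px:.4f}")
```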
  147. 147. Variational Inference Main idea [JGJS98] Approximate the original graphical model with a simpler model Minimize the dissimilarity between the original model and the modified one More formally Observed data X, latent variable Z We want to compute p(Z|X) Make q(Z) Minimize the dissimilarity between p and q 4 Find a lower bound of log p(X) and maximize it 4 Commonly measured by the KL-divergence of p from q, DKL(q||p) JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 87 / 121
148. 148. Variational Inference for HDP
The mean-field variational distribution:
q(β, φ, π, c, z) = ∏_{k=1}^{K} q(φ_k | λ_k) · ∏_{k=1}^{K−1} q(β′_k | a^1_k, a^2_k) · ∏_{d=1}^{D} [ ∏_{t=1}^{T} q(c_{dt} | ζ_{dt}) · ∏_{t=1}^{T−1} q(π′_{dt} | γ^1_{dt}, γ^2_{dt}) · ∏_{n=1}^{N} q(z_{dn} | ϕ_{dn}) ]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 88 / 121
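A sketch of how these mean-field parameters could be laid out in code under truncation levels K and T; the dictionary keys and the choice to preallocate the document-level blocks (rather than recomputing them per document) are illustrative assumptions.

import numpy as np

def init_variational_params(D, N, K, T, V, eta=0.5, seed=0):
    # Allocate the parameters of q(beta, phi, pi, c, z); initial values are arbitrary.
    rng = np.random.default_rng(seed)
    return {
        "lambda": eta + rng.random((K, V)),    # q(phi_k)   = Dir(lambda_k)
        "a1": np.ones(K - 1),                  # q(beta'_k) = Beta(a1_k, a2_k)
        "a2": np.ones(K - 1),
        "zeta": np.full((D, T, K), 1.0 / K),   # q(c_dt)    = Mult(zeta_dt)
        "gamma1": np.ones((D, T - 1)),         # q(pi'_dt)  = Beta(gamma1_dt, gamma2_dt)
        "gamma2": np.ones((D, T - 1)),
        "varphi": np.full((D, N, T), 1.0 / T), # q(z_dn)    = Mult(varphi_dn)
    }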
149. 149. Variational Inference for HDP
Find a lower bound of log p(w | α_0, γ, η):
ln p(w | α_0, γ, η) = ln ∫∫∫ ∑_{c} ∑_{z} p(w, β, φ, π, c, z | α_0, γ, η) dβ dφ dπ
= ln ∫∫∫ ∑_{c} ∑_{z} p(w, β, φ, π, c, z | α_0, γ, η) · q(β, φ, π, c, z) / q(β, φ, π, c, z) dβ dφ dπ
≥ ∫∫∫ ∑_{c} ∑_{z} ln [ p(w, β, φ, π, c, z | α_0, γ, η) / q(β, φ, π, c, z) ] · q(β, φ, π, c, z) dβ dφ dπ   (Jensen's inequality)
= E_q[ln p(w, β, φ, π, c, z | α_0, γ, η)] − E_q[ln q(β, φ, π, c, z)]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 89 / 121
150. 150. Variational Inference for HDP
ln p(w | α_0, γ, η) ≥ E_q[ln p(w, β, φ, π, c, z | α_0, γ, η)] − E_q[ln q(β, φ, π, c, z)]
= E_q[ln p(β | γ) p(φ | η) ∏_{d=1}^{D} p(π_d | α_0) p(c_d | β) ∏_{n=1}^{N} p(w_{dn} | c_d, z_{dn}, φ) p(z_{dn} | π_d)]
− E_q[ln ∏_{k=1}^{K} q(φ_k | λ_k) ∏_{k=1}^{K−1} q(β′_k | a^1_k, a^2_k) ∏_{d=1}^{D} ∏_{t=1}^{T} q(c_{dt} | ζ_{dt}) ∏_{t=1}^{T−1} q(π′_{dt} | γ^1_{dt}, γ^2_{dt}) ∏_{n=1}^{N} q(z_{dn} | ϕ_{dn})]
= ∑_{d=1}^{D} { E_q[ln p(π_d | α_0)] + E_q[ln p(c_d | β)] + E_q[ln p(w_d | c_d, z_d, φ)] + E_q[ln p(z_d | π_d)] − E_q[ln q(c_d | ζ_d)] − E_q[ln q(π_d | γ^1_d, γ^2_d)] − E_q[ln q(z_d | ϕ_d)] }
+ E_q[ln p(β | γ)] + E_q[ln p(φ | η)] − E_q[ln q(φ | λ)] − E_q[ln q(β | a^1, a^2)]
We can run variational EM to maximize this lower bound of log p(w | α_0, γ, η)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 90 / 121
151. 151. Variational Inference for HDP
Maximize the lower bound of log p(w | α_0, γ, η): setting its derivative with respect to each variational parameter to zero gives
γ^1_{dt} = 1 + ∑_{n=1}^{N} ϕ_{dnt},   γ^2_{dt} = α_0 + ∑_{n=1}^{N} ∑_{b=t+1}^{T} ϕ_{dnb}
ζ_{dtk} ∝ exp{ ∑_{e=1}^{k−1} (Ψ(a^2_e) − Ψ(a^1_e + a^2_e)) + (Ψ(a^1_k) − Ψ(a^1_k + a^2_k)) + ∑_{n=1}^{N} ∑_{v=1}^{V} w^v_{dn} ϕ_{dnt} (Ψ(λ_{kv}) − Ψ(∑_{l=1}^{V} λ_{kl})) }
ϕ_{dnt} ∝ exp{ ∑_{h=1}^{t−1} (Ψ(γ^2_{dh}) − Ψ(γ^1_{dh} + γ^2_{dh})) + (Ψ(γ^1_{dt}) − Ψ(γ^1_{dt} + γ^2_{dt})) + ∑_{k=1}^{K} ∑_{v=1}^{V} w^v_{dn} ζ_{dtk} (Ψ(λ_{kv}) − Ψ(∑_{l=1}^{V} λ_{kl})) }
a^1_k = 1 + ∑_{d=1}^{D} ∑_{t=1}^{T} ζ_{dtk},   a^2_k = γ + ∑_{d=1}^{D} ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf}
λ_{kv} = η_v + ∑_{d=1}^{D} ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk}
(ζ and ϕ are normalized so that ∑_k ζ_{dtk} = 1 and ∑_t ϕ_{dnt} = 1)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 91 / 121
152. 152. Variational Inference for HDP
Maximize the lower bound of log p(w | α_0, γ, η) by iterating the closed-form updates obtained from its derivatives
Run variational EM
E step: compute the document-level parameters γ^1_{dt}, γ^2_{dt}, ζ_{dtk}, ϕ_{dnt}
M step: compute the corpus-level parameters a^1_k, a^2_k, λ_{kv}
Algorithm 2 Variational Inference for HDP
1: Initialize the variational parameters
2: repeat
3:   for each document d do
4:     repeat
5:       Compute document parameters γ^1_{dt}, γ^2_{dt}, ζ_{dtk}, ϕ_{dnt}
6:     until converged
7:   end for
8:   Compute topic parameters a^1_k, a^2_k, λ_{kv}
9: until converged
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 92 / 121
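A sketch of the per-document E step (lines 4–6 of Algorithm 2) in Python/SciPy under truncations K and T, writing the ζ and ϕ updates as normalized exponentials of the digamma expectations from the previous slide; stick_expectations and fit_document are hypothetical names, and the M step (recomputing a^1, a^2, λ from the accumulated ζ and ϕ statistics) is left to the caller.

import numpy as np
from scipy.special import digamma

def stick_expectations(g1, g2):
    # E_q[log of the t-th stick weight] for a truncated stick-breaking family with Beta(g1_t, g2_t)
    # sticks: E[log v_t] + sum_{i<t} E[log(1 - v_i)]; the last component keeps only the second term.
    e_log_v = digamma(g1) - digamma(g1 + g2)
    e_log_1mv = digamma(g2) - digamma(g1 + g2)
    out = np.zeros(len(g1) + 1)
    out[:-1] += e_log_v
    out[1:] += np.cumsum(e_log_1mv)
    return out

def fit_document(doc, params, alpha0, T=10, n_iter=30):
    # Coordinate ascent over one document's parameters; doc is an array of word ids,
    # params holds the corpus-level "a1", "a2" (length K-1) and "lambda" (K x V).
    lam = params["lambda"]
    e_log_word = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))  # E_q[log phi_kv]
    e_log_beta = stick_expectations(params["a1"], params["a2"])          # length K
    varphi = np.full((len(doc), T), 1.0 / T)
    for _ in range(n_iter):
        # zeta: document-level topic t -> corpus topic k, normalized over k
        log_zeta = e_log_beta[None, :] + varphi.T @ e_log_word[:, doc].T
        zeta = np.exp(log_zeta - log_zeta.max(axis=1, keepdims=True))
        zeta /= zeta.sum(axis=1, keepdims=True)
        # gamma: document stick proportions
        per_topic = varphi.sum(axis=0)
        g1 = 1.0 + per_topic[:-1]
        g2 = alpha0 + np.cumsum(per_topic[::-1])[::-1][1:]
        # varphi: word n -> document-level topic t, normalized over t
        log_varphi = stick_expectations(g1, g2)[None, :] + e_log_word[:, doc].T @ zeta.T
        varphi = np.exp(log_varphi - log_varphi.max(axis=1, keepdims=True))
        varphi /= varphi.sum(axis=1, keepdims=True)
    return zeta, g1, g2, varphi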
  153. 153. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 93 / 121
154. 154. Online Variational Inference
Apply stochastic optimization to the variational objective [WPB11]
Subsample the documents
Compute an approximation of the gradient based on the subsample
Follow that gradient with a decreasing step-size
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 94 / 121
155. 155. Variational Inference for HDP
Lower bound of log p(w | α_0, γ, η):
ln p(w | α_0, γ, η) ≥ ∑_{d=1}^{D} { E_q[ln p(π_d | α_0)] + E_q[ln p(c_d | β)] + E_q[ln p(w_d | c_d, z_d, φ)] + E_q[ln p(z_d | π_d)] − E_q[ln q(c_d | ζ_d)] − E_q[ln q(π_d | γ^1_d, γ^2_d)] − E_q[ln q(z_d | ϕ_d)] } + E_q[ln p(β | γ)] + E_q[ln p(φ | η)] − E_q[ln q(φ | λ)] − E_q[ln q(β | a^1, a^2)]
= ∑_{d=1}^{D} L_d + L_k = E_{q_j}[ D (L_j + (1/D) L_k) ],   where j is a document index drawn uniformly from {1, ..., D}
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 95 / 121
156. 156. Online Variational Inference for HDP
Lower bound of log p(w | α_0, γ, η) = E_{q_j}[ D (L_j + (1/D) L_k) ]
Online learning algorithm for HDP
Sample a document d
Compute its optimal document-level parameters γ^1_{dt}, γ^2_{dt}, ζ_{dtk}, ϕ_{dnt}
Take the noisy natural gradient of the corpus-level parameters a^1_k, a^2_k, λ_{kv} (the natural gradient is structurally equivalent to the batch variational inference update)
Update the corpus-level parameters a^1_k, a^2_k, λ_{kv} with a decreasing learning rate:
a^1_k = (1 − ρ_e) a^1_k + ρ_e (1 + D ∑_{t=1}^{T} ζ_{dtk})
a^2_k = (1 − ρ_e) a^2_k + ρ_e (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf})
λ_{kv} = (1 − ρ_e) λ_{kv} + ρ_e (η_v + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk})
where ρ_e is the learning rate, which satisfies ∑_{e=1}^{∞} ρ_e = ∞ and ∑_{e=1}^{∞} ρ_e^2 < ∞
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 96 / 121
157. 157. Online Variational Inference for HDP
Algorithm 3 Online Variational Inference for HDP
1: Initialize the variational parameters
2: e = 0
3: for each document d ∈ {1, ..., D} do
4:   repeat
5:     Compute document parameters γ^1_{dt}, γ^2_{dt}, ζ_{dtk}, ϕ_{dnt}
6:   until converged
7:   e = e + 1
8:   Compute learning rate ρ_e = (τ_0 + e)^{−κ} where τ_0 > 0, κ ∈ (0.5, 1]
9:   Update topic parameters a^1_k, a^2_k, λ_{kv}
10: end for
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 97 / 121
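A sketch of a single corpus-level step of Algorithm 3, assuming the sampled document has already been fit (for instance with a routine like the fit_document sketch earlier) so that its ζ (T x K) and ϕ (N x T) are available; the parameter dictionary layout and the scalar η are assumptions.

import numpy as np

def learning_rate(e, tau0=1.0, kappa=0.7):
    # Robbins-Monro step size: sum_e rho_e diverges and sum_e rho_e^2 is finite for kappa in (0.5, 1]
    return (tau0 + e) ** (-kappa)

def online_corpus_update(params, doc, zeta, varphi, D, e, gamma=1.0, eta=0.5):
    # One noisy natural-gradient step on a1, a2, lambda using a single sampled document.
    rho = learning_rate(e)
    K, V = params["lambda"].shape
    contrib = varphi @ zeta                    # N x K: per-word corpus-topic responsibilities
    wordtopic = np.zeros((V, K))
    np.add.at(wordtopic, doc, contrib)         # sum_n sum_t [w_dn = v] varphi_dnt zeta_dtk
    per_topic = zeta.sum(axis=0)
    tail = np.array([zeta[:, k + 1:].sum() for k in range(K - 1)])
    params["a1"] = (1 - rho) * params["a1"] + rho * (1.0 + D * per_topic[:K - 1])
    params["a2"] = (1 - rho) * params["a2"] + rho * (gamma + D * tail)
    params["lambda"] = (1 - rho) * params["lambda"] + rho * (eta + D * wordtopic.T)
    return params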
  158. 158. Outline 1 Introduction Motivation Topic Modeling 2 Background Dirichlet Distribution Dirichlet Processes 3 Hierarchical Dirichlet Processes Dirichlet Process Mixture Models Hierarchical Dirichlet Processes 4 Inference Gibbs Sampling Variational Inference Online Learning Distributed Online Learning 5 Practical Tips 6 Summary JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 98 / 121
159. 159. Motivation
Problem 1: Inference for HDP takes a long time
Problem 2: A continuously expanding corpus necessitates continuous updates of the model parameters
But updating the model parameters is not possible with plain HDP; it must be re-trained on the entire updated corpus
Our approach: combine distributed inference and online learning
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 99 / 121
  160. 160. Distributed Online HDP Based on variational inference Mini-batch updates via stochastic learning (variational EM) Distribute variational EM using MapReduce JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 100 / 121
161. 161. Distributed Online HDP
Algorithm 4 Distributed Online HDP - Driver
1: Initialize the variational parameters
2: e = 0
3: while true do
4:   Collect new documents s ∈ {1, ..., S}
5:   e = e + 1
6:   Compute learning rate ρ_e = (τ_0 + e)^{−κ} where τ_0 > 0, κ ∈ (0.5, 1]
7:   Run MapReduce job
8:   Get the result of the job and update the topic parameters
9: end while
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 101 / 121
162. 162. Distributed Online HDP
Algorithm 5 Distributed Online HDP - Mapper
1: Mapper gets one document s ∈ {1, ..., S}
2: repeat
3:   Compute document parameters γ^1_{dt}, γ^2_{dt}, ζ_{dtk}, ϕ_{dnt}
4: until converged
5: Output the sufficient statistics for the topic parameters
Algorithm 6 Distributed Online HDP - Reducer
1: Reducer gets the sufficient statistics for each topic parameter
2: Compute the change of the topic parameter from the sufficient statistics
3: Output the change of the topic parameter
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 102 / 121
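A plain-Python illustration of the mapper/reducer split, not the actual Hadoop job used in the talk: the mapper emits per-topic sufficient statistics for one fitted document (the fitting routine itself, e.g. something like the fit_document sketch earlier, is assumed), and the reducer sums them so that the driver can form the (1 + D·...), (γ + D·...), (η + D·...) targets and apply ρ_e.

import numpy as np

def map_document(doc_id, zeta, wordtopic):
    # Mapper (Algorithm 5): emit sufficient statistics keyed by corpus topic k.
    # zeta: T x K for this document; wordtopic: K x V word-topic counts for this document.
    for k in range(zeta.shape[1]):
        yield k, (zeta[:, k].sum(), wordtopic[k])

def reduce_topic(k, values):
    # Reducer (Algorithm 6): sum the per-document statistics for topic k.
    values = list(values)
    zeta_sum = sum(v[0] for v in values)
    word_sum = np.sum([v[1] for v in values], axis=0)
    return k, (zeta_sum, word_sum)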
  163. 163. Experimental Setup Data: 973,266 Twitter conversations, 7.54 tweets / conv Approximately 7,297,000 tweets 60 node Hadoop system Each node with 8 x 2.30GHz cores JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 103 / 121
164. 164. Result
Distributed online HDP runs faster than online HDP
Distributed online HDP preserves the quality of the result (perplexity)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 104 / 121
165. 165. Practical Tips
Until now, I talked about Bayesian Nonparametric Topic Modeling: the concept of Hierarchical Dirichlet Processes and how to infer the latent variables in HDP
These are of theoretical interest
Someone who attended the last machine learning winter school said: "Wow! There are good and interesting machine learning topics! But I want to know about practical issues, because I am in the industrial field."
So I prepared some tips for him/her and you
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 105 / 121
  168. 168. Implementation https://github.com/NoSyu/Topic_Models JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 106 / 121
  169. 169. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 107 / 121
  171. 171. HDP JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 109 / 121
  172. 172. Property of Dirichlet distribution Sample pmfs from Dirichlet distribution [BAFG10] JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 110 / 121
173. 173. Assign Dirichlet parameters
Dirichlet parameters are set to less than 1
People usually use only a few topics to write a document; they do not use all topics
Each topic usually uses only a few words to represent itself; it does not use all words
We can assign individual weights to topics/words
Some topics are more general than others
Some words are more general than others
Words with positive/negative meaning appear in positive/negative sentiments [JO11]
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 111 / 121
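A small check of why values below 1 encourage sparsity (a hypothetical illustration, not from the slides): with α < 1 most of the mass of a Dirichlet draw sits on a few components, while α > 1 spreads it out; exact numbers vary from draw to draw.

import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    theta = rng.dirichlet(np.full(10, alpha))
    top3 = np.sort(theta)[::-1][:3].sum()
    print(f"alpha={alpha:5.1f}  mass on top 3 of 10 components: {top3:.2f}")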
  177. 177. Some tips for using topic models How to manage hyper-parameters (Dirichlet parameters)? How to manage learning rate and mini-batch size in online learning? JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 112 / 121
178. 178. Compute learning rate
ρ_e = (τ_0 + e)^{−κ} where τ_0 > 0, κ ∈ (0.5, 1]
a^1_k = (1 − ρ_e) a^1_k + ρ_e (1 + D ∑_{t=1}^{T} ζ_{dtk})
a^2_k = (1 − ρ_e) a^2_k + ρ_e (γ + D ∑_{t=1}^{T} ∑_{f=k+1}^{K} ζ_{dtf})
λ_{kv} = (1 − ρ_e) λ_{kv} + ρ_e (η_v + D ∑_{n=1}^{N} ∑_{t=1}^{T} w^v_{dn} ϕ_{dnt} ζ_{dtk})
Meaning of each parameter
τ_0: slows down the early iterations of the algorithm
κ: rate at which old values of the topic parameters are forgotten
So the best setting depends on the dataset; usually we set τ_0 = 1.0, κ = 0.7
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 113 / 121
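A quick look at how τ_0 and κ shape the step sizes (illustrative values only): a larger τ_0 damps the early iterations, and a larger κ makes the step sizes decay faster.

def rho(e, tau0, kappa):
    return (tau0 + e) ** (-kappa)

for tau0, kappa in [(1.0, 0.7), (64.0, 0.7), (1.0, 0.9)]:
    steps = [round(rho(e, tau0, kappa), 3) for e in (1, 10, 100, 1000)]
    print(f"tau0={tau0:5.1f} kappa={kappa}: rho at e=1,10,100,1000 -> {steps}")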
181. 181. Mini-batch size
When the mini-batch size is large, distributed online HDP runs faster
Perplexity is similar to that of the other settings
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 114 / 121
182. 182. Summary
Bayesian Nonparametric Topic Modeling
Hierarchical Dirichlet Processes: Chinese Restaurant Franchise, Stick Breaking Construction
Posterior Inference for HDP: Gibbs Sampling, Variational Inference, Online Learning
Slides and other materials are uploaded at http://uilab.kaist.ac.kr/members/jinyeongbak
Implementations are updated at http://github.com/NoSyu/Topic_Models
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 115 / 121
183. 183. Further Reading
Dirichlet Process: Dirichlet Process, Dirichlet distribution and Dirichlet Process + Indian Buffet Process
Bayesian Nonparametric model: Machine Learning Summer School - Yee Whye Teh, Machine Learning Summer School - Peter Orbanz, Introductory article
Inference: MCMC, Variational Inference
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 116 / 121
  184. 184. Thank You! JinYeong Bak jy.bak@kaist.ac.kr, linkedin.com/in/jybak Users & Information Lab, KAIST JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 117 / 121
185. 185. References I
Charles E. Antoniak, Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics (1974), 1152–1174.
Bela A. Frigyik, Amol Kapila, and Maya R. Gupta, Introduction to the Dirichlet distribution and related processes, Tech. Report UWEETR-2010-0006, Department of Electrical Engineering, University of Washington, Seattle, WA 98195, December 2010.
Christopher M. Bishop and Nasser M. Nasrabadi, Pattern Recognition and Machine Learning, vol. 1, Springer, New York, 2006.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003), 993–1022.
Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky, An HDP-HMM for systems with state persistence, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 312–319.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 118 / 121
186. 186. References II
Peter D. Hoff, A First Course in Bayesian Statistical Methods, Springer, 2009.
Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul, An introduction to variational methods for graphical models, Springer, 1998.
Yohan Jo and Alice H. Oh, Aspect and sentiment unification model for online review analysis, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM '11), ACM, New York, NY, USA, 2011, pp. 815–824.
Radford M. Neal, Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics 9 (2000), no. 2, 249–265.
Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, Hierarchical Dirichlet processes, Journal of the American Statistical Association 101 (2006), no. 476.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 119 / 121
187. 187. References III
Chong Wang, John W. Paisley, and David M. Blei, Online variational inference for the hierarchical Dirichlet process, International Conference on Artificial Intelligence and Statistics, 2011, pp. 752–760.
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 120 / 121
  188. 188. Images source I http://christmasstockimages.com/free/ideas_concepts/slides/dice_throw.htm http://www.flickr.com/photos/autumn2may/3965964418/ http://www.flickr.com/photos/ppix/1802571058/ http://yesurakezu.deviantart.com/art/Domo-s-head-exploding-with-dice-298452871 http://www.flickr.com/photos/jwight/2710392971/ http://www.flickr.com/photos/jasohill/2511594886/ http://en.wikipedia.org/wiki/Kim_Yuna http://en.wikipedia.org/wiki/Hand_in_Hand_%28Olympics%29 http://en.wikipedia.org/wiki/Gangnam_Style JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 121 / 121
189. 189. Measurable space (Ω, B)
Def) A set considered together with a σ-algebra on the set (http://mathworld.wolfram.com/MeasurableSpace.html)
Ω: the set of all outcomes, the sample space
B: a σ-algebra over Ω, a special kind of collection of subsets of the sample space Ω
Closed under complementation: if A ∈ B, then A^C ∈ B
Closed under countable unions and intersections: if A ∈ B and A′ ∈ B, then A ∪ A′ ∈ B and A ∩ A′ ∈ B
A collection of events
Property
Smallest possible σ-algebra: {Ω, ∅}
Largest possible σ-algebra: the power set of Ω
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 122 / 121
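A concrete example of these properties (my own illustration, not on the slides), for Ω = {1, 2, 3}:

\Omega = \{1,2,3\}:\quad
\mathcal{B}_{\min} = \{\emptyset,\ \Omega\},\qquad
\mathcal{B} = \{\emptyset,\ \{1\},\ \{2,3\},\ \Omega\},\qquad
\mathcal{B}_{\max} = 2^{\Omega}\ (\text{all } 2^{3} = 8 \text{ subsets})

The middle collection is a valid σ-algebra because it is closed under complements and unions, whereas {∅, {1}, Ω} is not, since it misses {1}^C = {2, 3}.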
191. 191. Proof 1
Decimative property: Let (θ_1, θ_2, ..., θ_K) ∼ Dir(α_1, α_2, ..., α_K) and (τ_1, τ_2) ∼ Dir(α_1 β_1, α_1 β_2) where β_1 + β_2 = 1; then (θ_1 τ_1, θ_1 τ_2, θ_2, ..., θ_K) ∼ Dir(α_1 β_1, α_1 β_2, α_2, ..., α_K)
Then
(G({θ_1}), G(A_1), ..., G(A_R)) = (β_1, (1 − β_1) G′(A_1), ..., (1 − β_1) G′(A_R)) ∼ Dir(1, α_0 G_0(A_1), ..., α_0 G_0(A_R))
changes to
(G′(A_1), ..., G′(A_R)) ∼ Dir(α_0 G_0(A_1), ..., α_0 G_0(A_R)),   i.e. G′ ∼ DP(α_0, G_0)
using the decimative property with α_1 = α_0, θ_1 = (1 − β_1), β_k = G_0(A_k), τ_k = G′(A_k)
JinYeong Bak (U&I Lab) Bayesian Nonparametric Topic Modeling August 22, 2013 123 / 121
