Digging into the Dirichlet Distribution by Max Sklar

When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example, if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher-dimensional case, each document may be expressed as a count of words, with the vector large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data.

The Dirichlet distribution is surprisingly expressive on its own, but it can also be used as a building block for even more powerful and deep models such as mixtures and topic models.

Transcript

  • 1. Digging into the Dirichlet Max Sklar @maxsklar New York Machine Learning Meetup December 19th, 2013
  • 2. Dedication Meyer Marks 1925 - 2013
  • 3. The Dirichlet Distribution
  • 4. Let’s start with something simpler
  • 5. Let’s start with something simpler A Pie Chart!
  • 6. Let’s start with something simpler A Pie Chart! AKA Discrete Distribution Multinomial Distribution
  • 7. Let’s start with something simpler A Pie Chart! K = The number of categories. K=5
  • 8. Examples of Multinomial Distributions
  • 9. Examples of Multinomial Distributions
  • 10. Examples of Multinomial Distributions
  • 11. Examples of Multinomial Distributions
  • 12. Examples of Multinomial Distributions
  • 13. What does the raw data look like?
  • 14. What does the raw data look like? Counts!

        id   # likes   # dislikes
         1      231         23
         2       81         40
         3       67          9
         4      121         14
         5        9         31
         6       18          0
         7        1          1

  • 15. What does the raw data look like? More specifically: K columns of counts, N rows of data (the same table as above).
  • 16. BUT... Counts != Multinomial Distribution
  • 17. BUT... We can estimate the multinomial distribution with the counts, using the maximum likelihood estimate 366 181 203
  • 18. BUT... We can estimate the multinomial distribution with the counts, using the maximum likelihood estimate Sum = 366 + 181 + 203 = 750 366 181 203
  • 19. BUT... We can estimate the multinomial distribution with the counts, using the maximum likelihood estimate 366 / 750 181 / 750 203 / 750 366 181 203
  • 20. BUT... We can estimate the multinomial distribution with the counts, using the maximum likelihood estimate 48.8% 24.1% 27.1% 366 181 203
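
In code, the estimate on slides 17-20 is just a normalization of the counts:

```python
# The maximum likelihood estimate from the slides: normalize the counts.
counts = [366, 181, 203]
total = sum(counts)                      # 750
mle = [c / total for c in counts]
print([round(100 * p, 1) for p in mle])  # [48.8, 24.1, 27.1] -- the slide's percentages
```
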
  • 21. BUT... Uh Oh: a new row of small counts arrives, (1, 2, 1), next to (366, 181, 203).
  • 22. BUT... This column will be all Yellow, right? Another row: (0, 1, 0).
  • 23. BUT... Panic!!!! A row of all zeros: (0, 0, 0).
  • 24. Bayesian Statistics to the Rescue
  • 25. Bayesian Statistics to the Rescue Still assume each row was generated by a multinomial distribution
  • 26. Bayesian Statistics to the Rescue Still assume each row was generated by a multinomial distribution We just don’t know which one!
  • 27. The Dirichlet Distribution Is a probability distribution over all possible multinomial distributions, p.
  • 28. The Dirichlet Distribution Represents our uncertainty over the actual distribution that created the row.
  • 29. The Dirichlet Distribution p: represents a multinomial distribution; alpha: the parameters of the Dirichlet; K: the number of categories
  • 30. Bayesian Updates
  • 31. Bayesian Updates Also a Dirichlet!
  • 32. Bayesian Updates Also a Dirichlet! (Conjugate Prior)
  • 33. Bayesian Updates
  • 34. Bayesian Updates
  • 35. Bayesian Updates +1
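
The conjugate-prior identity these slides illustrate, written out:

```latex
% Dirichlet prior + multinomial counts c -> Dirichlet posterior:
p(\mathbf{p}) = \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha})
\quad\Longrightarrow\quad
p(\mathbf{p} \mid \mathbf{c}) = \mathrm{Dir}(\mathbf{p} \mid \boldsymbol{\alpha} + \mathbf{c})
```
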
  • 36. Why Does this Work? Let’s look at it again.
  • 37. Entropy
  • 38. Entropy Information Content
  • 39. Entropy Information Content Energy
  • 40. Entropy Information Content Energy Log Likelihood
  • 41. Entropy Information Content Energy Log Likelihood
  • 42. The Dirichlet Distribution
  • 43. The Dirichlet Distribution Normalizing Constant
  • 44. The Dirichlet Distribution Normalizing Constant
  • 45. The Dirichlet Distribution
  • 46. The Dirichlet Distribution
  • 47. The Dirichlet Distribution
  • 48. The Dirichlet Distribution Linear
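
The density being built up on slides 42-48, in its standard form; the "Linear" slide refers to the log density being linear in log p_k:

```latex
% Dirichlet density over a multinomial p = (p_1, ..., p_K):
P(\mathbf{p} \mid \boldsymbol{\alpha})
  = \underbrace{\frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}
                     {\prod_{k=1}^{K} \Gamma(\alpha_k)}}_{\text{normalizing constant}}
    \prod_{k=1}^{K} p_k^{\alpha_k - 1}
% Taking logs exposes the linear structure:
\log P(\mathbf{p} \mid \boldsymbol{\alpha})
  = \mathrm{const}(\boldsymbol{\alpha}) + \sum_{k=1}^{K} (\alpha_k - 1) \log p_k
```
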
  • 49. The Dirichlet MACHINE Prior 1.2 3.0 0.3
  • 50. The Dirichlet MACHINE Prior 1.2 3.0 0.3
  • 51. The Dirichlet MACHINE Update 2.2 3.0 0.3
  • 52. The Dirichlet MACHINE Prior 2.2 3.0 0.3
  • 53. The Dirichlet MACHINE Update 2.2 3.0 1.3
  • 54. The Dirichlet MACHINE Prior 2.2 3.0 1.3
  • 55. The Dirichlet MACHINE Update 2.2 3.0 2.3
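
A minimal Python sketch reproducing the MACHINE's trace above:

```python
# The "Dirichlet MACHINE": each observation bumps its category's alpha by one.
alpha = [1.2, 3.0, 0.3]        # the prior from slide 49

for category in [0, 2, 2]:     # the three updates shown on slides 51-55
    alpha[category] += 1       # conjugate update: add the observed count
    print(alpha)               # [2.2, 3.0, 0.3], [2.2, 3.0, 1.3], [2.2, 3.0, 2.3]
```
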
  • 56. Interpreting the Parameters alpha = (1, 3, 2). What does this alpha vector really mean?
  • 57. Interpreting the Parameters Normalized, (1, 3, 2) becomes (1/6, 3/6, 2/6); the sum is 6.
  • 58. Interpreting the Parameters The normalized vector is the Expected Value.
  • 59. Interpreting the Parameters The sum, 6, is the Weight.
  • 60. ANALOGY: Normal Distribution Precision = 1 / variance
  • 61. ANALOGY: Normal Distribution High precision: data is close to the mean Low precision: far away from the mean
  • 62. Interpreting the Parameters The expected value is alpha normalized; the sum, 6, is the Precision.
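
Written out, for alpha = (1, 3, 2):

```latex
% Expected value: alpha normalized. Weight (precision): the sum of alpha.
\mathbb{E}[p_k] = \frac{\alpha_k}{A}
  = \left(\tfrac{1}{6}, \tfrac{3}{6}, \tfrac{2}{6}\right),
\qquad
A = \sum_{k=1}^{K} \alpha_k = 6
```
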
  • 63. High Weight Dirichlet (mean: 0.4, 0.4, 0.2)
  • 64. Low Weight Dirichlet (mean: 0.4, 0.4, 0.2)
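
A quick sampling sketch, assuming NumPy (the weights 100 and 2 are made up for illustration), showing that a higher weight keeps samples closer to the shared mean:

```python
import numpy as np

# Same mean, different weight: high weight keeps samples near the mean.
mean = np.array([0.4, 0.4, 0.2])
rng = np.random.default_rng(0)

for weight in (100.0, 2.0):                  # high vs. low weight
    samples = rng.dirichlet(mean * weight, size=1000)
    print(f"weight {weight}: std of samples = {samples.std(axis=0).round(3)}")
```
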
  • 65. Urn Model At each step, pick a ball from the urn. Replace it, and add another ball of that color into the urn.
  • 66. Urn Model At each step, pick a ball from the urn. Replace it, and add another ball of that color into the urn.
  • 67. Urn Model At each step, pick a ball from the urn. Replace it, and add another ball of that color into the urn.
  • 68. Urn Model At each step, pick a ball from the urn. Replace it, and add another ball of that color into the urn.
  • 69. Urn Model At each step, pick a ball from the urn. Replace it, and add another ball of that color into the urn.
  • 70. Urn Model Rich get richer...
  • 71. Urn Model Rich get richer...
  • 72. Urn Model Finally yellow catches a break
  • 73. Urn Model Finally yellow catches a break
  • 74. Urn Model But it’s too late...
  • 75. Urn Model As the urn gets more populated, the distribution gets “stuck” in place.
  • 76. Urn Model Once lots of data has been collected, or the Dirichlet has high precision, it’s hard to overturn that with new data
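
A minimal Python sketch of the urn scheme (the starting urn and step count are arbitrary), showing the rich-get-richer lock-in:

```python
import random

def polya_urn(counts, steps, seed=0):
    """Pick a ball in proportion to counts; replace it, plus one more of its color."""
    rng = random.Random(seed)
    counts = list(counts)
    for _ in range(steps):
        color = rng.choices(range(len(counts)), weights=counts)[0]
        counts[color] += 1
    return counts

# Start with one ball of each color; early luck dominates the final mix.
print(polya_urn([1, 1, 1], steps=1000))
```
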
  • 77. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 78. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 79. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 80. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 81. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 82. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 83. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 84. Chinese Restaurant Process When you find the white ball, throw a new color into the mix.
  • 85. Chinese Restaurant Process The expected infinite distribution (mean) is exponential. # of white balls controls the exponent
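
A hypothetical sketch of the white-ball variant: drawing white starts a brand-new color, and the sorted color sizes fall off roughly geometrically, matching the exponential mean on slide 85:

```python
import random

def crp_urn(white_balls, steps, seed=0):
    """Urn with a fixed number of white balls; drawing white starts a new color."""
    rng = random.Random(seed)
    colors = []                                    # ball counts per colored class
    for _ in range(steps):
        weights = [white_balls] + colors
        pick = rng.choices(range(len(weights)), weights=weights)[0]
        if pick == 0:
            colors.append(1)                       # white ball: new color enters
        else:
            colors[pick - 1] += 1                  # existing color gets richer
    return sorted(colors, reverse=True)

print(crp_urn(white_balls=2, steps=500))           # more white balls, steeper falloff
```
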
  • 86. So coming back to the count data: what Dirichlet parameters explain the data? Rows of counts: (20, 0, 2), (0, 1, 17), (14, 6, 0), (15, 5, 0), (0, 20, 0), (0, 14, 6)
  • 87. So coming back to the count data: Newton’s Method. Requires Gradient + Hessian.
  • 88. So coming back to the count data: reads all of the data...
  • 89. So coming back to the count data: https://github.com/maxsklar/BayesPy/tree/master/ConjugatePriorTools
  • 90. So coming back to the count data: compress the data into a Matrix and a Vector. Works for lots of sparsely populated rows.
  • 91. The Compression MACHINE: a matrix of counts, initially all zeros.
  • 92. The Compression MACHINE K=3 (the 4th row is a special, total row). M=6, the maximum # samples per input.
  • 93. The Compression MACHINE Feed in the row (1, 0, 0).
  • 94. The Compression MACHINE Matrix: (1 0 0 0 0 0), (0 0 0 0 0 0), (0 0 0 0 0 0); total row (1 0 0 0 0 0).
  • 95. The Compression MACHINE Feed in the row (1, 3, 2).
  • 96. The Compression MACHINE Matrix: (2 0 0 0 0 0), (1 1 1 0 0 0), (1 1 0 0 0 0); total row (2 1 1 1 1 1).
  • 97. The Compression MACHINE Feed in the row (0, 6, 0).
  • 98. The Compression MACHINE Matrix: (2 0 0 0 0 0), (2 2 2 1 1 1), (1 1 0 0 0 0); total row (3 2 2 2 2 2).
  • 99. The Compression MACHINE Feed in the row (2, 1, 1).
  • 100. The Compression MACHINE Matrix: (3 1 0 0 0 0), (3 2 2 1 1 1), (2 1 0 0 0 0); total row (4 3 3 3 2 2).
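
The linked ConjugatePriorTools code covers this in full; below is an independent minimal sketch, not the repo's actual code, of fitting alpha from the compressed matrix. A Minka-style fixed-point update is swapped in for the Newton step the slides mention, since it needs no Hessian:

```python
def compress(rows, K):
    """U[k][j] = # of rows with more than j samples in category k;
    V[j] = # of rows with more than j samples in total (the special total row)."""
    M = max(sum(r) for r in rows)          # maximum # samples per input
    U = [[0] * M for _ in range(K)]
    V = [0] * M
    for row in rows:
        for k, n in enumerate(row):
            for j in range(n):
                U[k][j] += 1
        for j in range(sum(row)):
            V[j] += 1
    return U, V, M

def fit_dirichlet(U, V, K, M, iterations=1000):
    """Fixed point: the compressed counts replace a pass over every raw row."""
    alpha = [1.0] * K
    for _ in range(iterations):
        A = sum(alpha)
        denom = sum(V[j] / (A + j) for j in range(M))
        alpha = [a * sum(U[k][j] / (a + j) for j in range(M)) / denom
                 for k, a in enumerate(alpha)]
    return alpha

rows = [(20, 0, 2), (0, 1, 17), (14, 6, 0), (15, 5, 0), (0, 20, 0), (0, 14, 6)]
U, V, M = compress(rows, K=3)
print([round(a, 3) for a in fit_dirichlet(U, V, K=3, M=M)])
```
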
  • 101. DEMO
  • 102. Our Popularity Prior
  • 103. Dirichlet Mixture Models Anything you can do with a Gaussian, you can also do with a Dirichlet
  • 104. Dirichlet Mixture Models Example: Mixture of Gaussians using Expectation-Maximization
  • 105. Dirichlet Mixture Models Assume each row is a mixture of multinomials, and the parameters of that mixture are pulled from a Dirichlet.
  • 106. Dirichlet Mixture Models Latent Dirichlet Allocation Topic Model
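
A hypothetical generative sketch, assuming NumPy, of the LDA-style topic model named here: each row's topic mixture is drawn from a Dirichlet, and each word from the multinomial of its sampled topic. All sizes and hyperparameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, vocab_size = 3, 8

# Each topic is a multinomial over the vocabulary, itself Dirichlet-drawn.
topics = rng.dirichlet([0.5] * vocab_size, size=n_topics)

def generate_document(n_words, alpha=0.3):
    theta = rng.dirichlet([alpha] * n_topics)            # this row's topic mixture
    doc = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)                # pick a topic
        doc.append(rng.choice(vocab_size, p=topics[z]))  # pick a word from it
    return doc

print(generate_document(20))
```
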
  • 107. Questions