Dirichlet processes and Applications

An intuitive guide to the working of Dirichlet distributions and processes, and their applications.

Published in: Data & Analytics

  1. Dirichlet Processes and Applications. Saurav Jha, Machine Learning Engineer. Copyright © 2018 FactSet Research Systems Inc. All rights reserved. Confidential: Do not forward.
  2. Table of Contents
     1. Probability 101: Mass & Density Functions
     2. Probability 102: Simplex and its geometrical meaning
     3. Dirichlet Distribution
     4. Dirichlet Process
     5. A demo
     6. An application
  3. Probability 101: Probability Mass Function (PMF) vs Probability Density Function (PDF)
     • PDF: describes how probability is spread over the values of a continuous random variable; integrating it over a range gives the probability that the variable falls in that range.
     • PMF: gives the probability that a discrete random variable is exactly equal to some value.
     • In the continuous setting: ∫_a^b f(x) dx = probability that the outcome lies between a and b, i.e., f(x) has units of probability per unit length (dx) and says how dense the probability is near x.
     • In the discrete setting: f(x) = Pr(X = x), i.e., f(x) is a plain probability: the mass of X at the point x.
     • Probability simplex: the set of all PMFs on a sample space of n outcomes, S = { x ∈ R^n : x_i ≥ 0, ∑_{i=1..n} x_i = 1 }.
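The contrast between mass and density can be checked numerically. The sketch below is illustrative only: the uniform density on [0, 2], the fair die, and the helper names `integrate` and `is_pmf` are assumptions chosen for the example, not part of the slides.

```python
def uniform_pdf(x, lo=0.0, hi=2.0):
    """Density of a uniform variable on [lo, hi]: probability per unit length."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


def is_pmf(values, tol=1e-9):
    """True if the values are non-negative and sum to 1 (a point in the simplex S)."""
    values = list(values)
    return all(v >= 0 for v in values) and abs(sum(values) - 1.0) <= tol


# Continuous case: a probability is an integral of the density over a range.
prob_interval = integrate(uniform_pdf, 0.5, 1.0)  # P(0.5 <= X <= 1.0)

# Discrete case: the PMF of a fair die assigns a plain probability to each face.
die_pmf = {face: 1 / 6 for face in range(1, 7)}

print(prob_interval)             # approximately 0.25
print(is_pmf(die_pmf.values()))  # the die PMF lies in the probability simplex
```

Here the density value 0.5 is not itself a probability; only its integral over an interval is.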
  4. Probability 102: k-simplex and its geometrical meaning
     • A k-simplex is a k-dimensional polytope (a geometric object with flat sides) formed as the convex hull of its k+1 vertices.
     • Let u0, u1, …, uk ∈ R^k be k+1 affinely independent points. The simplex determined by them is the set of points C = { θ0u0 + … + θkuk | ∑_{i=0..k} θi = 1 and θi ≥ 0 ∀ i }.
     • Example with three vertices: treat u0, u1, u2 as disjoint possible events whose probabilities sum to 1, i.e., p0 + p1 + p2 = 1 with 0 ≤ pi ≤ 1.
     • Consider the three probabilities as a point (p0, p1, p2) in Euclidean space.
     • The resulting set of points is a triangle with corners (1,0,0), (0,1,0) and (0,0,1).
     • While this set lies in 3-dimensional space, the object it forms is 2-dimensional; in general, the PMFs over n outcomes lie in R^n but form an (n−1)-dimensional simplex.
     • Each point p in the simplex is a PMF in its own right (each component of p lies in [0,1] and all components sum to 1).
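The convex-combination definition of the simplex can be sketched in a few lines. The function name `convex_combination` and the choice of unit vectors as vertices are assumptions made for this illustration.

```python
def convex_combination(weights, vertices):
    """Return sum_i weights[i] * vertices[i], requiring weights in the probability simplex."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(vertices[0])
    return tuple(
        sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim)
    )


# Corners of the triangle in R^3: each vertex is a PMF that is certain of one outcome.
vertices = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# With unit-vector vertices, the weights themselves are the coordinates of the point.
point = convex_combination([0.2, 0.5, 0.3], vertices)
print(point)
```

Any valid weight vector produces a point whose coordinates are again non-negative and sum to 1, which is why every point of this triangle can be read as a PMF over three outcomes.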
  5. Dirichlet distribution: “a distribution of distributions”
     • Let Q = [Q1, Q2, …, Qk] be a random PMF, i.e., Qi ≥ 0 for i = 1, 2, …, k and ∑_{i=1..k} Qi = 1.
     • Let α = [α1, α2, …, αk], with αi > 0 for each i, and let α0 = ∑_{i=1..k} αi.
     • Then Q has a Dirichlet distribution with parameter α, denoted Q ∼ Dir(α), if its density is P(Q1, Q2, …, Qk) = [ Γ(α0) / ∏_{i=1..k} Γ(αi) ] · ∏_{i=1..k} Qi^(αi − 1).
     • It is a probability distribution whose samples lie in the (k−1)-dimensional probability simplex ∆k, i.e., a distribution over PMFs of length k.
     • It ranges over the possible parameter vectors of a multinomial distribution and is the conjugate prior of the multinomial distribution.
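To see that draws from a Dirichlet are themselves PMFs, one can sample a few. The sketch below uses the standard construction of normalizing independent Gamma(αi, 1) variables; the parameter vector, seed, and function name are illustrative assumptions, not values from the slides.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one PMF Q ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(0)       # fixed seed, only for reproducibility
alpha = [2.0, 3.0, 5.0]      # illustrative parameters (assumption)
alpha0 = sum(alpha)

draws = [sample_dirichlet(alpha, rng) for _ in range(20_000)]

# Each draw is a point in the probability simplex; on average the draws
# approach the Dirichlet mean alpha_i / alpha_0.
empirical_mean = [sum(q[i] for q in draws) / len(draws) for i in range(len(alpha))]
theoretical_mean = [a / alpha0 for a in alpha]

print(draws[0])
print(empirical_mean, theoretical_mean)
```

Scaling all αi up while keeping their ratios fixed leaves the mean unchanged but concentrates the draws more tightly around it.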
  6. Dirichlet distribution: an example use case
     • X = vector of outcome counts from n draws of a random variable with 3 possible outcomes, e.g., X = [4, 4, 2] for n = 10.
     • PMF of X = multinomial distribution = n! / (n1! · n2! · n3!) · p1^n1 · p2^n2 · p3^n3.
     • Q) What if p1, p2, p3 are unknown, i.e., there is no certainty about the distribution of the categorical variable?
     • Solution: use a Dirichlet distribution with parameters α1, α2, α3 to first draw P ∼ Dir(α), and then draw X ∼ Multi(P).
     • This introduces one level of indirection in the model for X: instead of stating which P generated X, use the parameters α1, α2, α3 to describe which probability distributions are likely, and then draw samples X according to the random P.
     • Since sampling is directly from the probability simplex, every draw from a k-dimensional Dirichlet is itself a PMF, and the draws average to the mean of the Dirichlet, E[Pi] = αi / (α1 + … + αk).
     • Adding the Dirichlet distribution amounts to introducing prior beliefs about which X are likely to occur, i.e., the random PMF P has a Dirichlet distribution with parameter α. [1]
     • Analogy 1: if a Dirichlet distribution is a bag full of dice, then a sample from the Dirichlet is one specific die.
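The two-stage model can be sketched as follows: first evaluate the multinomial PMF for the counts [4, 4, 2] under a known P, then replace the known P with a draw from a Dirichlet. The probabilities (0.4, 0.4, 0.2), the symmetric α = (1, 1, 1), the seed, and all function names are assumptions made for illustration.

```python
import math
import random


def multinomial_pmf(counts, probs):
    """n! / (n1! * ... * nk!) * p1^n1 * ... * pk^nk."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)   # exact at each step: the result is a multinomial coefficient
    value = float(coeff)
    for c, p in zip(counts, probs):
        value *= p ** c
    return value


def sample_dirichlet(alpha, rng):
    """Draw P ~ Dir(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_multinomial(n, probs, rng):
    """Draw count vector X ~ Multi(n, probs) from n categorical draws."""
    counts = [0] * len(probs)
    for outcome in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[outcome] += 1
    return counts


# Known P: probability of observing the counts [4, 4, 2] in 10 draws.
pmf_442 = multinomial_pmf([4, 4, 2], [0.4, 0.4, 0.2])

# Unknown P: draw P from the Dirichlet prior first, then draw X given P.
rng = random.Random(1)
p = sample_dirichlet([1.0, 1.0, 1.0], rng)
x = sample_multinomial(10, p, rng)

print(pmf_442)   # 3150 * 0.4**8 * 0.2**2, about 0.0826
print(p, x)
```

Repeating the last two sampling steps many times gives count vectors that vary more than under any single fixed P, because the uncertainty about P is part of the model.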
  7. Dirichlet Process: Dirichlet processes to the rescue!
     • In the dice analogy, the dice must have a finite number of faces.
     • Limitation of the Dirichlet distribution: it assumes a finite set of events.
     • A Dirichlet process enables working with an infinite set of events, and hence modelling probability distributions over infinite sample spaces.
     • Analogy 2:
       • Ask pedestrians on the street to choose their favorite color out of {V, I, B, G, Y, O, R}.
       • Based on the answers, model each person as a PMF over the 7 colors.
       • Each person’s PMF is a realization of a draw from a Dirichlet distribution over 7 colors.
     • What if the choices are no longer restricted to 7 colors?
       • Modelling an individual’s PMF (over infinitely many dimensions) requires a distribution over distributions over an infinite sample space.
       • One solution is a Dirichlet process.
  8. Dirichlet Process: definition
     • Input: H (a probability distribution, a.k.a. the base distribution) and α (a positive real number, a.k.a. the concentration parameter).
     • Assign elements A, B, C, … to an unknown number of categories with the following algorithm:
       • Draw the first element from H; it starts the first category.
       • For the nth element, n > 1:
         • Assign it to a new category with probability α / (α + n − 1).
         • Assign it to a pre-existing category x with probability nx / (α + n − 1), where nx = number of elements already assigned to x.
     • Used when modelling data that tends to repeat previous values in a “rich get richer” fashion.
     • Can also be described as a Chinese Restaurant Process.
     • Applications: morphological segmentation in NLP, modelling mutation rates of genes in evolutionary biology.
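The assignment algorithm above can be sketched as a short simulation. This is a minimal sketch under stated assumptions: it tracks only category memberships (the draw of each new category's value from H is omitted), and the function name, seed, and α = 1 are illustrative choices.

```python
import random


def chinese_restaurant_process(n, alpha, rng):
    """Assign n elements to categories; return (category of each element, category sizes)."""
    assignments = []
    counts = []                   # counts[x] = n_x, elements already assigned to category x
    for _ in range(n):
        # Unnormalized weights: n_x for each existing category, alpha for a new one.
        # They sum to alpha + (elements placed so far), matching alpha + n - 1 on the slide.
        weights = counts + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if choice == len(counts):
            counts.append(1)      # open a new category
        else:
            counts[choice] += 1   # "rich get richer": large categories attract more elements
        assignments.append(choice)
    return assignments, counts


rng = random.Random(42)           # fixed seed, only for reproducibility
assignments, counts = chinese_restaurant_process(100, alpha=1.0, rng=rng)
print(len(counts), sorted(counts, reverse=True))
```

Running it with a larger α tends to open more categories, while a small α concentrates most elements in a few large ones.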
  9. A demo [2]
  10. An application: learning hierarchical morphology paradigms [3]
      • A paradigm = a pair (StemList, SuffixList) in which each Stem+Suffix string is a valid word.
      • Paradigms can be modelled as a hierarchical structure.
      • Morphologically similar words are close to each other in the structure.
      • Similarity metric = number of common morphemes.
      • Notation: w = word, s = stem, m = suffix.
      • Assumption: stems and suffixes are generated independently of each other.
      • Probability of a word: p(w = s+m) = p(s) · p(m).
  11. An application: learning hierarchical morphology paradigms [3]
      • Two Dirichlet processes generate stems and suffixes independently, each with its own concentration parameter β and base distribution P.
      • βs = concentration parameter of the stem process; it controls the number of stem types generated by the DP.
      • If β is small, new stem/suffix types are less likely to be generated.
      • If β is large, new stem/suffix types are more likely to be generated, yielding a more uniform distribution.
      • The authors choose β < 1 to obtain a more skewed distribution with sparse stems and suffixes.
      • P = base distribution, specifying the prior probability distribution over morpheme lengths.
      • The joint probability of the stems can then be calculated from these quantities; the formula is given in [3].
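The effect of β on the number of generated types can be made concrete. Under the sequential assignment rule of a Dirichlet process, the ith token starts a new type with probability β / (β + i − 1), so the expected number of types after n tokens is the sum of these probabilities. The sketch below computes that sum; it is a general property of the process, not a formula taken from [3], and the function name and the values of n and β are illustrative assumptions.

```python
def expected_num_types(n, beta):
    """Expected number of distinct types after n tokens: sum_{i=1..n} beta / (beta + i - 1)."""
    return sum(beta / (beta + i - 1) for i in range(1, n + 1))


# Small beta: few types, reused often (skewed). Large beta: many types (more uniform).
for beta in (0.1, 0.5, 1.0, 10.0):
    print(beta, round(expected_num_types(1000, beta), 2))
```

With β < 1 the expected number of types grows very slowly with the number of tokens, which matches the sparse stem and suffix inventories described above.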
  12. References
      1. Frigyik, Bela A., et al. “Introduction to the Dirichlet Distribution and Related Processes.” (2010).
      2. http://phyletica.org/dirichlet-process/
      3. Can, Burcu, and Suresh Manandhar. “Probabilistic Hierarchical Clustering of Morphological Paradigms.” EACL (2012).
  13. Thank you!