Maximum Likelihood Set: Introduction

  1. Language Modeling with the Maximum Likelihood Set (Karakos & Khudanpur, ISIT 2006). http://dx.doi.org/10.1109/ISIT.2006.261575. Presented by Yusuke Matsubara, Tsujii lab. meeting, 2006-06-22.
  2. Necessity of smoothing
     - Estimation from small samples. Suppose:
       - the possible word set is {A, B, C},
       - we use maximum likelihood estimation (unigram),
       - and the count of word C is accidentally 0 in the corpus, e.g. "A B A A B".
     - The MLE then predicts that C will never occur, even though we know C can occur: the probability of C is underestimated (see the sketch below).
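A minimal sketch of this zero-count problem, using the slide's own example of a unigram MLE over the word set {A, B, C}; the variable names are illustrative only:

```python
# Unigram MLE from the corpus "A B A A B" over the vocabulary {A, B, C}.
from collections import Counter

corpus = "A B A A B".split()
vocab = ["A", "B", "C"]

counts = Counter(corpus)
n = len(corpus)
mle = {w: counts[w] / n for w in vocab}
print(mle)        # {'A': 0.6, 'B': 0.4, 'C': 0.0}

# Any text containing C gets probability 0 under the MLE,
# even though we know C can occur: C is underestimated.
prob = 1.0
for w in "A C B".split():
    prob *= mle[w]
print(prob)       # 0.0
```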
  3. The Maximum Likelihood Set [Jedynak & Khudanpur 2005]
     - Given a set of words and their counts (i.e. a corpus).
     - The true pmf should make the given corpus at least as probable as any other corpus of the same size.
     - The MLS contains all such pmfs.
     [Figure: the probability simplex p1 + p2 + p3 = 1, showing the MLEs obtained from all possible count vectors with sample size n = 3 over a word set of size k = 3.]
  4. The Maximum Likelihood Set (formal definition)
     - The MLS is characterized by k^2 linear inequality constraints (the defining equations appeared on the slide; a reconstruction is given below).
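The definition itself was shown as an equation on the slide and is not recoverable from the transcript; the following is a reconstruction based on the pairwise characterization in Jedynak & Khudanpur (2005), which is an assumption here rather than a transcription of the slide:

```latex
% Reconstruction (assumed form) of the MLS definition.
% Given counts c = (c_1, ..., c_k) with total n over a k-word vocabulary,
% the MLS is the set of pmfs p under which the observed count vector is
% at least as likely as any other count vector of the same size n:
\[
  \mathrm{MLS}(c) = \Bigl\{\, p \in \Delta_k :
      P_p(c) \ge P_p(c') \ \text{for all } c' \text{ with } \textstyle\sum_i c'_i = n \,\Bigr\},
\]
% which reduces to roughly k^2 pairwise linear inequalities:
\[
  c_i \, p_j \;\le\; (c_j + 1)\, p_i \qquad \text{for all } i \ne j .
\]
```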
  5. The Maximum Likelihood Set
     [Figure: the MLS drawn on the probability simplex for n = 3 samples, k = 3 words, and for n = 10 samples, k = 3 words.]
     - Larger samples make the MLS smaller and bring it closer to the MLE.
  6. Choosing a pmf from an MLS
     - Assume a reference pmf.
     - Choose the pmf in the MLS that minimizes the KL divergence to the reference pmf (see the sketch below).
     [Figure: a reference pmf and the MLS region on the probability simplex.]
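A minimal sketch of this KL projection, assuming the pairwise constraint form reconstructed above; it uses scipy's SLSQP solver instead of the CFSQP solver mentioned later in the deck, and `mls_project` is a hypothetical helper name:

```python
import numpy as np
from scipy.optimize import minimize

def mls_project(counts, reference, eps=1e-12):
    """Return the pmf in the MLS of `counts` closest in KL to `reference`."""
    counts = np.asarray(counts, dtype=float)
    reference = np.asarray(reference, dtype=float)
    k = len(counts)

    # Objective: KL(p || reference) = sum_i p_i * log(p_i / q_i).
    def kl(p):
        p = np.clip(p, eps, 1.0)
        return float(np.sum(p * np.log(p / reference)))

    # Constraints: p sums to 1, and c_i * p_j <= (c_j + 1) * p_i for i != j
    # (the assumed pairwise MLS form).
    cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    for i in range(k):
        for j in range(k):
            if i != j:
                cons.append({
                    "type": "ineq",   # scipy convention: fun(p) >= 0
                    "fun": lambda p, i=i, j=j:
                        (counts[j] + 1.0) * p[i] - counts[i] * p[j],
                })

    x0 = (counts + 1.0) / (counts.sum() + k)   # start from an add-one pmf
    res = minimize(kl, x0, method="SLSQP",
                   bounds=[(eps, 1.0)] * k, constraints=cons)
    return res.x

# Counts (3, 2, 0) for {A, B, C} from "A B A A B", uniform reference pmf.
print(mls_project([3, 2, 0], [1/3, 1/3, 1/3]))
```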
  7. Conditional pmf estimation
     - There is a different MLS for each conditioning context.
     - In the case of trigram language modeling:
       - |V|^2 MLSs (one per bigram history),
       - each with |V|^2 constraints.
       - However, many redundant constraints can be removed (for instance, under the pairwise form assumed above, constraints originating from words with zero count in a given history are trivially satisfied).
  8. Experimental results
     - Corpus: UPenn Treebank
       - Sections 00-22 (900K words) for training.
       - Sections 23-24 (100K words) for testing.
     - Evaluation: code length of a word (entropy, in bits per word).
     - Optimization solver: CFSQP (linear constraints and a differentiable objective function).

     Code length (bits/word)    Bigram                     Trigram
                                Witten-Bell   Kneser-Ney   Witten-Bell   Kneser-Ney
     Reference                  8.47          8.36         8.21          8.08
     MLS                        8.44          8.38         8.24          8.12
  9. Conclusion
     - MLS has competitive performance.
     - It can incorporate prior knowledge through the reference pmf.
     - Additional desirable properties are proven:
       - Consistent estimation: as the number of samples goes to infinity, MLS = {MLE}.
       - Faithfulness to the counts: c_i < c_j implies p(i) ≤ p(j).
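As a quick check, the faithfulness property follows in one line from the pairwise constraint form assumed earlier (this derivation is not a transcription of the slide):

```latex
% If c_i < c_j (so c_i + 1 <= c_j and c_j > 0), then for any p in the MLS
% the constraint c_j p_i <= (c_i + 1) p_j gives
\[
  c_j\, p_i \;\le\; (c_i + 1)\, p_j \;\le\; c_j\, p_j
  \quad\Longrightarrow\quad p_i \le p_j .
\]
```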
  10. [Backup slide: labelled points on the probability simplex, e.g. (0, 0, 1), (0, 1/3, 2/3), (0, 2/3, 1/3), (0, 1, 0), (2/3, 0, 1/3), (1/3, 0, 2/3), (1/3, 1/3, 1/3).]
