# Maximum likelihood-set - introduction


### Transcript

• 1. Language Modeling with the Maximum Likelihood Set (Karakos & Khudanpur, ISIT-2006) http://dx.doi.org/10.1109/ISIT.2006.261575 Yusuke Matsubara Tsujii lab. Meeting 2006-06-22
• 2. Necessity of smoothing
• Estimation from small samples
• Suppose
• The possible word set is { A , B , C }
• Maximum likelihood estimation (unigram)
• The count of word C is accidentally 0 in the corpus
• Like “ A B A A B ”
• MLE predicts that C will never occur
• Even though we know C can occur
• This is underestimation
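The zero-count problem described above can be reproduced in a few lines of Python (a toy sketch, not the paper's code):

```python
from collections import Counter

def mle_unigram(corpus, vocab):
    # Maximum likelihood estimate: the relative frequency of each word.
    counts = Counter(corpus)
    n = len(corpus)
    return {w: counts[w] / n for w in vocab}

# Counting the corpus "A B A A B" over the word set {A, B, C}:
# C never occurs, so the MLE assigns it probability exactly 0,
# i.e. it predicts C can never occur.
p = mle_unigram("A B A A B".split(), vocab=["A", "B", "C"])
```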
• 3. The Maximum Likelihood Set [ Jedynak & Khudanpur 2005 ]
• Given a set of words and word counts (= a corpus)
• The true pmf should predict that the given corpus is more probable than any other corpus of the same size
• The MLS contains all such pmfs
[Figure: the simplex p1 + p2 + p3 = 1, marking the MLEs of all possible count vectors, for sample size n = 3 and word-set size k = 3]
• 4. The Maximum Likelihood Set (formal definition): k² linear inequality constraints
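A minimal sketch of an MLS membership test, assuming the pairwise constraint form from Jedynak & Khudanpur (2005): a pmf p is in the MLS of counts c iff c_i · p_j ≤ (c_j + 1) · p_i for all i ≠ j, i.e. the observed count vector is at least as probable under p as any count vector reachable by moving one count between words.

```python
def in_mls(p, counts, tol=1e-12):
    # Pairwise MLS constraints (assumed form, per Jedynak & Khudanpur 2005):
    # counts[i] * p[j] <= (counts[j] + 1) * p[i] for all i != j.
    k = len(counts)
    return all(counts[i] * p[j] <= (counts[j] + 1) * p[i] + tol
               for i in range(k) for j in range(k) if i != j)

# Counts for "A B A A B" over {A, B, C} are (3, 2, 0): the MLE (0.6, 0.4, 0)
# is in the MLS, and so is a smoothed pmf giving C a small positive mass,
# but a pmf that strongly contradicts the counts is not.
counts = (3, 2, 0)
```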
• 5. The Maximum Likelihood Set: larger samples bring the MLS nearer to the MLE [Figures: the MLS on the simplex for n = 3, k = 3 and for n = 10, k = 3]
• 6. Choosing a pmf from a MLS
• Assume a reference pmf
• Choose the pmf in the MLS that minimizes the KL divergence to the reference pmf
[Figure: projection of the reference pmf onto the MLS]
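This selection step can be illustrated with a brute-force grid search over the 3-word simplex (the paper uses a proper constrained solver; this toy again assumes the pairwise MLS constraint form):

```python
import math

def in_mls(p, counts, tol=1e-12):
    # Pairwise MLS constraints (assumed form):
    # counts[i] * p[j] <= (counts[j] + 1) * p[i] for all i != j.
    k = len(counts)
    return all(counts[i] * p[j] <= (counts[j] + 1) * p[i] + tol
               for i in range(k) for j in range(k) if i != j)

def kl(p, q):
    # KL divergence D(p || q), with the convention 0 * log 0 = 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def project_to_mls(reference, counts, steps=100):
    # Scan a grid on the simplex and keep the MLS member with the
    # smallest KL divergence to the reference pmf.
    best, best_div = None, float("inf")
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            p = (a / steps, b / steps, (steps - a - b) / steps)
            if in_mls(p, counts):
                d = kl(p, reference)
                if d < best_div:
                    best, best_div = p, d
    return best

# A uniform reference violates the MLS constraints of counts (3, 2, 0)
# (it gives C too much mass); the projection keeps C's probability
# positive but pulls it down until the constraints hold.
r = project_to_mls((1/3, 1/3, 1/3), (3, 2, 0))
```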
• 7. Conditional pmf estimation
• Different MLS for each condition
• In the case of trigram language modeling,
• |V|² MLSs (one per two-word history)
• And each MLS has |V|² constraints
• However, we can remove many redundant constraints
• 8. Experimental results
• Corpus
• UPenn Treebank
• Sect 00-22 (900K words) for training
• Sect 23-24 (100K words) for testing
• Evaluation
• Code length of a word (entropy)
• Optimization Solver
• CFSQP
• Linear constraints & differentiable objective function
|           | Bigram, Witten-Bell | Bigram, Kneser-Ney | Trigram, Witten-Bell | Trigram, Kneser-Ney |
|-----------|--------------------:|-------------------:|---------------------:|--------------------:|
| Reference | 8.47                | 8.36               | 8.21                 | 8.08                |
| MLS       | 8.44                | 8.38               | 8.24                 | 8.12                |
• 9. Conclusion
• MLS has competitive performance
• It can incorporate prior knowledge as a reference pmf
• Additional good properties are proven:
• Consistent estimation:
• MLS = {MLE} as the number of samples → ∞
• Faithful to the counts: c_i < c_j ⇒ p(i) ≤ p(j)
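The faithfulness property can be spot-checked numerically (a Monte Carlo sketch, not a proof, again assuming the pairwise MLS constraint form):

```python
import random

def in_mls(p, counts, tol=1e-12):
    # Pairwise MLS constraints (assumed form):
    # counts[i] * p[j] <= (counts[j] + 1) * p[i] for all i != j.
    k = len(counts)
    return all(counts[i] * p[j] <= (counts[j] + 1) * p[i] + tol
               for i in range(k) for j in range(k) if i != j)

def faithful(p, counts):
    # The claimed property: c_i < c_j implies p(i) <= p(j).
    k = len(counts)
    return all(p[i] <= p[j] + 1e-9
               for i in range(k) for j in range(k) if counts[i] < counts[j])

# Sample random pmfs on the 3-word simplex; every sample that lands inside
# the MLS of counts (3, 2, 0) should also order its probabilities like the
# counts.
random.seed(0)
counts = (3, 2, 0)
members = []
for _ in range(10000):
    cuts = sorted([random.random(), random.random()])
    p = (cuts[0], cuts[1] - cuts[0], 1.0 - cuts[1])
    if in_mls(p, counts):
        members.append(p)
```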
• 10. [Figure: simplex diagram with labeled points (0, 0, 1), (0, 1/3, 2/3), (0, 2/3, 1/3), (0, 1, 0), (2/3, 0, 1/3), (1/3, 0, 2/3), (1/3, 1/3, 1/3)]