
# Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


The first talk on the topic of my bachelor thesis with a focus on Kneser-Ney smoothing.

Comments:

- "For Kneser-Ney smoothing, do you use the unigram counts in the recursive formula, or do you finish at the bigram continuation counts?"
- Reply: "@Sysmap Solutions We used the unigram counts in our implementation."

### Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction

1. Web Science & Technologies, University of Koblenz ▪ Landau, Germany. "Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction." Martin Körner, Oberseminar, 25.07.2013.
2. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
3. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
4. Introduction: Motivation
   - Next word prediction: what is the next word a user will type?
   - Use cases for next word prediction: Augmentative and Alternative Communication (AAC) and small keyboards (smartphones).
5. Introduction to next word prediction: how do we predict words?
   1. Rationalist approach: manually encoding information about language; suitable for "toy" problems only.
   2. Empiricist approach: statistical, pattern-recognition, and machine-learning methods applied to corpora; the result is a language model.
6. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
7. Language models in general
   - A language model answers the question: how likely is a sentence $s$? It is a probability distribution $P(s)$.
   - $P(s)$ is calculated by multiplying conditional probabilities. Example: P(If you're going to San Francisco , be sure …) = P(you're | If) · P(going | If you're) · P(to | If you're going) · P(San | If you're going to) · P(Francisco | If you're going to San) · ⋯
   - A purely empirical approach with full histories would fail: long word sequences are far too rare for their counts to be reliable.
8. Conditional probabilities simplified
   - Markov assumption [JM80]: only the last n−1 words are relevant for a prediction.
   - Example with n = 5: P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be). Note that the comma counts as a word.
9. Definitions and Markov assumption
   - An n-gram is a sequence of length n together with its count, e.g. the 5-gram "If you're going to San" with count 4.
   - Sequence notation: $w_1^{i-1} := w_1 w_2 \dots w_{i-1}$.
   - Markov assumption formalized: $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$, conditioning only on the last n−1 words.
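
In code, the counts behind these definitions are straightforward to collect. A minimal Python sketch (not from the thesis; `ngram_counts` and the toy corpus are illustrative assumptions):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram, represented as a tuple of words, in a tokenized corpus."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "if you're going to san francisco be sure".split()
counts_5 = ngram_counts(tokens, 5)  # 5-gram counts c(w_1^5)
counts_4 = ngram_counts(tokens, 4)  # history counts c(w_1^4)
```
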
10. Formalizing next word prediction
    - Instead of $P(s)$, only one conditional probability (with the Markov assumption) is needed: $P(w_i \mid w_{i-n+1}^{i-1})$, simplified to $P(w_n \mid w_1^{n-1})$ over a history of n−1 words.
    - $\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$, where $W$ is the set of all words in the corpus.
    - How can the probability $P(w_n \mid w_1^{n-1})$ be calculated?
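
The argmax itself is direct to express once a conditional probability estimator is available. A sketch, where `prob(word, history)` stands in for any of the estimators discussed on the following slides:

```python
def next_word_prediction(history, vocabulary, prob):
    """NWP(w_1^{n-1}): the vocabulary word maximizing P(w_n | w_1^{n-1})."""
    return max(vocabulary, key=lambda word: prob(word, history))
```
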
11. How to calculate $P(w_n \mid w_1^{n-1})$
    - The easiest way is the maximum-likelihood estimate: $P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \frac{c(w_1^n)}{c(w_1^{n-1})}$.
    - Example: P(San | If you're going to) = c(If you're going to San) / c(If you're going to).
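
With the count tables from the earlier sketch, the maximum-likelihood estimate is a single division. The guard for an unseen history is my addition; the slide leaves that case open (it is exactly what smoothing will address later):

```python
def p_ml(word, history, counts_n, counts_hist):
    """P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})."""
    denominator = counts_hist[tuple(history)]
    if denominator == 0:
        return 0.0  # history never seen: estimate undefined, returned as 0 here
    return counts_n[tuple(history) + (word,)] / denominator
```
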
12. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
13. Intro: Generalized Language Models (GLMs)
    - Main idea: insert wildcard words (∗) into sequences.
    - Example: instead of only P(San | If you're going to), also consider P(San | If ∗ ∗ ∗), P(San | If ∗ ∗ to), P(San | If ∗ going ∗), P(San | If ∗ going to), P(San | If you're ∗ ∗), and so on. For instance, P(San | If ∗ going ∗) has sequence length 5 and two wildcard words.
    - Separate the different types of GLMs by (1) sequence length and (2) number of wildcard words, then aggregate the results (a sketch of the pattern enumeration follows below).
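
One possible enumeration of these wildcard patterns, sketched under the assumption (matching the slide's examples, though not stated explicitly) that the first word of the history is never replaced; `glm_patterns` is a hypothetical helper:

```python
from itertools import combinations

def glm_patterns(history, num_wildcards):
    """Yield each variant of the history with `num_wildcards` words replaced
    by '*'; position 0 is kept fixed, as in the slide's examples."""
    positions = range(1, len(history))
    for chosen in combinations(positions, num_wildcards):
        yield tuple('*' if i in chosen else w for i, w in enumerate(history))

list(glm_patterns(("If", "you're", "going", "to"), 2))
# [('If', '*', '*', 'to'), ('If', '*', 'going', '*'), ('If', "you're", '*', '*')]
```
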
14. Why Generalized Language Models?
    - Data sparsity of n-grams: "If you're going to San" is seen less often than, for example, "If ∗ ∗ to San".
    - Question: does that really improve the prediction? Result of the evaluation: yes … but smoothing should be used for the language models.
15. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
16. Smoothing
    - Problem: unseen sequences. Smoothing tries to estimate probabilities for unseen sequences; in return, the probabilities of seen sequences must be reduced.
    - Two approaches: (1) backoff smoothing and (2) interpolation smoothing.
17. Backoff smoothing
    - If a sequence is unseen, use a shorter sequence: e.g. if P(San | going to) = 0, use P(San | to).
    - $P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
    - Here $\tau$ is the higher-order probability, $\gamma$ a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ the recursively computed lower-order probability.
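
The recursion can be sketched as follows, with `tau` (discounted higher-order probability), `gamma` (backoff weight), and `count` passed in as placeholders, since their concrete definitions depend on the chosen smoothing method:

```python
def p_backoff(word, history, tau, gamma, count):
    """Backoff: use the higher-order estimate if the full sequence was seen,
    otherwise recurse on the shortened history, weighted by gamma."""
    if not history:
        return tau(word, ())  # recursion base: unigram estimate
    if count(tuple(history) + (word,)) > 0:
        return tau(word, history)
    return gamma(history) * p_backoff(word, history[1:], tau, gamma, count)
```
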
18. Interpolated smoothing
    - Always include the shorter sequence in the calculation: $P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$, i.e. the higher-order probability plus a weight times the recursively computed lower-order probability.
    - Interpolation seems to work better than backoff smoothing.
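
The interpolated variant differs only in always adding the weighted lower-order term instead of branching; same placeholder functions as in the backoff sketch:

```python
def p_interpolated(word, history, tau, gamma):
    """Interpolation: discounted higher-order estimate plus gamma times the
    recursively smoothed lower-order estimate, at every level."""
    if not history:
        return tau(word, ())  # recursion base
    return (tau(word, history)
            + gamma(history) * p_interpolated(word, history[1:], tau, gamma))
```
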
19. Kneser-Ney smoothing [KN95]: intro
    - An interpolated smoothing method; the idea is to improve the lower-order calculation.
    - Example: the word "visiting" is unseen in the corpus, so P(Francisco | visiting) = 0 and P(San | visiting) = 0. Normal interpolation yields 0 + γ · P(Francisco) and 0 + γ · P(San), so "Francisco" comes out about as likely as "San" at that position. Is that correct?
    - What is the difference between "Francisco" and "San"? The number of different contexts they appear in: "Francisco" occurs almost only after "San", whereas "San" completes many different bigrams.
20. Kneser-Ney smoothing idea
    - For the lower-order calculation, don't use the raw count $c(w_n)$. Instead, use the number of different bigrams the word completes: $N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$, or in general $N_{1+}(\bullet\, w_{i+1}^{n}) := |\{ w_i : c(w_i^n) > 0 \}|$.
    - In addition: $N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) := \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$ and $N_{1+}(w_i^{n-1}\, \bullet) := |\{ w_n : c(w_i^n) > 0 \}|$.
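
All three count tables can be derived in one pass over the distinct n-grams seen in the corpus (n-gram types, not occurrence counts). A sketch, with `continuation_counts` as a hypothetical helper:

```python
from collections import Counter

def continuation_counts(seen_ngrams):
    """From the set of distinct seen n-grams, build N1+(• w_{i+1}^n),
    N1+(w_i^{n-1} •), and N1+(• w_{i+1}^{n-1} •)."""
    left, right, both = Counter(), Counter(), Counter()
    for g in seen_ngrams:
        left[g[1:]] += 1    # another distinct word preceding this suffix
        right[g[:-1]] += 1  # another distinct word following this prefix
        both[g[1:-1]] += 1  # another distinct (first, last) pair around the middle
    return left, right, both
```
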
21. Kneser-Ney smoothing equation (highest order)
    - $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{c(w_i^n) - D,\, 0\}}{c(w_i^{n-1})} + \frac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\,\bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
    - The numerator is the discounted count (the max assures a positive value), the denominator the total count; $D$ is a discount value with $0 \le D \le 1$. The second summand is the lower-order weight times the recursively computed lower-order probability.
22. Kneser-Ney smoothing equation (lower and lowest order)
    - Lower-order calculation: $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^n) - D,\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \frac{D}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\, N_{1+}(w_i^{n-1}\,\bullet)\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$, i.e. the same shape as the highest-order equation, but with continuation counts in place of raw counts.
    - Lowest-order calculation: $P_{\mathrm{KN}}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$.
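
Putting the three equations together, a compact sketch of the full recursion. It assumes a single discount `D`, a raw-count table `c`, and the `left`/`right`/`both` continuation tables from the previous sketch merged across all n-gram orders; the zero-denominator fallback is my addition:

```python
def p_kn(word, history, c, left, right, both, D=0.75, highest=True):
    """Interpolated Kneser-Ney, following the three slide equations."""
    history = tuple(history)
    if not history:
        return left[(word,)] / both[()]  # lowest order: N1+(• w_n) / N1+(• •)
    full = history + (word,)
    if highest:  # highest order: raw counts
        numerator, denominator = max(c[full] - D, 0.0), c[history]
    else:        # lower orders: continuation counts
        numerator, denominator = max(left[full] - D, 0.0), both[history]
    if denominator == 0:  # history itself unseen: fall through to lower order
        return p_kn(word, history[1:], c, left, right, both, D, False)
    gamma = (D / denominator) * right[history]
    return (numerator / denominator
            + gamma * p_kn(word, history[1:], c, left, right, both, D, False))
```
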
23. Modified Kneser-Ney smoothing [CG98]
    - Uses different discount values for different absolute counts.
    - Lower-order calculation: $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^n) - D(c(w_i^n)),\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \frac{D_1 N_1(w_i^{n-1}\,\bullet) + D_2 N_2(w_i^{n-1}\,\bullet) + D_{3+} N_{3+}(w_i^{n-1}\,\bullet)}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\, P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
    - This has been the state of the art for 15 years (!).
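
[CG98] estimate the three discounts from the counts of counts $n_1, \dots, n_4$ (the number of n-grams occurring exactly once, twice, three, and four times): with $Y = n_1 / (n_1 + 2 n_2)$, they set $D_1 = 1 - 2Y n_2/n_1$, $D_2 = 2 - 3Y n_3/n_2$, and $D_{3+} = 3 - 4Y n_4/n_3$. A sketch, assuming a corpus large enough that $n_1 \dots n_4$ are all nonzero:

```python
from collections import Counter

def modified_kn_discounts(counts):
    """Estimate D1, D2, D3+ from counts-of-counts, following [CG98]."""
    n = Counter(counts.values())  # n[r] = number of n-grams seen exactly r times
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus
```
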
24. Smoothing of GLMs
    - All of these smoothing techniques can be applied to GLMs as well, with one small modification. For P(San | If ∗ going ∗), the lower-order sequence would normally be P(San | ∗ going ∗); instead, use P(San | going ∗), i.e. wildcards left at the front after shortening are dropped as well (see the sketch below).
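
A sketch of that shortening rule, assuming histories are tuples with '*' marking wildcard words; `glm_lower_order` is a hypothetical helper:

```python
def glm_lower_order(history):
    """Shorten a GLM history: drop the first word, then any wildcards that
    end up at the front, e.g. ('If', '*', 'going', '*') -> ('going', '*')."""
    shorter = tuple(history)[1:]
    while shorter and shorter[0] == '*':
        shorter = shorter[1:]
    return shorter
```
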
25. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
26. Progress
    - Done so far: extracting text from XML files, building GLMs, Kneser-Ney and modified Kneser-Ney smoothing, indexing with MySQL.
    - To do: finish the evaluation program, run the evaluation, analyze the results.
27. Content: Introduction · Language Models · Generalized Language Models · Smoothing · Progress · Summary
28. Summary
    - Data sets: more data, better data.
    - Language models: n-grams, Generalized Language Models.
    - Smoothing: Katz, Good-Turing, Witten-Bell, Kneser-Ney, …
29. Thank you for your attention! Questions?
30. Sources
    - Images:
      - Wheelchair joystick (slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
      - Smartphone keyboard (slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
    - References:
      - [CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
      - [JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
      - [KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.