Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction


The first talk on the topic of my bachelor thesis, with a focus on Kneser-Ney smoothing.

  1. Web Science & Technologies, University of Koblenz ▪ Landau, Germany
     Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
     Martin Körner (mkoerner@uni-koblenz.de), Oberseminar 25.07.2013
  2. Content
      Introduction
      Language Models
      Generalized Language Models
      Smoothing
      Progress
      Summary
  3. Content (section divider; same outline as slide 2)
  4. Introduction: Motivation
      Next word prediction: What is the next word a user will type?
      Use cases for next word prediction:
     • Augmentative and Alternative Communication (AAC)
     • Small keyboards (smartphones)
  5. Introduction to next word prediction
      How do we predict words?
     1. Rationalist approach
        • Manually encoding information about language
        • Works for "toy" problems only
     2. Empiricist approach
        • Statistical, pattern-recognition, and machine-learning methods applied to corpora
        • Result: language models
  6. Content (section divider; same outline as slide 2)
  7. Language models in general
      Language model: How likely is a sentence $s$?
      Probability distribution: $P(s)$
      Calculate $P(s)$ by multiplying conditional probabilities (chain rule)
      Example:
       $P(\text{If you're going to San Francisco , be sure} \ldots) = P(\text{you're} \mid \text{If}) \cdot P(\text{going} \mid \text{If you're}) \cdot P(\text{to} \mid \text{If you're going}) \cdot P(\text{San} \mid \text{If you're going to}) \cdot P(\text{Francisco} \mid \text{If you're going to San}) \cdots$
      Estimating such full-history probabilities empirically would fail: most long histories never occur in a corpus
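A minimal Python sketch of this chain-rule decomposition; `cond_prob` is a hypothetical stand-in for any estimator of $P(w \mid \text{history})$:

```python
def sentence_prob(words, cond_prob):
    """P(s) as the product of P(w_i | w_1 .. w_{i-1}) over all positions."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, tuple(words[:i]))
    return prob

# Toy usage with a constant estimator:
sentence_prob("if you're going to san francisco".split(), lambda w, h: 0.1)
```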
  8. Conditional probabilities simplified
      Markov assumption [JM80]: only the last n−1 words are relevant for a prediction
      Example with n = 5:
       $P(\text{sure} \mid \text{If you're going to San Francisco , be}) \approx P(\text{sure} \mid \text{San Francisco , be})$
       (the comma counts as a word)
  9. Definitions and Markov assumption
      n-gram: sequence of length n together with its count
     • E.g. the 5-gram "If you're going to San" with count 4
      Sequence naming: $w_1^{i-1} := w_1 w_2 \ldots w_{i-1}$
      Markov assumption formalized (n−1 words of history):
       $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$
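The truncation itself is a one-liner; a sketch, where `markov_history` is a hypothetical helper name:

```python
def markov_history(history, n):
    """Apply the Markov assumption: keep only the last n-1 words."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

markov_history(("if", "you're", "going", "to", "san", "francisco", ",", "be"), 5)
# -> ('san', 'francisco', ',', 'be')
```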
  10. Formalizing next word prediction
       Instead of $P(s)$: only one conditional probability $P(w_i \mid w_{i-n+1}^{i-1})$ (the conditional probability with the Markov assumption applied)
      • Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$, i.e. n−1 words of history
       $\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$, where $W$ is the set of all words in the corpus
       How to calculate the probability $P(w_n \mid w_1^{n-1})$?
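As a sketch, the argmax translates directly; `vocab` plays the role of $W$ and `cond_prob` is again a hypothetical estimator:

```python
def next_word(history, vocab, cond_prob):
    """NWP(w_1^{n-1}): the w in W maximizing P(w | history)."""
    return max(vocab, key=lambda w: cond_prob(w, history))

# Toy usage with a hypothetical estimator:
next_word(("going", "to"), {"san", "the", "be"},
          lambda w, h: 0.6 if w == "san" else 0.2)
# -> 'san'
```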
  11. How to calculate $P(w_n \mid w_1^{n-1})$
       The easiest way: maximum likelihood
        $P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^n)}{c(w_1^{n-1})}$
       Example:
        $P(\text{San} \mid \text{If you're going to}) = \dfrac{c(\text{If you're going to San})}{c(\text{If you're going to})}$
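A self-contained sketch of maximum-likelihood estimation from raw n-gram counts, using a toy corpus and hypothetical helper names:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every length-n subsequence of the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def p_ml(ngram, counts_n, counts_hist):
    """Maximum likelihood: c(w_1^n) / c(w_1^{n-1})."""
    hist = ngram[:-1]
    return counts_n[ngram] / counts_hist[hist] if counts_hist[hist] else 0.0

tokens = "if you're going to san francisco if you're going to the fair".split()
c5, c4 = ngram_counts(tokens, 5), ngram_counts(tokens, 4)
p_ml(("if", "you're", "going", "to", "san"), c5, c4)  # -> 0.5
```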
  12. Content (section divider; same outline as slide 2)
  13. Intro Generalized Language Models (GLMs)
       Main idea: insert wildcard words (∗) into sequences
       Example: instead of $P(\text{San} \mid \text{If you're going to})$:
      • $P(\text{San} \mid \text{If} \ast \ast \ast)$
      • $P(\text{San} \mid \text{If} \ast \ast \text{to})$
      • $P(\text{San} \mid \text{If} \ast \text{going} \ast)$
      • $P(\text{San} \mid \text{If} \ast \text{going to})$
      • $P(\text{San} \mid \text{If you're} \ast \ast)$
      • …
       Separate different types of GLMs based on:
      1. Sequence length
      2. Number of wildcard words
       (e.g. $P(\text{San} \mid \text{If} \ast \text{going} \ast)$ has length 5 with 2 wildcard words)
       Aggregate the results
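One plausible way to enumerate such wildcard patterns; keeping the first context word fixed mirrors the slide's examples and is an assumption, not a stated rule:

```python
from itertools import combinations

def wildcard_patterns(context):
    """Enumerate context variants with wildcard words '*' inserted."""
    fixed, rest = context[0], context[1:]
    for k in range(len(rest) + 1):              # k = number of wildcard words
        for holes in combinations(range(len(rest)), k):
            yield (fixed,) + tuple('*' if i in holes else w
                                   for i, w in enumerate(rest))

list(wildcard_patterns(("If", "you're", "going", "to")))
# includes ('If', '*', '*', '*'), ('If', '*', 'going', 'to'), ('If', "you're", '*', '*'), ...
```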
  14. Why Generalized Language Models?
       Data sparsity of n-grams: "If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"
       Question: does that really improve the prediction?
       Result of evaluation: yes … but we should use smoothing for language models
  15. Content (section divider; same outline as slide 2)
  16. Smoothing
       Problem: unseen sequences
       Try to estimate probabilities of unseen sequences
       Probabilities of seen sequences need to be reduced accordingly (to keep a proper distribution)
       Two approaches:
      1. Backoff smoothing
      2. Interpolated smoothing
  17. Backoff smoothing
       If a sequence is unseen, use a shorter sequence
       E.g. if $P(\text{San} \mid \text{going to}) = 0$, use $P(\text{San} \mid \text{to})$

      $P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$

       Here $\tau$ is the (discounted) higher-order probability and $\gamma$ a backoff weight; the lower-order probability is applied recursively
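A sketch of the backoff recursion, with `tau` and `gamma` as placeholders for the discounted probability τ and the weight γ (their estimation is not specified on this slide):

```python
def p_backoff(word, history, tau, gamma, counts):
    """Backoff: use the discounted estimate tau if the full sequence
    was seen, otherwise gamma times the shorter-history estimate."""
    if not history:
        return tau(word, history)             # assumed unigram base case
    if counts.get(history + (word,), 0) > 0:
        return tau(word, history)
    return gamma(history) * p_backoff(word, history[1:], tau, gamma, counts)
```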
  18. Interpolated Smoothing
       Always use the shorter sequence in the calculation:

      $P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$

       $\tau$: higher-order probability; $\gamma$: weight; $P_{\mathrm{inter}}$ on the right: recursive lower-order probability
       Seems to work better than backoff smoothing
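The interpolated variant differs from backoff only in always adding the recursive term; same placeholder `tau` and `gamma` as in the backoff sketch:

```python
def p_interp(word, history, tau, gamma):
    """Interpolation: always mix tau with the recursively smoothed
    lower-order estimate, whether or not the sequence was seen."""
    if not history:
        return tau(word, history)             # assumed base case
    return (tau(word, history)
            + gamma(history) * p_interp(word, history[1:], tau, gamma))
```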
  19. Kneser-Ney smoothing [KN95] intro
       An interpolated smoothing method
       Idea: improve the lower-order calculation
       Example: the word "visiting" is unseen in the corpus
      • $P(\text{Francisco} \mid \text{visiting}) = 0$ → normal interpolation: $0 + \gamma \cdot P(\text{Francisco})$
      • $P(\text{San} \mid \text{visiting}) = 0$ → normal interpolation: $0 + \gamma \cdot P(\text{San})$
       Result: "Francisco" comes out as likely as "San" at that position. Is that correct?
       What distinguishes "Francisco" from "San"? Answer: the number of different contexts each appears in ("Francisco" occurs almost exclusively after "San")
  20. Kneser-Ney smoothing idea
       For the lower-order calculation, don't use the raw count $c(w_n)$
       Instead: the number of different bigrams the word completes (a continuation count):
        $N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
       Or in general:
        $N_{1+}(\bullet\, w_{i+1}^{n}) := |\{ w_i : c(w_i^{n}) > 0 \}|$
       In addition:
      • $N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) := \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
      • $N_{1+}(w_i^{n-1}\, \bullet) := |\{ w_n : c(w_i^{n}) > 0 \}|$
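Continuation counts are straightforward to compute from bigram counts; a sketch with a hypothetical function name:

```python
from collections import defaultdict

def continuation_counts(bigram_counts):
    """N_{1+}(. w): number of distinct words preceding w in a seen bigram."""
    preceding = defaultdict(set)
    for (w1, w2), count in bigram_counts.items():
        if count > 0:
            preceding[w2].add(w1)
    return {w: len(ws) for w, ws in preceding.items()}

# "Francisco" may be frequent overall but typically follows only "San",
# so its continuation count stays small, unlike that of "San" itself.
```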
  21. Kneser-Ney smoothing equation (highest order)
       Highest-order calculation:

      $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^n) - D,\, 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

       $\max\{\cdot, 0\}$ assures a positive value; $D$ is a discount value with $0 \le D \le 1$; the second term is the lower-order weight times the (recursive) lower-order probability
  22. Kneser-Ney smoothing equation (lower orders)
       Lower-order calculation (continuation counts replace raw counts):

      $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D,\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

       Lowest-order calculation:

      $P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$
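Putting the three cases together, a compact and deliberately naive sketch (each $N_{1+}$ helper scans all counts, O(|c|) per lookup), assuming `c` maps n-gram tuples of every order to their counts, e.g. built with the earlier hypothetical `ngram_counts` for n = 1..5:

```python
def N1p_right(c, seq):   # N_{1+}(seq .): distinct continuations of seq
    return len({g[-1] for g in c if g[:-1] == seq and c[g] > 0})

def N1p_left(c, seq):    # N_{1+}(. seq): distinct left extensions of seq
    return len({g[0] for g in c if g[1:] == seq and c[g] > 0})

def N1p_both(c, seq):    # N_{1+}(. seq .)
    return len({(g[0], g[-1]) for g in c
                if len(g) == len(seq) + 2 and g[1:-1] == seq and c[g] > 0})

def p_kn(c, word, hist, D=0.75, highest=True):
    """Kneser-Ney recursion from the slides, with a single discount D."""
    if not hist:                              # lowest order
        return N1p_left(c, (word,)) / max(N1p_both(c, ()), 1)
    if highest:                               # raw counts at the top level
        num = max(c.get(hist + (word,), 0) - D, 0)
        den = c.get(hist, 0)
    else:                                     # continuation counts below
        num = max(N1p_left(c, hist + (word,)) - D, 0)
        den = N1p_both(c, hist)
    if den == 0:                              # pragmatic guard, not from the slides
        return p_kn(c, word, hist[1:], D, highest=False)
    weight = D / den * N1p_right(c, hist)
    return num / den + weight * p_kn(c, word, hist[1:], D, highest=False)
```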
  23. Modified Kneser-Ney smoothing [CG98]
       Different discount values for different absolute counts: $D(c)$ selects $D_1$, $D_2$, or $D_{3+}$ depending on whether $c(w_i^n)$ is 1, 2, or at least 3
       Lower-order calculation:

      $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^n) - D(c(w_i^n)),\, 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

       Here $N_k(w_i^{n-1}\, \bullet)$ counts the words $w_n$ with $c(w_i^n) = k$ (and $\geq 3$ for $N_{3+}$)
       State of the art (for 15 years now!)
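[CG98] also give closed-form estimates for the three discounts from counts-of-counts; a sketch assuming the corpus is large enough that $n_1 \ldots n_4$ are all nonzero:

```python
from collections import Counter

def mkn_discounts(counts):
    """Discount estimates from [CG98], computed from counts-of-counts."""
    n = Counter(counts.values())   # n[k] = number of n-grams seen exactly k times
    Y = n[1] / (n[1] + 2 * n[2])
    D1  = 1 - 2 * Y * n[2] / n[1]
    D2  = 2 - 3 * Y * n[3] / n[2]
    D3p = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3p

def D_of(count, D1, D2, D3p):
    """D(c) as used in the equation: discount chosen by absolute count."""
    if count == 0:
        return 0.0
    return D1 if count == 1 else D2 if count == 2 else D3p
```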
  24. Smoothing of GLMs
       We can use all smoothing techniques on GLMs as well!
       Small modification needed, e.g. the lower-order sequence for $P(\text{San} \mid \text{If} \ast \text{going} \ast)$:
      • normally: $P(\text{San} \mid \ast \text{going} \ast)$
      • instead use: $P(\text{San} \mid \text{going} \ast)$
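One reading of this modification as code (hypothetical helper name; treats the wildcard as a literal '*' token): drop the first word, then any wildcards that now lead the sequence.

```python
def glm_lower_order(context):
    """Lower-order GLM context: drop the first word, then strip leading
    wildcards, e.g. ('If', '*', 'going', '*') -> ('going', '*')."""
    shorter = context[1:]
    while shorter and shorter[0] == '*':
        shorter = shorter[1:]
    return shorter
```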
  25. Content (section divider; same outline as slide 2)
  26. Progress
       Done so far:
      • Extracting text from XML files
      • Building GLMs
      • Kneser-Ney and modified Kneser-Ney smoothing
      • Indexing with MySQL
       To do:
      • Finish evaluation program
      • Run evaluation
      • Analyze results
  27. Content (section divider; same outline as slide 2)
  28. Summary
       Data Sets: • More Data • Better Data
       Language Models: • n-grams • Generalized Language Models
       Smoothing: • Katz • Good-Turing • Witten-Bell • Kneser-Ney • …
  29. Thank you for your attention! Questions?
  30. Sources
       Images:
      • Wheelchair Joystick (Slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
      • Smartphone Keyboard (Slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
       References:
      • [CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
      • [JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
      • [KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.
