Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction

The first talk on the topic of my bachelor thesis with a focus on Kneser-Ney smoothing.

Slide 1: Title
Web Science & Technologies, University of Koblenz ▪ Landau, Germany
Martin Körner (mkoerner@uni-koblenz.de)
Oberseminar, 25.07.2013

Slide 2: Content
• Introduction
• Language Models
• Generalized Language Models
• Smoothing
• Progress
• Summary

Slide 4: Introduction: Motivation
• Next word prediction: What is the next word a user will type?
• Use cases for next word prediction:
  • Augmentative and Alternative Communication (AAC)
  • Small keyboards (smartphones)

Slide 5: Introduction to next word prediction
• How do we predict words?
  1. Rationalist approach
     • Manually encoding information about language
     • Works for "toy" problems only
  2. Empiricist approach
     • Statistical, pattern-recognition, and machine-learning methods applied to corpora
     • Result: language models

Slide 7: Language models in general
• Language model: How likely is a sentence $s$?
• Probability distribution: $P(s)$
• Calculate $P(s)$ by multiplying conditional probabilities (chain rule), as in the sketch below
• Example:
  P(If you're going to San Francisco , be sure …)
  = P(you're | If) · P(going | If you're) · P(to | If you're going) · P(San | If you're going to) · P(Francisco | If you're going to San) · …
• Estimating these conditional probabilities directly from full histories fails empirically: most long word histories never occur in a corpus

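A minimal sketch of this chain-rule product; cond_prob is a hypothetical estimator for the conditional probabilities, and, like the example above, the product starts with the second word.

```python
def sentence_probability(tokens, cond_prob):
    """P(s) as a product of conditional probabilities of each word given its full history."""
    p = 1.0
    for i in range(1, len(tokens)):
        p *= cond_prob(tokens[i], tuple(tokens[:i]))
    return p

# sentence_probability("If you're going to San Francisco".split(), cond_prob)
```
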
Slide 8: Conditional probabilities simplified
• Markov assumption [JM80]: only the last n−1 words are relevant for a prediction
• Example with n = 5:
  P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be)
  (the comma counts as a word)

Slide 9: Definitions and Markov assumption
• n-gram: sequence of length n together with its count
  • E.g. the 5-gram "If you're going to San" with count 4
• Sequence notation: $w_1^{i-1} := w_1 w_2 \dots w_{i-1}$
• Markov assumption formalized (see the sketch below):
  $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$, i.e. conditioning only on the last n−1 words

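A one-line helper illustrating the truncation of the history to its last n−1 words; purely an illustrative sketch, not code from the thesis.

```python
def markov_context(history, n):
    """Keep only the last n-1 words of the history (Markov assumption)."""
    return tuple(history[-(n - 1):]) if n > 1 else ()

# markov_context("If you're going to San Francisco , be".split(), 5)
# -> ('San', 'Francisco', ',', 'be')
```
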
Slide 10: Formalizing next word prediction
• Instead of P(s), only one conditional probability is needed:
  • $P(w_i \mid w_{i-n+1}^{i-1})$ (the conditional probability with Markov assumption)
  • Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$, the word following n−1 context words
• Next word prediction (see the sketch below):
  $\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$, where $W$ is the set of all words in the corpus
• How to calculate the probability $P(w_n \mid w_1^{n-1})$?

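A direct transcription of the argmax; cond_prob and vocabulary are assumed inputs, not names from the slides.

```python
def predict_next_word(context, vocabulary, cond_prob):
    """NWP(context): the word w in the vocabulary maximizing P(w | context)."""
    return max(vocabulary, key=lambda w: cond_prob(w, context))
```
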
Slide 11: How to calculate $P(w_n \mid w_1^{n-1})$
• The easiest way: the maximum-likelihood estimate (sketched below)
  $P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \frac{c(w_1^n)}{c(w_1^{n-1})}$
• Example:
  P(San | If you're going to) = c(If you're going to San) / c(If you're going to)

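A small sketch of the maximum-likelihood estimate over a tokenized corpus; the function name and interface are illustrative.

```python
from collections import Counter

def ml_estimator(tokens, n):
    """Return prob(word, context) = c(context + word) / c(context) for (n-1)-word contexts."""
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    history_counts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))

    def prob(word, context):
        context = tuple(context)
        if history_counts[context] == 0:
            return 0.0  # unseen history: exactly the case smoothing has to handle
        return ngram_counts[context + (word,)] / history_counts[context]

    return prob

# prob = ml_estimator(corpus_tokens, 5)
# prob("San", ("If", "you're", "going", "to"))
```
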
Slide 13: Intro Generalized Language Models (GLMs)
• Main idea: insert wildcard words (∗) into sequences
• Example (enumerated in the sketch below): instead of P(San | If you're going to):
  • P(San | If ∗ ∗ ∗)
  • P(San | If ∗ ∗ to)
  • P(San | If ∗ going ∗)
  • P(San | If ∗ going to)
  • P(San | If you're ∗ ∗)
  • …
  (e.g. "If ∗ going ∗ San" has length 5 and 2 wildcard words)
• Separate the different types of GLMs by:
  1. Sequence length
  2. Number of wildcard words
• Aggregate the results

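One way the wildcard contexts above could be enumerated, representing a wildcard as the string "*"; that the first context word is always kept is my reading of the examples, not something the slides state explicitly.

```python
from itertools import combinations

def wildcard_contexts(context, num_wildcards):
    """All variants of the context with num_wildcards words replaced by '*',
    keeping the first word fixed as in the slide's examples."""
    variants = []
    for positions in combinations(range(1, len(context)), num_wildcards):
        variants.append(tuple("*" if i in positions else w for i, w in enumerate(context)))
    return variants

# wildcard_contexts(("If", "you're", "going", "to"), 2)
# -> [('If', '*', '*', 'to'), ('If', '*', 'going', '*'), ('If', "you're", '*', '*')]
```
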
Slide 14: Why Generalized Language Models?
• Data sparsity of n-grams:
  "If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"
• Question: Does that really improve the prediction?
• Result of evaluation: Yes … but we should use smoothing for language models

Slide 16: Smoothing
• Problem: unseen sequences
  • Try to estimate probabilities for unseen sequences
  • Probabilities of seen sequences need to be reduced accordingly
• Two approaches:
  1. Backoff smoothing
  2. Interpolated smoothing

Slide 17: Backoff smoothing
• If a sequence is unseen, use a shorter sequence
  • E.g. if P(San | going to) = 0, use P(San | to)
• Definition (sketched below):
  $P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
  with the (discounted) higher-order probability $\tau$, a weight $\gamma$, and the recursive lower-order probability $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$

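A minimal sketch of the backoff case distinction, assuming hypothetical helpers tau(word, context), gamma(context), count(sequence), and unigram(word); it is only meant to make the recursion explicit.

```python
def p_backoff(word, context, tau, gamma, count, unigram):
    """Use the higher-order estimate if the full sequence was seen, otherwise back off."""
    if not context:
        return unigram(word)                     # end of the recursion: unigram probability
    if count(context + (word,)) > 0:
        return tau(word, context)                # seen: (discounted) higher-order probability
    return gamma(context) * p_backoff(word, context[1:], tau, gamma, count, unigram)
```
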
Slide 18: Interpolated smoothing
• Always combine the higher-order estimate with the shorter sequence:
  $P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$
  with the higher-order probability $\tau$, a weight $\gamma$, and the recursive lower-order probability (sketched below)
• Seems to work better than backoff smoothing

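With the same hypothetical helpers as in the backoff sketch, the interpolated recursion differs only in always adding the lower-order term:

```python
def p_interpolated(word, context, tau, gamma, unigram):
    """Always mix the higher-order estimate with the (recursive) lower-order estimate."""
    if not context:
        return unigram(word)
    return tau(word, context) + gamma(context) * p_interpolated(word, context[1:], tau, gamma, unigram)
```
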
Slide 19: Kneser-Ney smoothing [KN95] intro
• An interpolated smoothing method
• Idea: improve the lower-order calculation
• Example: the word "visiting" is unseen in the corpus
  • P(Francisco | visiting) = 0, so normal interpolation gives 0 + γ · P(Francisco)
  • P(San | visiting) = 0, so normal interpolation gives 0 + γ · P(San)
  • Result: "Francisco" is as likely as "San" at that position. Is that correct?
• What distinguishes "Francisco" from "San"?
  • Answer: the number of different contexts each word appears in ("Francisco" occurs almost exclusively after "San")

Slide 20: Kneser-Ney smoothing idea
• For the lower-order calculation:
  • Don't use the count $c(w_n)$
  • Instead use the number of different bigrams the word completes (its continuation count, sketched below):
    $N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
  • Or in general:
    $N_{1+}(\bullet\, w_{i+1}^{n}) = |\{ w_i : c(w_i^{n}) > 0 \}|$
• In addition:
  • $N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) = \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
  • $N_{1+}(w_i^{n-1}\, \bullet) = |\{ w_n : c(w_i^{n}) > 0 \}|$

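A small sketch of the bigram-level continuation count N1+(• w) computed from a token stream; the helper name is illustrative.

```python
from collections import defaultdict

def continuation_counts(tokens):
    """N1+(. w): for each word w, the number of distinct words that precede it at least once."""
    predecessors = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        predecessors[word].add(prev)
    return {word: len(prevs) for word, prevs in predecessors.items()}

# In the slide's example, "Francisco" follows almost only "San", so its
# continuation count stays small even if its absolute count is large.
```
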
Slide 21: Kneser-Ney smoothing equation (highest order)
• Highest-order calculation:
  $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{c(w_i^{n}) - D,\ 0\}}{c(w_i^{n-1})} + \frac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
  with the count $c(w_i^{n})$, the total count $c(w_i^{n-1})$, a discount value $0 \le D \le 1$ (the max ensures a non-negative numerator), the lower-order weight $\frac{D}{c(w_i^{n-1})} N_{1+}(w_i^{n-1}\,\bullet)$, and the recursive lower-order probability $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

Slide 22: Kneser-Ney smoothing equation (lower orders)
• Lower-order calculation (the full recursion is sketched below):
  $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\, N_{1+}(w_i^{n-1}\, \bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
  with the continuation count $N_{1+}(\bullet\, w_i^{n})$, the total continuation count $N_{1+}(\bullet\, w_i^{n-1}\, \bullet)$, the discount $D$, and the lower-order weight and recursive lower-order probability as before
• Lowest-order calculation:
  $P_{\mathrm{KN}}(w_n) = \frac{N_{1+}(\bullet\, w_n)}{N_{1+}(\bullet\, \bullet)}$

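A sketch of the whole recursion under simplifying assumptions: a single discount D, precomputed helpers for absolute and continuation counts with illustrative names, and a zero-denominator guard that is my own addition rather than part of the slides.

```python
def p_kneser_ney(word, context, count, cont, D=0.75, highest=True):
    """Interpolated Kneser-Ney following the recursion on slides 21 and 22.

    Assumed helpers (illustrative names):
      count(seq)          -> c(seq), absolute count of a token tuple
      cont.completes(seq) -> N1+(. seq), distinct left extensions of seq
      cont.right(ctx)     -> N1+(ctx .), distinct right extensions of ctx
      cont.middle(ctx)    -> N1+(. ctx .), distinct left/right extension pairs of ctx
      cont.total          -> N1+(. .), number of distinct bigram types
    """
    if not context:                                  # lowest order: continuation unigram
        return cont.completes((word,)) / cont.total

    seq = context + (word,)
    if highest:                                      # highest order uses absolute counts
        numerator, denominator = max(count(seq) - D, 0.0), count(context)
    else:                                            # lower orders use continuation counts
        numerator, denominator = max(cont.completes(seq) - D, 0.0), cont.middle(context)

    lower = p_kneser_ney(word, context[1:], count, cont, D, highest=False)
    if denominator == 0:                             # guard, not on the slides
        return lower
    return numerator / denominator + (D / denominator) * cont.right(context) * lower
```
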
Slide 23: Modified Kneser-Ney smoothing [CG98]
• Different discount values for different absolute counts (their estimation is sketched below)
• Lower-order calculation:
  $P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \frac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)} + \frac{D_1 N_1(w_i^{n-1}\, \bullet) + D_2 N_2(w_i^{n-1}\, \bullet) + D_{3+} N_{3+}(w_i^{n-1}\, \bullet)}{N_{1+}(\bullet\, w_i^{n-1}\, \bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$
• State of the art (for 15 years now!)

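Chen and Goodman [CG98] estimate the discounts D1, D2, D3+ from the counts-of-counts n1…n4; a short sketch of those estimates, assuming the counts-of-counts are already available:

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """D1, D2, D3+ from counts-of-counts n_k (number of n-grams seen exactly k times),
    following Chen and Goodman [CG98]."""
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus
```
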
Slide 24: Smoothing of GLMs
• We can use all of these smoothing techniques on GLMs as well!
• Small modification for the lower-order sequence, e.g. for P(San | If ∗ going ∗):
  • Normally the lower-order sequence would be P(San | ∗ going ∗)
  • Instead use P(San | going ∗) (sketched below)

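A tiny sketch of that shortening step, representing wildcards as "*"; generalizing the single example on the slide to "drop the first word and any leading wildcards" is my own reading.

```python
def glm_lower_order_context(context):
    """Shorten a GLM context: drop the first word, then drop any leading wildcards."""
    shorter = list(context[1:])
    while shorter and shorter[0] == "*":
        shorter.pop(0)
    return tuple(shorter)

# glm_lower_order_context(("If", "*", "going", "*"))  ->  ("going", "*")
```
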
Slide 26: Progress
• Done so far:
  • Extract text from XML files
  • Building GLMs
  • Kneser-Ney and modified Kneser-Ney smoothing
  • Indexing with MySQL
• To do:
  • Finish the evaluation program
  • Run the evaluation
  • Analyze the results

Slide 28: Summary
• Data Sets: more data, better data
• Language Models: n-grams, Generalized Language Models
• Smoothing: Katz, Good-Turing, Witten-Bell, Kneser-Ney, …

Slide 29: Thank you for your attention! Questions?

Slide 30: Sources
• Images:
  • Wheelchair joystick (Slide 4): http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
  • Smartphone keyboard (Slide 4): https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
• References:
  • [CG98]: Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
  • [JM80]: F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
  • [KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.