Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Martin Körner
Oberseminar, 25.07.2013
Content
 Introduction
 Language Models
 Generalized Language Models
 Smoothing
 Progress
 Summary
Introduction: Motivation
 Next word prediction: What is the next word a user will type?
 Use cases for next word prediction:
 Augmentative and Alternative Communication (AAC)
 Small keyboards (smartphones)
Introduction to next word prediction
 How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• “Toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied to corpora
• Result: Language models
Language models in general
 Language model: How likely is a sentence s?
 Probability distribution: P(s)
 Calculate P(s) by multiplying conditional probabilities
 Example:
P(If you're going to San Francisco , be sure …) =
P(you're | If) ∗ P(going | If you're) ∗ P(to | If you're going) ∗ P(San | If you're going to) ∗ P(Francisco | If you're going to San) ∗ ⋯
 A purely empirical approach on these full histories would fail (such long word sequences are far too sparse); see the sketch below
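To make the chain-rule decomposition above concrete, here is a minimal Python sketch (not part of the original talk); `cond_prob` is a hypothetical estimator for P(word | history) that a language model would provide.

```python
from typing import Callable, Sequence

def sentence_probability(words: Sequence[str],
                         cond_prob: Callable[[str, Sequence[str]], float]) -> float:
    """Chain rule: P(w1 ... wm) = product over i of P(w_i | w_1 ... w_{i-1})."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])  # history = all preceding words
    return prob
```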
Conditional probabilities simplified
 Markov assumption [JM80]:
 Only the last n-1 words are relevant for a prediction
 Example with n=5:
P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be)
(the comma counts as a word)
Definitions and Markov assumption
 n-gram: Sequence of length n with a count
 E.g., 5-gram: "If you're going to San" with count 4
 Sequence naming: $w_1^{i-1} := w_1 w_2 \ldots w_{i-1}$
 Markov assumption formalized: $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$, where $w_{i-n+1}^{i-1}$ are the last n-1 words
Formalizing next word prediction
 Instead of P(s):
 Only one conditional probability with the Markov assumption: $P(w_i \mid w_{i-n+1}^{i-1})$
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$, where $w_1^{n-1}$ are the n-1 words typed so far
$\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$, with $W$ the set of all words in the corpus
 How to calculate the probability $P(w_n \mid w_1^{n-1})$?
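A small sketch of the NWP argmax (my own illustration, not the talk's code): score every vocabulary word against the last n-1 context words with some probability estimator `cond_prob` and return the best one.

```python
def predict_next_word(context, vocabulary, cond_prob, n=5):
    """NWP(w_1^{n-1}) = argmax over w in W of P(w | last n-1 words)."""
    context = tuple(context)[-(n - 1):]          # Markov assumption: keep only the last n-1 words
    return max(vocabulary, key=lambda w: cond_prob(w, context))
```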
How to calculate $P(w_n \mid w_1^{n-1})$
 The easiest way:
 Maximum likelihood:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^{n})}{c(w_1^{n-1})}$
 Example:
P(San | If you're going to) = c(If you're going to San) / c(If you're going to)
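As an illustration (with a hypothetical `counts` table mapping word tuples to corpus frequencies), the maximum-likelihood estimate from this slide could be computed as:

```python
def p_ml(word, context, counts):
    """P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})."""
    denom = counts.get(tuple(context), 0)
    if denom == 0:
        return 0.0                               # unseen context: no estimate
    return counts.get(tuple(context) + (word,), 0) / denom
```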
Intro Generalized Language Models (GLMs)
 Main idea:
 Insert wildcard words (∗) into sequences
 Example:
 Instead of P(San | If you're going to):
• P(San | If ∗ ∗ ∗)
• P(San | If ∗ ∗ to)
• P(San | If ∗ going ∗)
• P(San | If ∗ going to)
• P(San | If you're ∗ ∗)
• …
 Separate different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
 Aggregate results
(e.g., P(San | If ∗ going ∗): sequence length 5, 2 wildcard words)
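One way to enumerate such generalized contexts is sketched below (an assumption of mine, keeping the first context word fixed as in the examples above):

```python
from itertools import combinations

def wildcard_patterns(context):
    """Yield all generalizations of `context` with inner words replaced by '*'."""
    positions = range(1, len(context))               # the first word stays fixed
    for k in range(len(context)):                    # k = number of wildcard words
        for skipped in combinations(positions, k):
            yield tuple("*" if i in skipped else w
                        for i, w in enumerate(context))

# e.g. wildcard_patterns(("If", "you're", "going", "to")) yields, among others,
# ("If", "*", "*", "to"), ("If", "*", "going", "*"), ("If", "you're", "*", "*")
```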
Why Generalized Language Models?
 Data sparsity of n-grams
 "If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"
 Question: Does that really improve the prediction?
 Result of evaluation: Yes … but we should use smoothing for language models
Smoothing
 Problem: Unseen sequences
 Try to estimate probabilities of unseen sequences
 Probabilities of seen sequences need to be reduced
 Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
 If sequence unseen: use shorter sequence
 E.g.: if P(San | going to) = 0, use P(San | to)

$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^{n}) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^{n}) = 0 \end{cases}$

Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
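A recursive sketch of this backoff scheme (my own illustration; `tau`, `gamma` and `counts` are hypothetical ingredients, not the talk's implementation):

```python
def p_backoff(word, context, counts, tau, gamma):
    """Use the higher-order estimate tau if the full sequence was seen,
    otherwise back off to the shorter context, scaled by a weight gamma."""
    context = tuple(context)
    if not context:                                  # recursion bottoms out at the unigram
        return tau(word, ())
    if counts.get(context + (word,), 0) > 0:
        return tau(word, context)                    # higher order probability
    return gamma(context) * p_backoff(word, context[1:], counts, tau, gamma)
```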
Interpolated Smoothing
 Always use the shorter sequence in the calculation as well

$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$

Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
 Seems to work better than backoff smoothing
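The corresponding interpolated recursion, in the same sketch style as the backoff example above (again with hypothetical `tau` and `gamma`):

```python
def p_interpolated(word, context, tau, gamma):
    """Always mix the higher-order estimate with the lower-order one."""
    context = tuple(context)
    if not context:
        return tau(word, ())
    return (tau(word, context)
            + gamma(context) * p_interpolated(word, context[1:], tau, gamma))
```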
Kneser-Ney smoothing [KN95] intro
 Interpolated smoothing
 Idea: Improve lower order calculation
 Example: the word "visiting" is unseen in the corpus
P(Francisco | visiting) = 0
 Normal interpolation: 0 + γ ∗ P(Francisco)
P(San | visiting) = 0
 Normal interpolation: 0 + γ ∗ P(San)
Result: Francisco is as likely as San at that position (assuming their unigram counts are similar)
Is that correct?
 What is the difference between Francisco and San?
Answer: the number of different contexts they appear in
Kneser-Ney smoothing idea
 For the lower order calculation:
 Don't use the count $c(w_n)$
 Instead: the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
 Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) = |\{ w_i : c(w_i^{n}) > 0 \}|$
 In addition:
 $N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) = \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
 $N_{1+}(w_i^{n-1}\, \bullet) = |\{ w_n : c(w_i^{n}) > 0 \}|$
($|\cdot|$ denotes the count, i.e. the size of the set)
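A sketch of how these continuation counts could be derived from an n-gram count table (my own illustration; `ngram_counts` is assumed to map word tuples to frequencies):

```python
from collections import defaultdict

def continuation_counts(ngram_counts):
    """Return N1+(• seq) and N1+(seq •) tables derived from absolute counts."""
    preceding = defaultdict(set)    # suffix -> distinct words seen before it
    following = defaultdict(set)    # prefix -> distinct words seen after it
    for ngram, count in ngram_counts.items():
        if count > 0 and len(ngram) >= 2:
            preceding[ngram[1:]].add(ngram[0])
            following[ngram[:-1]].add(ngram[-1])
    n1plus_left = {seq: len(words) for seq, words in preceding.items()}   # N1+(• seq)
    n1plus_right = {seq: len(words) for seq, words in following.items()}  # N1+(seq •)
    return n1plus_left, n1plus_right
```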
Kneser-Ney smoothing equation (highest)
 Highest order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^{n}) - D,\ 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\,\bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

Here $c(w_i^{n})$ is the count and $c(w_i^{n-1})$ the total count; $\max\{\cdot,\,0\}$ assures a positive value; $D$ is the discount value with $0 \le D \le 1$; $\frac{D}{c(w_i^{n-1})} N_{1+}(w_i^{n-1}\,\bullet)$ is the lower order weight; $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
Kneser-Ney smoothing equation
 Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\, N_{1+}(w_i^{n-1}\,\bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

 Lowest order calculation:

$P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_i^{n})}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}$

Here $N_{1+}(\bullet\, w_i^{n})$ is the continuation count and $N_{1+}(\bullet\, w_i^{n-1}\,\bullet)$ the total continuation count; $\max\{\cdot,\,0\}$ assures a positive value; $D$ is the discount value; the factor in the second term is the lower order weight; $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
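Putting the two slides together, a rough sketch of the full Kneser-Ney recursion with a single discount D (the tables `counts`, `n1p_left`, `n1p_right`, `n1p_both` are hypothetical inputs for illustration, not the thesis implementation):

```python
def p_kn(word, context, counts, n1p_left, n1p_right, n1p_both, D, highest=True):
    """Kneser-Ney: absolute counts at the highest order, continuation counts below."""
    context = tuple(context)
    if not context:                                   # lowest order: unigram continuation prob
        return n1p_left.get((word,), 0) / max(n1p_both.get((), 0), 1)
    full = context + (word,)
    if highest:                                       # highest order uses raw counts
        denom = max(counts.get(context, 0), 1)        # guard against zero in this sketch
        num = max(counts.get(full, 0) - D, 0.0)
    else:                                             # lower orders use continuation counts
        denom = max(n1p_both.get(context, 0), 1)
        num = max(n1p_left.get(full, 0) - D, 0.0)
    weight = (D / denom) * n1p_right.get(context, 0)  # lower order weight
    return num / denom + weight * p_kn(word, context[1:], counts, n1p_left,
                                       n1p_right, n1p_both, D, highest=False)
```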
Modified Kneser-Ney smoothing [CG98]
 Different discount values for different absolute counts
 Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\,\bullet) + D_2 N_2(w_i^{n-1}\,\bullet) + D_{3+} N_{3+}(w_i^{n-1}\,\bullet)}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

 State of the art (for 15 years now!)
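For context, [CG98] estimate the three discounts from the counts-of-counts n1..n4 (the number of n-grams seen exactly once, twice, three and four times). A sketch of those closed-form estimates, as I read them from the paper (not shown on the slide; assumes n1..n3 are nonzero):

```python
def modified_kn_discounts(ngram_counts):
    """D1, D2, D3+ from counts-of-counts, following the closed-form estimates in [CG98]."""
    n = [0] * 5                                   # n[c] = number of n-grams seen exactly c times
    for count in ngram_counts.values():
        if 1 <= count <= 4:
            n[count] += 1
    Y = n[1] / (n[1] + 2 * n[2])
    D1 = 1 - 2 * Y * n[2] / n[1]
    D2 = 2 - 3 * Y * n[3] / n[2]
    D3_plus = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3_plus
```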
Smoothing of GLMs
 We can use all smoothing techniques on GLMs as well!
 Small modification for the lower order sequence:
E.g., for P(San | If ∗ going ∗):
– Normally the lower order sequence would be P(San | ∗ going ∗)
– Instead use P(San | going ∗) (see the sketch below)
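A tiny sketch of that modification (my own illustration): after dropping the first context word for the lower-order model, also drop any wildcards that end up at the front.

```python
def glm_lower_order(context, wildcard="*"):
    """Lower-order context for a generalized sequence, skipping leading wildcards."""
    shorter = list(context[1:])            # usual lower-order step: drop the first word
    while shorter and shorter[0] == wildcard:
        shorter.pop(0)                     # drop wildcards now dangling at the front
    return tuple(shorter)

# glm_lower_order(("If", "*", "going", "*")) == ("going", "*")
```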
Progress
 Done so far:
 Extract text from XML files
 Build GLMs
 Kneser-Ney and modified Kneser-Ney smoothing
 Indexing with MySQL
 To do:
 Finish evaluation program
 Run evaluation
 Analyze results
Summary
Data Sets: • More data • Better data
Language Models: • n-grams • Generalized Language Models
Smoothing: • Katz • Good-Turing • Witten-Bell • Kneser-Ney • …
Thank you for your attention!
Questions?
Sources
 Images:
 Wheelchair Joystick (Slide 4):
http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
 Smartphone Keyboard (Slide 4):
https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
 References:
 [CG98]: Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
 [JM80]: Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
 [KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.