Web Science & Technologies
University of Koblenz ▪ Landau, Germany
Introduction to Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
Martin Körner
Oberseminar, 25.07.2013
Content
 Introduction
 Language Models
 Generalized Language Models
 Smoothing
 Progress
 Summary
Introduction: Motivation
 Next word prediction: What is the next word a user will type?
 Use cases for next word prediction:
 Augmentative and Alternative Communication (AAC)
 Small keyboards (smartphones)
Introduction to next word prediction
 How do we predict words?
1. Rationalist approach
• Manually encoding information about language
• “Toy” problems only
2. Empiricist approach
• Statistical, pattern recognition, and machine learning methods applied to corpora
• Result: Language models
Language models in general
 Language model: How likely is a sentence s?
 Probability distribution: P(s)
 Calculate P(s) by multiplying conditional probabilities
 Example:
P(If you're going to San Francisco , be sure …) =
P(you're | If) ∗ P(going | If you're) ∗ P(to | If you're going) ∗ P(San | If you're going to) ∗ P(Francisco | If you're going to San) ∗ ⋯
 A purely empirical approach on these full histories would fail (such long word sequences are far too sparse); see the sketch below
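To make the chain-rule decomposition above concrete, here is a minimal Python sketch (not part of the original talk); `cond_prob` is a hypothetical estimator for P(word | history) that a language model would provide.

```python
from typing import Callable, Sequence

def sentence_probability(words: Sequence[str],
                         cond_prob: Callable[[str, Sequence[str]], float]) -> float:
    """Chain rule: P(w1 ... wm) = product over i of P(w_i | w_1 ... w_{i-1})."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= cond_prob(word, words[:i])  # history = all preceding words
    return prob
```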
Conditional probabilities simplified
 Markov assumption [JM80]:
 Only the last n-1 words are relevant for a prediction
 Example with n=5:
P(sure | If you're going to San Francisco , be) ≈ P(sure | San Francisco , be)
(the comma counts as a word)
Definitions and Markov assumption
 n-gram: Sequence of length n with a count
 E.g., 5-gram: "If you're going to San" with count 4
 Sequence naming: $w_1^{i-1} := w_1 w_2 \ldots w_{i-1}$
 Markov assumption formalized: $P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$, where $w_{i-n+1}^{i-1}$ are the last n-1 words
Formalizing next word prediction
 Instead of P(s):
 Only one conditional probability with the Markov assumption: $P(w_i \mid w_{i-n+1}^{i-1})$
• Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$, where $w_1^{n-1}$ are the n-1 words typed so far
$\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$, with $W$ the set of all words in the corpus
 How to calculate the probability $P(w_n \mid w_1^{n-1})$?
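A small sketch of the NWP argmax (my own illustration, not the talk's code): score every vocabulary word against the last n-1 context words with some probability estimator `cond_prob` and return the best one.

```python
def predict_next_word(context, vocabulary, cond_prob, n=5):
    """NWP(w_1^{n-1}) = argmax over w in W of P(w | last n-1 words)."""
    context = tuple(context)[-(n - 1):]          # Markov assumption: keep only the last n-1 words
    return max(vocabulary, key=lambda w: cond_prob(w, context))
```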
How to calculate $P(w_n \mid w_1^{n-1})$
 The easiest way:
 Maximum likelihood:
$P_{\mathrm{ML}}(w_n \mid w_1^{n-1}) = \dfrac{c(w_1^{n})}{c(w_1^{n-1})}$
 Example:
P(San | If you're going to) = c(If you're going to San) / c(If you're going to)
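As an illustration (with a hypothetical `counts` table mapping word tuples to corpus frequencies), the maximum-likelihood estimate from this slide could be computed as:

```python
def p_ml(word, context, counts):
    """P_ML(w_n | w_1^{n-1}) = c(w_1^n) / c(w_1^{n-1})."""
    denom = counts.get(tuple(context), 0)
    if denom == 0:
        return 0.0                               # unseen context: no estimate
    return counts.get(tuple(context) + (word,), 0) / denom
```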
Intro Generalized Language Models (GLMs)
 Main idea:
 Insert wildcard words (∗) into sequences
 Example:
 Instead of P(San | If you're going to):
• P(San | If ∗ ∗ ∗)
• P(San | If ∗ ∗ to)
• P(San | If ∗ going ∗)
• P(San | If ∗ going to)
• P(San | If you're ∗ ∗)
• …
 Separate different types of GLMs based on:
1. Sequence length
2. Number of wildcard words
 Aggregate results
(e.g., P(San | If ∗ going ∗): sequence length 5, 2 wildcard words)
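One way to enumerate such generalized contexts is sketched below (an assumption of mine, keeping the first context word fixed as in the examples above):

```python
from itertools import combinations

def wildcard_patterns(context):
    """Yield all generalizations of `context` with inner words replaced by '*'."""
    positions = range(1, len(context))               # the first word stays fixed
    for k in range(len(context)):                    # k = number of wildcard words
        for skipped in combinations(positions, k):
            yield tuple("*" if i in skipped else w
                        for i, w in enumerate(context))

# e.g. wildcard_patterns(("If", "you're", "going", "to")) yields, among others,
# ("If", "*", "*", "to"), ("If", "*", "going", "*"), ("If", "you're", "*", "*")
```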
Why Generalized Language Models?
 Data sparsity of n-grams
 "If you're going to San" is seen less often than, for example, "If ∗ ∗ to San"
 Question: Does that really improve the prediction?
 Result of evaluation: Yes … but we should use smoothing for language models
Smoothing
 Problem: Unseen sequences
 Try to estimate probabilities of unseen sequences
 Probabilities of seen sequences need to be reduced
 Two approaches:
1. Backoff smoothing
2. Interpolation smoothing
Backoff smoothing
 If sequence unseen: use shorter sequence
 E.g.: if P(San | going to) = 0, use P(San | to)

$P_{\mathrm{back}}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^{n}) > 0 \\ \gamma \cdot P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^{n}) = 0 \end{cases}$

Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{back}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
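A recursive sketch of this backoff scheme (my own illustration; `tau`, `gamma` and `counts` are hypothetical ingredients, not the talk's implementation):

```python
def p_backoff(word, context, counts, tau, gamma):
    """Use the higher-order estimate tau if the full sequence was seen,
    otherwise back off to the shorter context, scaled by a weight gamma."""
    context = tuple(context)
    if not context:                                  # recursion bottoms out at the unigram
        return tau(word, ())
    if counts.get(context + (word,), 0) > 0:
        return tau(word, context)                    # higher order probability
    return gamma(context) * p_backoff(word, context[1:], counts, tau, gamma)
```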
Interpolated Smoothing
 Always use the shorter sequence in the calculation as well

$P_{\mathrm{inter}}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \gamma \cdot P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$

Here $\tau(w_n \mid w_i^{n-1})$ is the higher order probability, $\gamma$ a weight, and $P_{\mathrm{inter}}(w_n \mid w_{i+1}^{n-1})$ the lower order probability (recursive).
 Seems to work better than backoff smoothing
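The corresponding interpolated recursion, in the same sketch style as the backoff example above (again with hypothetical `tau` and `gamma`):

```python
def p_interpolated(word, context, tau, gamma):
    """Always mix the higher-order estimate with the lower-order one."""
    context = tuple(context)
    if not context:
        return tau(word, ())
    return (tau(word, context)
            + gamma(context) * p_interpolated(word, context[1:], tau, gamma))
```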
Kneser-Ney smoothing [KN95] intro
 Interpolated smoothing
 Idea: Improve lower order calculation
 Example: the word "visiting" is unseen in the corpus
P(Francisco | visiting) = 0
 Normal interpolation: 0 + γ ∗ P(Francisco)
P(San | visiting) = 0
 Normal interpolation: 0 + γ ∗ P(San)
Result: Francisco is as likely as San at that position (assuming their unigram counts are similar)
Is that correct?
 What is the difference between Francisco and San?
Answer: the number of different contexts they appear in
Kneser-Ney smoothing idea
 For the lower order calculation:
 Don't use the count $c(w_n)$
 Instead: the number of different bigrams the word completes:
$N_{1+}(\bullet\, w_n) := |\{ w_{n-1} : c(w_{n-1}^{n}) > 0 \}|$
 Or in general:
$N_{1+}(\bullet\, w_{i+1}^{n}) = |\{ w_i : c(w_i^{n}) > 0 \}|$
 In addition:
 $N_{1+}(\bullet\, w_{i+1}^{n-1}\, \bullet) = \sum_{w_n} N_{1+}(\bullet\, w_{i+1}^{n})$
 $N_{1+}(w_i^{n-1}\, \bullet) = |\{ w_n : c(w_i^{n}) > 0 \}|$
($|\cdot|$ denotes the count, i.e. the size of the set)
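A sketch of how these continuation counts could be derived from an n-gram count table (my own illustration; `ngram_counts` is assumed to map word tuples to frequencies):

```python
from collections import defaultdict

def continuation_counts(ngram_counts):
    """Return N1+(• seq) and N1+(seq •) tables derived from absolute counts."""
    preceding = defaultdict(set)    # suffix -> distinct words seen before it
    following = defaultdict(set)    # prefix -> distinct words seen after it
    for ngram, count in ngram_counts.items():
        if count > 0 and len(ngram) >= 2:
            preceding[ngram[1:]].add(ngram[0])
            following[ngram[:-1]].add(ngram[-1])
    n1plus_left = {seq: len(words) for seq, words in preceding.items()}   # N1+(• seq)
    n1plus_right = {seq: len(words) for seq, words in following.items()}  # N1+(seq •)
    return n1plus_left, n1plus_right
```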
Kneser-Ney smoothing equation (highest)
 Highest order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{c(w_i^{n}) - D,\ 0\}}{c(w_i^{n-1})} + \dfrac{D}{c(w_i^{n-1})}\, N_{1+}(w_i^{n-1}\,\bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

Here $c(w_i^{n})$ is the count and $c(w_i^{n-1})$ the total count; $\max\{\cdot,\,0\}$ assures a positive value; $D$ is the discount value with $0 \le D \le 1$; $\frac{D}{c(w_i^{n-1})} N_{1+}(w_i^{n-1}\,\bullet)$ is the lower order weight; $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
Kneser-Ney smoothing equation
 Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D,\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \dfrac{D}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\, N_{1+}(w_i^{n-1}\,\bullet)\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

 Lowest order calculation:

$P_{\mathrm{KN}}(w_n) = \dfrac{N_{1+}(\bullet\, w_i^{n})}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}$

Here $N_{1+}(\bullet\, w_i^{n})$ is the continuation count and $N_{1+}(\bullet\, w_i^{n-1}\,\bullet)$ the total continuation count; $\max\{\cdot,\,0\}$ assures a positive value; $D$ is the discount value; the factor in the second term is the lower order weight; $P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$ is the lower order probability (recursion).
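Putting the two slides together, a rough sketch of the full Kneser-Ney recursion with a single discount D (the tables `counts`, `n1p_left`, `n1p_right`, `n1p_both` are hypothetical inputs for illustration, not the thesis implementation):

```python
def p_kn(word, context, counts, n1p_left, n1p_right, n1p_both, D, highest=True):
    """Kneser-Ney: absolute counts at the highest order, continuation counts below."""
    context = tuple(context)
    if not context:                                   # lowest order: unigram continuation prob
        return n1p_left.get((word,), 0) / max(n1p_both.get((), 0), 1)
    full = context + (word,)
    if highest:                                       # highest order uses raw counts
        denom = max(counts.get(context, 0), 1)        # guard against zero in this sketch
        num = max(counts.get(full, 0) - D, 0.0)
    else:                                             # lower orders use continuation counts
        denom = max(n1p_both.get(context, 0), 1)
        num = max(n1p_left.get(full, 0) - D, 0.0)
    weight = (D / denom) * n1p_right.get(context, 0)  # lower order weight
    return num / denom + weight * p_kn(word, context[1:], counts, n1p_left,
                                       n1p_right, n1p_both, D, highest=False)
```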
Modified Kneser-Ney smoothing [CG98]
 Different discount values for different absolute counts
 Lower order calculation:

$P_{\mathrm{KN}}(w_n \mid w_i^{n-1}) = \dfrac{\max\{N_{1+}(\bullet\, w_i^{n}) - D(c(w_i^{n})),\ 0\}}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)} + \dfrac{D_1 N_1(w_i^{n-1}\,\bullet) + D_2 N_2(w_i^{n-1}\,\bullet) + D_{3+} N_{3+}(w_i^{n-1}\,\bullet)}{N_{1+}(\bullet\, w_i^{n-1}\,\bullet)}\; P_{\mathrm{KN}}(w_n \mid w_{i+1}^{n-1})$

 State of the art (for 15 years now!)
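For context, [CG98] estimate the three discounts from the counts-of-counts n1..n4 (the number of n-grams seen exactly once, twice, three and four times). A sketch of those closed-form estimates, as I read them from the paper (not shown on the slide; assumes n1..n3 are nonzero):

```python
def modified_kn_discounts(ngram_counts):
    """D1, D2, D3+ from counts-of-counts, following the closed-form estimates in [CG98]."""
    n = [0] * 5                                   # n[c] = number of n-grams seen exactly c times
    for count in ngram_counts.values():
        if 1 <= count <= 4:
            n[count] += 1
    Y = n[1] / (n[1] + 2 * n[2])
    D1 = 1 - 2 * Y * n[2] / n[1]
    D2 = 2 - 3 * Y * n[3] / n[2]
    D3_plus = 3 - 4 * Y * n[4] / n[3]
    return D1, D2, D3_plus
```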
Smoothing of GLMs
 We can use all smoothing techniques on GLMs as well!
 Small modification for the lower order sequence:
E.g., for P(San | If ∗ going ∗):
– Normally the lower order sequence would be P(San | ∗ going ∗)
– Instead use P(San | going ∗) (see the sketch below)
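A tiny sketch of that modification (my own illustration): after dropping the first context word for the lower-order model, also drop any wildcards that end up at the front.

```python
def glm_lower_order(context, wildcard="*"):
    """Lower-order context for a generalized sequence, skipping leading wildcards."""
    shorter = list(context[1:])            # usual lower-order step: drop the first word
    while shorter and shorter[0] == wildcard:
        shorter.pop(0)                     # drop wildcards now dangling at the front
    return tuple(shorter)

# glm_lower_order(("If", "*", "going", "*")) == ("going", "*")
```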
Progress
 Done so far:
 Extract text from XML files
 Build GLMs
 Kneser-Ney and modified Kneser-Ney smoothing
 Indexing with MySQL
 To do:
 Finish evaluation program
 Run evaluation
 Analyze results
Summary
Data Sets: • More data • Better data
Language Models: • n-grams • Generalized Language Models
Smoothing: • Katz • Good-Turing • Witten-Bell • Kneser-Ney • …
Thank you for your attention!
Questions?
Sources
 Images:
 Wheelchair Joystick (Slide 4):
http://i01.i.aliimg.com/img/pb/741/422/527/527422741_355.jpg
 Smartphone Keyboard (Slide 4):
https://activecaptain.com/articles/mobilePhones/iPhone/iPhone_Keyboard.jpg
 References:
 [CG98]: Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August 1998.
 [JM80]: Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, 1980.
 [KN95]: Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of ICASSP-95, volume 1, pages 181–184. IEEE, 1995.