A Bayesian Interpretation of Interpolated
Kneser-Ney
CS 249
Elaine Lin, Satya Reddy
Outline
A. Purpose of the paper
B. Background
C. Smoothing Techniques
• Interpolated Kneser-Ney
D. Pitman-Yor Process
E. Hierarchical Pitman-Yor Language Models
F. Experimental Results
G. Discussion
Motivation
Related to the authors' previously proposed nonparametric Bayesian model (NP BM) with the
Pitman-Yor process
• Interpolated Kneser-Ney is one of the best smoothing
techniques for n-gram LM
• Authors previously proposed a hierarchical Bayesian LM on
Pitman-Yor process
• Attempt to show this model recovers exactly the formulation
of interpolated KN
Some background
The issue at hand
N-gram models
● Assuming a 3-gram model with an O(50000)-word vocabulary, we need 50000^3 ≈ 10^14
parameters
● Maximum Likelihood Estimation (MLE)
○ For any n-gram unseen in the training data, MLE gives a probability of 0
○ This results in overfitting
MLE in a trigram model
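For reference, the standard maximum-likelihood estimate this slide refers to is

P_MLE(w_i | w_{i−2}, w_{i−1}) = c(w_{i−2} w_{i−1} w_i) / c(w_{i−2} w_{i−1})

so any trigram with zero training count receives probability 0.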
Overfitting Intuition: from an ML perspective
● A large number of parameters behaves like a high-order polynomial fit: it does well on
training data but very poorly on test data
● We need a technique to lower the variance of the learning algorithm
● Smoothing creates an approximate function that captures important patterns
and leaves out noise or other fine-scale structures
Smoothing Techniques
Absolute-discounting interpolation - Intuition
Idea: Use a linear interpolation of lower-order and higher-order n-grams to
relocate probability mass
• When the higher-order model matches strongly, the second lower-order term
has little weight
• Otherwise, the first term (the discounted relative bigram count) is near 0 and the
lower-order model dominates
Bigram example
δ: a fixed discount value
α: a normalizing constant
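In its standard bigram form (a reconstruction; the slide's own equation is not reproduced here):

P_abs(w_i | w_{i−1}) = max(c(w_{i−1} w_i) − δ, 0) / c(w_{i−1}) + α(w_{i−1}) · P(w_i)

where c(·) are training counts and α(w_{i−1}) is chosen so the distribution sums to 1.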
Absolute-discounting interpolation - Problem
A common example
● Suppose the phrase San Francisco shows up many times in the training corpus
○ P(Francisco) will likely be high
● Let’s predict: I can’t see without my reading ___.
○ We might have seen a lot more San Francisco than glasses
○ P_abs(Francisco) > P_abs(glasses)
• Using absolute-discounting interpolation, c(reading Francisco) will be low, so the second
term, based on c(Francisco), will dominate. We might end up predicting the blank to be
Francisco!
Interpolated Kneser-Ney
Makes use of the absolute-discounting model, with some tweaks
● Asks a slightly harder question of the lower-order model:
○ Continuing with our bigram example from before, we ask how likely a word w_i is to appear
in an unfamiliar bigram context
○ Instead of simply asking how likely a word w_i is to appear
P_continuation(w_i) = (# of bigram types w_i completes) / (# of all bigram types)
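A minimal Python sketch (not from the slides) of this continuation probability, assuming bigram
counts are stored in a dict keyed by (previous word, word):

from collections import Counter

def continuation_prob(bigram_counts, word):
    # Fraction of distinct bigram types that `word` completes.
    completes = sum(1 for (prev, w) in bigram_counts if w == word)
    return completes / len(bigram_counts)

# Toy counts: "francisco" completes 1 of 3 bigram types, "glasses" completes 2 of 3
counts = Counter({("san", "francisco"): 5, ("reading", "glasses"): 1, ("my", "glasses"): 1})
print(continuation_prob(counts, "francisco"))  # 0.333...
print(continuation_prob(counts, "glasses"))    # 0.666...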
Variant: Modified Kneser-Ney
Use different discount values for different counts
● One discount value each for count = 1, 2, …, c_max − 1
● Another discount for count ≥ c_max
● Works slightly better than interpolated KN
● Much more expensive, with extra complexity from the additional parameters to search over
○ Goodman (1998) uses c_max = 3 as a compromise
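As a sketch of the discounting rule this implies (assuming c_max = 3, as above), the discount
applied to a count c is

D(c) = 0 if c = 0;  D_1 if c = 1;  D_2 if c = 2;  D_3+ if c ≥ 3

with D_1, D_2, D_3+ estimated separately (typically from counts-of-counts).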
Pitman-Yor Process - A high-level view
Chinese Restaurant Process
● A random process in which n customers sit down in a Chinese restaurant with an
infinite number of tables
○ The first customer sits at the first table
○ The m-th subsequent customer sits at a table drawn from the distribution given below
■ n_i is the number of customers currently at table i
■ F_{m−1} denotes the state of the restaurant after m − 1 customers have been seated
■ α denotes the concentration parameter (inverse to the degree of discretization)
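The standard CRP seating rule, written in this notation:

P(m-th customer sits at occupied table i | F_{m−1}) = n_i / (m − 1 + α)
P(m-th customer starts a new table | F_{m−1}) = α / (m − 1 + α)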
Pitman-Yor Process - A high-level view
Chinese Restaurant Process
Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● Customers ~ Data points
● Tables ~ Clusters
● Exchangeability
○ The distribution of a draw x_i is invariant to the order of the sequence (x_1, x_2, …, x_{i−1})
● Nonparametric Bayes
○ The number of occupied tables grows (roughly) as O(log n)
○ “Non-parametric”
■ the number of parameters grows with data size
○ Imagine doing a k-means clustering except k is not given
○ “Bayes”
■ More detail later
Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● This is the Dirichlet Process
● It assumes the same probability of discovering a new cluster as the data size grows to infinity
● This tail behavior does not follow a power-law distribution
○ Pitman-Yor improves on this: each occupied table's count is discounted by d (probability
∝ n_i − d), and the new-table probability gains d·t (∝ α + d·t)
d: discount parameter
t: # of occupied tables
Hierarchical Bayesian Models
Pitman-Yor Process
Who scores most? Messi vs. Ronaldo
● Scenario 1: n_goals,2018(Messi) = 15, n_goals,2018(Ronaldo) = 20
○ Probability that X scores most in 2018: P_2018(X) ∝ n_goals,2018(X)
● Scenario 2: In addition, shot accuracy sa_2018(Messi) = 0.8, sa_2018(Ronaldo) = 0.5
○ Probability that X scores most in 2018: P_2018(X) ∝ n_goals,2018(X) · sa_2018(X)
● Scenario 3: But what if sa_n(X) ∝ P_{n−1}(X)?
○ ?
Scenario 1 -> Frequentist Model
Scenario 2 -> Bayesian Model
Scenario 3 -> Hierarchical Bayesian Model
Hierarchical Bayesian Models
● A statistical model written in multiple levels (hierarchical form), whose parameters are
estimated from the posterior distribution using Bayesian methods.
● The sub-models combine to form the hierarchical model.
● Bayes' theorem is used to integrate them with the observed data.
Bayesian Language Models
● Let D be the data, Q the parameters (n-gram probabilities here), and H some distribution.
● In maximum-likelihood models, we find the Q that maximizes P(D | Q, H).
● But Bayes' theorem gives us the posterior probability of the parameters Q as
P(Q | D, H) ∝ P(D | Q, H) · P(Q | H)
Posterior ∝ Likelihood · Prior
● Since the likelihood is multinomial, the prior will be Dirichlet, since it is conjugate
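To illustrate why conjugacy helps (a standard result, not from the slides): with a symmetric
Dirichlet(β) prior over a vocabulary W, the posterior-predictive probability of word w is

P(w | D, H) = (c_w + β) / (Σ_{w'} c_{w'} + β·|W|)

so unseen words get pseudo-count β instead of probability 0.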
Dirichlet, Hierarchical Dirichlet, Pitman-Yor
● A Dirichlet process (DP) is a ‘distribution over distributions’
● Draws from a DP are random distributions with certain properties.
○ Its ‘base distribution’ parameter plays a role like the mean of a Normal distribution
○ The DP is non-parametric
● Hierarchical Dirichlet Process: similar to the hierarchical models above
Pitman-Yor Process
● It is also a distribution over distributions.
● But sample draws from the process G_1 ~ PY(d, θ, G_0) give an infinite discrete probability
distribution, consisting of
○ An infinite set of atoms drawn from G_0.
○ Weights drawn from a two-parameter Poisson–Dirichlet distribution.
■ This is a probability distribution on the set of decreasing positive sequences with sum 1.
● A draw from G_1 looks like the sketch below
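A hedged reconstruction of such a draw (writing δ_φ for a point mass at φ):

G_1 = Σ_{k=1}^{∞} π_k δ_{φ_k}, with atoms φ_k ~ G_0 i.i.d. and weights (π_1, π_2, …) drawn from
the two-parameter Poisson–Dirichlet distribution PD(d, θ).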
Chinese Restaurant Process
● With x_1, x_2, … drawn from G_1, the conditional distribution of the next draw after a
sequence of c draws is the Pitman-Yor CRP rule:

x_{c+1} | x_1, …, x_c  ~  Σ_{k=1}^{t} ((c_k − d) / (θ + c)) δ_{y_k}  +  ((θ + d·t) / (θ + c)) G_0

c_k is the number of customers sitting at table k
θ is the strength parameter
t is the number of occupied tables
δ_{y_k} is the point mass located at y_k, the dish served at table k
● De Finetti's theorem
○ Exchangeability of the sequence implies that there must be a distribution over distributions
G_1 such that x_1, x_2, … are conditionally independent and identical draws from G_1.
● The Pitman-Yor process is one such distribution over G_1.
Pitman-Yor Process in Word Distributions
● Consider
○ A Pitman-Yor process as the prior for unigram word distributions.
○ Base distribution G_0 = uniform over the vocabulary.
● We model the desired unigram distribution over words as a draw G_1 from the Pitman-Yor
process
Analogy with CRP
● c_w occurrences of word w ∈ W : c_w customers are eating dish w in the Chinese restaurant
representation
● But the CRP has infinite tables and infinite capacity :/
○ Let t_w be the number of tables serving dish w.
● The predictive probability of a new word given the seating arrangement (see the sketch below)
● t_w = 1? Absolute discount
● The additive term can be understood as interpolation with the uniform distribution.
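A hedged reconstruction of that predictive probability, following the Pitman-Yor CRP rule with
c· = Σ_w c_w and t· = Σ_w t_w:

P(w | seating) = (c_w − d·t_w) / (θ + c·) + ((θ + d·t·) / (θ + c·)) · G_0(w)

With t_w = 1, the first term discounts c_w by exactly d (an absolute discount); the second,
additive term interpolates with the uniform base distribution G_0.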
Language Modelling using Hierarchical PY
● Given
○ A context u consisting of a sequence of up to n − 1 words
● Let G_u(w) be the distribution over the current word w.
● Use a Pitman-Yor process as the prior for G_u(w), as sketched below.
● π(u) is the suffix of u consisting of all but the first word.
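A sketch of the hierarchy this defines (writing |u| for the length of context u):

G_u | d_{|u|}, θ_{|u|}, G_{π(u)}  ~  PY(d_{|u|}, θ_{|u|}, G_{π(u)})

applied recursively down to the empty context ∅, where G_∅ ~ PY(d_0, θ_0, G_0) and G_0 is
uniform over the vocabulary.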
Similarities with Interpolated KN
● The predictive probability of the next word after context u given the seating arrangement is
the analogue of the unigram formula above, with G_0(w) replaced by P(w | π(u)) (see the sketch
below)
● Suppose that t_uw = min(1, c_uw), i.e., each word type occupies at most one table in each
context
● Surprise! The equation then reduces directly to the predictive probabilities given by
interpolated Kneser-Ney
● Strength and discount parameters depend on the length of the context.
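A hedged sketch of the reduction, with per-context counts c_uw, table counts t_uw,
c_u· = Σ_w c_uw and t_u· = Σ_w t_uw; under the restriction above the predictive probability
becomes

P(w | u) = (c_uw − d_{|u|} · min(1, c_uw)) / (θ_{|u|} + c_u·)
         + ((θ_{|u|} + d_{|u|} · t_u·) / (θ_{|u|} + c_u·)) · P(w | π(u))

and with the strength parameters θ set to 0 this is exactly the textbook interpolated
Kneser-Ney formula, where t_u· counts the distinct word types observed after u.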
Experimental Results
Details
● Data
○ 16-million-word corpus from APNews
○ 1 million words from the WSJ dataset
● N-gram order
○ Trigram for the APNews data
○ Bigram for the WSJ data
● Models
○ HPY
○ M-KN (c_max = 2, 3)
○ I-KN
Experiment 1
● IKN is just an approximate inference scheme in the HPY language model
Experiment 2
● The average discount grows as a power law for HPY
Experiment 3
● The model is sensitive only to d_1 but is insensitive to d_0, θ_0 and θ_1
Discussion
Thank You