2. Outline
A. Purpose of the paper
B. Background
C. Smoothing Techniques
• Interpolated Kneser-Ney
D. Pitman-Yor Process
E. Hierarchical Pitman-Yor Language Models
F. Experimental Results
G. Discussion
3. Motivation
Related to the authors' previously proposed nonparametric Bayesian model based on the
Pitman-Yor process
• Interpolated Kneser-Ney is one of the best smoothing techniques for n-gram LMs
• The authors previously proposed a hierarchical Bayesian LM based on the Pitman-Yor
process
• The paper attempts to show that this model recovers exactly the formulation of
interpolated KN
5. The issue at hand
N-gram models
● Assuming a 3-gram model with a vocabulary of ~50,000 words, we need 50,000³ ≈ 10¹⁴
parameters
● Maximum Likelihood Estimation (MLE)
○ For any unseen data, MLE will give a probability of 0
○ This results in overfitting
MLE in a trigram model: P(wi | wi-2, wi-1) = c(wi-2 wi-1 wi) / c(wi-2 wi-1)
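A minimal sketch (not from the paper; the toy corpus is made up) of how MLE trigram estimates assign zero probability to anything unseen:

```python
# Maximum-likelihood trigram estimates from raw counts,
# illustrating the zero-probability problem for unseen data.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    """P_ML(w3 | w1, w2) = c(w1 w2 w3) / c(w1 w2); zero if the trigram is unseen."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("the", "cat", "sat"))    # seen trigram  -> 0.5
print(p_mle("the", "cat", "slept"))  # unseen trigram -> 0.0
```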
6. Overfitting Intuition: from an ML perspective
● A large number of parameters acts like a high-order polynomial fit: it does well on
training data and very poorly on test data
● We need a technique to lower the variance of the learning algorithm
● Smoothing creates an approximate function that captures important patterns
and leaves out noise or other fine-scale structures
8. Absolute-discounting interpolation - Intuition
Idea: Use a linear interpolation of lower-order and higher-order n-grams to
relocate probability mass
• When the higher-order model matches strongly, the second lower-order term
has little weight
• Otherwise, the first term (the discounted relative bigram count) is near 0 and the
lower-order model dominates
Bigram example: Pabs(wi | wi-1) = max(c(wi-1 wi) − δ, 0) / c(wi-1) + α Pabs(wi)
δ: a fixed discount value
α: a normalizing constant
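A minimal sketch of absolute-discounting interpolation for a bigram model. The corpus and the discount δ = 0.75 are made up, and α(u) follows the standard choice δ · N1+(u·) / c(u):

```python
# Absolute-discounting interpolation:
# P(w | u) = max(c(u w) - delta, 0) / c(u) + alpha(u) * P_unigram(w)
from collections import Counter

corpus = "san francisco is in california san francisco is foggy".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_words = sum(unigrams.values())
delta = 0.75  # fixed discount

def p_unigram(w):
    return unigrams[w] / total_words

def p_abs(w, u):
    c_u = unigrams[u]
    # alpha(u) redistributes the discounted mass over the lower-order model
    n_types_after_u = sum(1 for (a, _b) in bigrams if a == u)
    alpha = delta * n_types_after_u / c_u
    return max(bigrams[(u, w)] - delta, 0) / c_u + alpha * p_unigram(w)

print(p_abs("francisco", "san"))    # strong bigram evidence dominates
print(p_abs("california", "san"))   # falls back on the unigram term
```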
9. Absolute-discounting interpolation - Problem
A common example
● Suppose the phrase San Francisco shows up many times in the training corpus
○ P(Francisco) will likely be high
● Let’s predict: I can’t see without my reading ___.
○ We might have seen a lot more San Francisco than glasses
○ Pabs(Francisco) > Pabs(glasses)
• Using absolute-discounting interpolation, c(reading Francisco) will be low, so the
second term, driven by the high unigram probability of Francisco, will dominate. We
might end up predicting the blank to be Francisco!
10. Interpolated Kneser-Ney
Makes use of the absolute-discounting model, with some tweaks
● Asks a slightly harder question of the lower-order model:
○ Continuing with our bigram example from before, we ask how likely a word wi is to
appear in an unfamiliar bigram context
○ Instead of simply asking how likely the word wi is to appear
Pcontinuation(wi) = (# of bigram types wi completes) / (# of all bigram types)
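A minimal sketch of the Kneser-Ney continuation probability on a made-up corpus (the word choices are only illustrative):

```python
# Continuation probability:
# P_cont(w) = |{v : c(v w) > 0}| / |{(v, w') : c(v w') > 0}|
# i.e. the fraction of distinct bigram types that w completes.
from collections import Counter

corpus = ("san francisco is big san francisco is foggy "
          "i lost my glasses she wears new glasses reading glasses help").split()
bigrams = set(zip(corpus, corpus[1:]))  # bigram *types*

def p_continuation(w):
    completes = sum(1 for (_v, w2) in bigrams if w2 == w)
    return completes / len(bigrams)

# "francisco" is frequent but only ever follows "san", so its continuation
# probability stays low; "glasses" follows several different words.
print(p_continuation("francisco"))
print(p_continuation("glasses"))
```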
11. Variant: Modified Kneser-Ney
Use different discount values for different counts
● One discount value each for count = 1, 2, …, cmax − 1
● Another discount for count ≥ cmax
● Works slightly better than interpolated KN
● More expensive and more complex because of the extra parameters that must be
estimated (see the sketch below)
○ Goodman (1998) uses cmax = 3 as a compromise
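A minimal sketch of the discount lookup used by modified KN; the discount values here are illustrative, not estimated from count-of-count statistics as done in practice:

```python
# Modified Kneser-Ney keeps a separate discount for each count 1 .. c_max - 1
# and one more discount for counts >= c_max (here c_max = 3, Goodman's compromise).
def discount(count, discounts=(0.5, 1.0, 1.5), c_max=3):
    """Return the discount applied to an n-gram with this count (0 if unseen)."""
    if count == 0:
        return 0.0
    if count >= c_max:
        return discounts[-1]          # shared discount for all large counts
    return discounts[count - 1]       # D_1 .. D_{c_max - 1}

print([discount(c) for c in range(6)])  # [0.0, 0.5, 1.0, 1.5, 1.5, 1.5]
```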
12. Pitman-Yor Process - A high-level view
Chinese Restaurant Process
● A random process in which n customers sit down in a Chinese restaurant with an
infinite number of tables
○ First customer sits at the first table
○ The m-th subsequent customer sits at a table drawn from the following distribution:
given Fm-1, table i is chosen with probability ni / (m − 1 + α), and a new table with
probability α / (m − 1 + α)
■ ni is the number of customers currently at table i
■ Fm-1 denotes the state of the restaurant after m − 1 customers have been seated
■ α denotes the concentration parameter (inverse to the degree of discretization)
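A minimal sketch of the CRP seating rule just described, with an assumed concentration α = 1.0:

```python
# One step of the Chinese Restaurant Process: the m-th customer joins table i with
# probability n_i / (m - 1 + alpha) and opens a new table with alpha / (m - 1 + alpha).
import random

def crp_seat(table_counts, alpha, rng=random):
    """Return the chosen table index; len(table_counts) means 'open a new table'."""
    weights = table_counts + [alpha]          # existing tables, then a new table
    total = sum(table_counts) + alpha
    r = rng.uniform(0, total)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(table_counts)                  # numerical edge case

# Simulate 100 customers.
random.seed(0)
tables = []
for _ in range(100):
    i = crp_seat(tables, alpha=1.0)
    if i == len(tables):
        tables.append(1)
    else:
        tables[i] += 1
print(tables)  # a few large tables plus several small ones
```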
14. Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● Customers ~ Data points
● Tables ~ Clusters
● Exchangeability
○ The distribution of a draw xi is invariant to the order of the preceding sequence
(x1, x2, …, xi-1)
● Nonparametric Bayes
○ The number of occupied tables grows (roughly) as O(log n)
○ “Non-parametric”
■ the number of parameters grows with data size
○ Imagine doing a k-means clustering except k is not given
○ “Bayes”
■ More detail later
15. Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● This is the Dirichlet Process
● The probability of discovering a new cluster is always proportional to the same
constant α, no matter how large the data grows
● Its tail behavior does not follow a power-law distribution
○ Pitman-Yor improves on this: an existing table i is chosen with probability
proportional to ni − d, and a new table with probability proportional to α + d·t
■ d: discount parameter
■ t: # of occupied tables
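A minimal sketch contrasting the Dirichlet Process (d = 0) with Pitman-Yor (d > 0): with a positive discount the number of occupied tables grows much faster than the DP's logarithmic growth. The parameter values are assumptions:

```python
# Pitman-Yor seating rule: table i gets weight (n_i - d), a new table gets (alpha + d*t).
import random

def py_num_tables(n_customers, alpha, d, seed=0):
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        t = len(tables)
        weights = [n_i - d for n_i in tables] + [alpha + d * t]
        r = rng.uniform(0, sum(weights))
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if i == t:
            tables.append(1)   # open a new table
        else:
            tables[i] += 1
    return len(tables)

print(py_num_tables(10000, alpha=1.0, d=0.0))   # Dirichlet Process: roughly log n tables
print(py_num_tables(10000, alpha=1.0, d=0.5))   # Pitman-Yor: far more tables (heavier tail)
```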
17. Who scores most? Messi vs. Ronaldo
● Scenario 1: ngoals,2018(Messi) = 15, ngoals,2018(Ronaldo) = 20
○ Probability that X scores most in 2018: P2018(X) ∝ ngoals,2018(X)
● Scenario 2: In addition, shot accuracy sa2018(Messi) = 0.8, sa2018(Ronaldo) = 0.5
○ Probability that X scores most in 2018: P2018(X) ∝ ngoals,2018(X) · sa2018(X)
● Scenario 3: But what if san(X) ∝ Pn-1(X)?
Scenario 1 -> Frequentist Model
Scenario 2 -> Bayesian Model
Scenario 3 -> Hierarchical Bayesian Model
18. Hierarchical Bayesian Models
● A statistical model written in multiple levels (hierarchical form), whose parameters
are estimated from their posterior distribution using Bayesian methods
● The sub-models combine to form the hierarchical model
● Bayes' theorem is used to integrate them with the observed data
19. Bayesian Language Models
● Let D be the data, Q the parameters (the n-gram probabilities here), and H the model
assumptions
● In maximum-likelihood models, we find the Q that maximizes P(D | Q, H)
● Bayes' theorem instead gives the posterior probability of the parameters Q:
P(Q | D, H) ∝ P(D | Q, H) · P(Q | H)
Posterior ∝ Likelihood × Prior
● Since the likelihood is multinomial, the conjugate prior is a Dirichlet distribution
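A minimal sketch (toy counts, assumed β = 0.5) of the Dirichlet-multinomial posterior predictive compared with the MLE; the Dirichlet prior keeps unseen words from getting probability 0:

```python
# With a multinomial likelihood and a symmetric Dirichlet(beta) prior, the posterior
# predictive probability of word w is (c_w + beta) / (N + beta * V), versus MLE c_w / N.
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # "dog" is unseen
counts = Counter(corpus)
N, V, beta = len(corpus), len(vocab), 0.5

for w in vocab:
    mle = counts[w] / N
    bayes = (counts[w] + beta) / (N + beta * V)
    print(f"{w:>4s}  MLE={mle:.3f}  posterior-predictive={bayes:.3f}")
```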
20. Dirichlet, Hierarchical Dirichlet, Pitman-Yor
● A Dirichlet Process (DP) is a ‘distribution over distributions’
● Draws from a DP are random distributions with certain properties
○ Its ‘base distribution’ parameter plays a role analogous to the mean of a Normal
distribution
○ The DP is non-parametric
● Hierarchical Dirichlet Process : Similar to Hierarchical Models
21. Pitman Yor Process
● It is also a distribution over distributions.
● But a draw G1 = PY(d, θ, G0) from the process is an infinite discrete probability
distribution, consisting of
○ an infinite set of atoms drawn from G0
○ weights drawn from a two-parameter Poisson-Dirichlet distribution
■ a probability distribution on the set of decreasing positive sequences that
sum to 1
● A draw from G1 looks like G1 = Σk πk δφk, with atoms φk ~ G0 and weights
(π1, π2, …) from the two-parameter Poisson-Dirichlet distribution
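A minimal sketch of an approximate (truncated) draw from G1 = PY(d, θ, G0) via stick-breaking; the toy vocabulary standing in for G0 and the parameter values are assumptions:

```python
# Stick-breaking draw: weights pi_k = V_k * prod_{j<k}(1 - V_j) with
# V_k ~ Beta(1 - d, theta + k*d), and atoms phi_k drawn i.i.d. from G0 (uniform here).
import random

def draw_py(d, theta, vocab, n_atoms=1000, seed=0):
    rng = random.Random(seed)
    remaining, weights, atoms = 1.0, [], []
    for k in range(1, n_atoms + 1):
        v = rng.betavariate(1 - d, theta + k * d)
        weights.append(remaining * v)
        atoms.append(rng.choice(vocab))     # atom phi_k ~ G0
        remaining *= 1 - v
    return atoms, weights                   # truncated draw; weights sum to ~1

atoms, weights = draw_py(d=0.5, theta=1.0, vocab=["a", "b", "c", "d", "e"])
print(sum(weights))                                   # close to 1
print(atoms[:5], [round(w, 3) for w in weights[:5]])  # first few atoms and weights
```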
22. Chinese Restaurant Process
● Let x1, x2, … be drawn from G1. The conditional distribution of the next draw after a
sequence of c draws is
P(xc+1 | x1, …, xc) = Σk (ck − d) / (θ + c) · δφk + (θ + d·t) / (θ + c) · G0
○ ck is the number of customers sitting at table k
○ θ is the strength parameter
○ t is the number of occupied tables
○ δφk is the point mass located at φk
● De Finetti's theorem
○ Exchangeability of the sequence implies that there must be a distribution over
distributions G1 such that x1, x2, … are conditionally independent and identically
distributed draws from G1
● The Pitman-Yor process is one such distribution over G1
23. Pitman Yor Process in Word Distributions
● Consider
○ the Pitman-Yor process as a prior for unigram word distributions
○ base distribution G0 = uniform over the vocabulary
● We model the desired unigram distribution over words as a draw G1 from the
Pitman-Yor process
24. Analogy with CRP
● For each of the cw occurrences of a word w ∈ W, a customer is eating dish w in the
Chinese restaurant representation
● But the CRP has infinitely many tables, each with infinite capacity, so the same dish
may be served at several tables
○ Let tw be the number of tables serving dish w
● The predictive probability of a new word w given the seating arrangement is
P(w | seating) = (cw − d·tw) / (θ + c·) + (θ + d·t·) / (θ + c·) · G0(w)
where c· and t· are the total customer and table counts
● With tw = 1, the first term is exactly an absolute discount of the count cw
● The additive term can be understood as interpolation with the uniform distribution
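A minimal sketch of this unigram predictive probability; the function name, counts, and parameter values are all made up for illustration:

```python
# Predictive probability of word w under the unigram Pitman-Yor model, interpolating
# the discounted count with the uniform base distribution G0.
def p_word(c_w, t_w, c_total, t_total, d, theta, vocab_size):
    """(c_w - d*t_w)/(theta + c_total) + (theta + d*t_total)/(theta + c_total) * 1/V."""
    g0 = 1.0 / vocab_size
    return ((c_w - d * t_w) / (theta + c_total)
            + (theta + d * t_total) / (theta + c_total) * g0)

# With t_w = 1, the first term is an absolute discount of the raw count, as on the slide.
print(p_word(c_w=5, t_w=1, c_total=100, t_total=30, d=0.5, theta=1.0, vocab_size=1000))
```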
25. Language Modelling using Hierarchical PY
● Given
○ A context u consisting of a sequence of up to n − 1 words
● Let Gu(w) be the distribution over the current word w
● Use a Pitman-Yor process as the prior for Gu(w): Gu ~ PY(d|u|, θ|u|, Gπ(u))
● π(u) is the suffix of u consisting of all but the first word (the back-off context)
26. Similarities with Interpolated KN
● The predictive probability of the next word w after context u, given the seating
arrangement, is
P(w | u) = (cuw − d|u|·tuw) / (θ|u| + cu·) + (θ|u| + d|u|·tu·) / (θ|u| + cu·) · P(w | π(u))
● Suppose that tuw = min(1, cuw), i.e. each word type occupies at most one table in
every restaurant
● Surprise! The equation directly reduces to the predictive probabilities given by
interpolated Kneser-Ney
● The strength and discount parameters depend on the length of the context
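A minimal sketch (toy counts, hypothetical function name) of the recursive predictive rule with tuw = min(1, cuw); setting θ = 0 gives exactly the interpolated Kneser-Ney form. In the full model the lower-order counts arise as table counts passed down from longer contexts; here they are simply given:

```python
def predictive(w, u, counts, d, theta, vocab_size):
    """P(w | u) with back-off to the suffix of u; bottoms out at the uniform base G0."""
    ctx = counts.get(u, {})
    c_uw = ctx.get(w, 0)
    c_u = sum(ctx.values())
    t_uw = min(1, c_uw)                           # simplifying assumption
    t_u = sum(min(1, c) for c in ctx.values())
    # lower-order (back-off) distribution: suffix of u, or uniform base at the root
    lower = (1.0 / vocab_size if len(u) == 0
             else predictive(w, u[1:], counts, d, theta, vocab_size))
    if c_u == 0:
        return lower
    return (max(c_uw - d * t_uw, 0.0) / (theta + c_u)
            + (theta + d * t_u) / (theta + c_u) * lower)

# Hypothetical bigram and unigram counts.
counts = {
    ("san",): {"francisco": 20},
    ("reading",): {"glasses": 2, "books": 3},
    (): {"francisco": 20, "glasses": 2, "books": 3, "san": 20, "reading": 5},
}
# With theta = 0 this matches the interpolated Kneser-Ney formula.
print(predictive("glasses", ("reading",), counts, d=0.75, theta=0.0, vocab_size=10))
print(predictive("francisco", ("reading",), counts, d=0.75, theta=0.0, vocab_size=10))
```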
28. Details
● Data
○ 16 million word corpus from APNews
○ 1 million words from WSJ dataset
● n-gram order
○ trigram for the APNews data
○ bigram for the WSJ data
● Models
○ HPY
○ M-KN (cmax = 2, 3)
○ I-KN
29. Experiment 1
● IKN is just an approximate inference scheme in the HPY language model