2. Outline
A. Purpose of the paper
B. Background
C. Smoothing Techniques
• Interpolated Kneser-Ney
D. Pitman-Yor Process
E. Hierarchical Pitman-Yor Language Models
F. Experimental Results
G. Discussion
3. Motivation
Related to the authors' previously proposed nonparametric Bayesian model based on the
Pitman-Yor process
• Interpolated Kneser-Ney is one of the best smoothing techniques for n-gram LMs
• The authors previously proposed a hierarchical Bayesian LM based on the Pitman-Yor
process
• The paper attempts to show that this model recovers exactly the formulation of
interpolated KN
5. The issue at hand
N-gram models
● Assuming a 3-gram model with a vocabulary of ~50,000 words, we need 50,000³ ≈ 10¹⁴
parameters
● Maximum Likelihood Estimation (MLE)
○ For any unseen data, MLE will give a probability of 0
○ This results in overfitting
MLE in a trigram model: P(wi | wi-2, wi-1) = c(wi-2 wi-1 wi) / c(wi-2 wi-1)
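A minimal sketch (not from the paper; the toy corpus is made up) of how MLE trigram estimates assign zero probability to anything unseen:

```python
# Maximum-likelihood trigram estimates from raw counts,
# illustrating the zero-probability problem for unseen data.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w1, w2, w3):
    """P_ML(w3 | w1, w2) = c(w1 w2 w3) / c(w1 w2); zero if the trigram is unseen."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(p_mle("the", "cat", "sat"))    # seen trigram  -> 0.5
print(p_mle("the", "cat", "slept"))  # unseen trigram -> 0.0
```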
6. Overfitting Intuition: from an ML perspective
● A large number of parameters acts like a high-order polynomial fit: it does well on
training data and very poorly on test data
● We need a technique to lower the variance of the learning algorithm
● Smoothing creates an approximate function that captures important patterns
and leaves out noise or other fine-scale structures
8. Absolute-discounting interpolation - Intuition
Idea: Use a linear interpolation of lower-order and higher-order n-grams to
relocate probability mass
• When the higher-order model matches strongly, the second lower-order term
has little weight
• Otherwise, the first term (the discounted relative bigram count) is near 0 and the
lower-order model dominates
Bigram example: Pabs(wi | wi-1) = max(c(wi-1 wi) − δ, 0) / c(wi-1) + α Pabs(wi)
δ: a fixed discount value
α: a normalizing constant
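A minimal sketch of absolute-discounting interpolation for a bigram model. The corpus and the discount δ = 0.75 are made up, and α(u) follows the standard choice δ · N1+(u·) / c(u):

```python
# Absolute-discounting interpolation:
# P(w | u) = max(c(u w) - delta, 0) / c(u) + alpha(u) * P_unigram(w)
from collections import Counter

corpus = "san francisco is in california san francisco is foggy".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_words = sum(unigrams.values())
delta = 0.75  # fixed discount

def p_unigram(w):
    return unigrams[w] / total_words

def p_abs(w, u):
    c_u = unigrams[u]
    # alpha(u) redistributes the discounted mass over the lower-order model
    n_types_after_u = sum(1 for (a, _b) in bigrams if a == u)
    alpha = delta * n_types_after_u / c_u
    return max(bigrams[(u, w)] - delta, 0) / c_u + alpha * p_unigram(w)

print(p_abs("francisco", "san"))    # strong bigram evidence dominates
print(p_abs("california", "san"))   # falls back on the unigram term
```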
9. Absolute-discounting interpolation - Problem
A common example
● Suppose the phrase San Francisco shows up many times in the training corpus
○ P(Francisco) will likely be high
● Let’s predict: I can’t see without my reading ___.
○ We might have seen a lot more San Francisco than glasses
○ Pabs(Francisco) > Pabs(glasses)
• Using absolute-discounting interpolation, c(reading Francisco) will be low, so the
second term, driven by the high unigram probability of Francisco, will dominate. We
might end up predicting the blank to be Francisco!
10. Interpolated Kneser-Ney
Makes use of the absolute-discounting model, with some tweaks
● Asks a slightly harder question of the lower-order model:
○ Continuing with our bigram example from before, we ask how likely a word wi is to
appear in an unfamiliar bigram context
○ Instead of simply asking how likely the word wi is to appear
Pcontinuation(wi) = (# of bigram types wi completes) / (# of all bigram types)
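A minimal sketch of the Kneser-Ney continuation probability on a made-up corpus (the word choices are only illustrative):

```python
# Continuation probability:
# P_cont(w) = |{v : c(v w) > 0}| / |{(v, w') : c(v w') > 0}|
# i.e. the fraction of distinct bigram types that w completes.
from collections import Counter

corpus = ("san francisco is big san francisco is foggy "
          "i lost my glasses she wears new glasses reading glasses help").split()
bigrams = set(zip(corpus, corpus[1:]))  # bigram *types*

def p_continuation(w):
    completes = sum(1 for (_v, w2) in bigrams if w2 == w)
    return completes / len(bigrams)

# "francisco" is frequent but only ever follows "san", so its continuation
# probability stays low; "glasses" follows several different words.
print(p_continuation("francisco"))
print(p_continuation("glasses"))
```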
11. Variant: Modified Kneser-Ney
Use different discount values for different counts
● One discount value each for count = 1, 2, …, cmax − 1
● Another discount for count ≥ cmax
● Works slightly better than interpolated KN
● More expensive and more complex because of the extra parameters that must be
estimated (see the sketch below)
○ Goodman (1998) uses cmax = 3 as a compromise
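A minimal sketch of the discount lookup used by modified KN; the discount values here are illustrative, not estimated from count-of-count statistics as done in practice:

```python
# Modified Kneser-Ney keeps a separate discount for each count 1 .. c_max - 1
# and one more discount for counts >= c_max (here c_max = 3, Goodman's compromise).
def discount(count, discounts=(0.5, 1.0, 1.5), c_max=3):
    """Return the discount applied to an n-gram with this count (0 if unseen)."""
    if count == 0:
        return 0.0
    if count >= c_max:
        return discounts[-1]          # shared discount for all large counts
    return discounts[count - 1]       # D_1 .. D_{c_max - 1}

print([discount(c) for c in range(6)])  # [0.0, 0.5, 1.0, 1.5, 1.5, 1.5]
```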
12. Pitman-Yor Process - A high-level view
Chinese Restaurant Process
● A random process in which n customers sit down in a Chinese restaurant with an
infinite number of tables
○ First customer sits at the first table
○ The m-th subsequent customer sits at a table drawn from the following distribution:
given Fm-1, table i is chosen with probability ni / (m − 1 + α), and a new table with
probability α / (m − 1 + α)
■ ni is the number of customers currently at table i
■ Fm-1 denotes the state of the restaurant after m − 1 customers have been seated
■ α denotes the concentration parameter (inverse to the degree of discretization)
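A minimal sketch of the CRP seating rule just described, with an assumed concentration α = 1.0:

```python
# One step of the Chinese Restaurant Process: the m-th customer joins table i with
# probability n_i / (m - 1 + alpha) and opens a new table with alpha / (m - 1 + alpha).
import random

def crp_seat(table_counts, alpha, rng=random):
    """Return the chosen table index; len(table_counts) means 'open a new table'."""
    weights = table_counts + [alpha]          # existing tables, then a new table
    total = sum(table_counts) + alpha
    r = rng.uniform(0, total)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(table_counts)                  # numerical edge case

# Simulate 100 customers.
random.seed(0)
tables = []
for _ in range(100):
    i = crp_seat(tables, alpha=1.0)
    if i == len(tables):
        tables.append(1)
    else:
        tables[i] += 1
print(tables)  # a few large tables plus several small ones
```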
14. Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● Customers ~ Data points
● Tables ~ Clusters
● Exchangeability
○ The distribution of a draw xi is invariant to the order of the preceding sequence
(x1, x2, …, xi-1)
● Nonparametric Bayes
○ The number of occupied tables grows (roughly) as O(log n)
○ “Non-parametric”
■ the number of parameters grows with data size
○ Imagine doing a k-means clustering except k is not given
○ “Bayes”
■ More detail later
15. Pitman-Yor Process - A high-level view
Chinese Restaurant Process (cont)
● This is the Dirichlet Process
● The probability of discovering a new cluster is always proportional to the same
constant α, no matter how large the data grows
● Its tail behavior does not follow a power-law distribution
○ Pitman-Yor improves on this: an existing table i is chosen with probability
proportional to ni − d, and a new table with probability proportional to α + d·t
■ d: discount parameter
■ t: # of occupied tables
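A minimal sketch contrasting the Dirichlet Process (d = 0) with Pitman-Yor (d > 0): with a positive discount the number of occupied tables grows much faster than the DP's logarithmic growth. The parameter values are assumptions:

```python
# Pitman-Yor seating rule: table i gets weight (n_i - d), a new table gets (alpha + d*t).
import random

def py_num_tables(n_customers, alpha, d, seed=0):
    rng = random.Random(seed)
    tables = []
    for _ in range(n_customers):
        t = len(tables)
        weights = [n_i - d for n_i in tables] + [alpha + d * t]
        r = rng.uniform(0, sum(weights))
        for i, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if i == t:
            tables.append(1)   # open a new table
        else:
            tables[i] += 1
    return len(tables)

print(py_num_tables(10000, alpha=1.0, d=0.0))   # Dirichlet Process: roughly log n tables
print(py_num_tables(10000, alpha=1.0, d=0.5))   # Pitman-Yor: far more tables (heavier tail)
```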
17. Who scores most? Messi vs. Ronaldo
● Scenario 1: ngoals,2018(Messi) = 15, ngoals,2018(Ronaldo) = 20
○ Probability that X scores most in 2018: P2018(X) ∝ ngoals,2018(X)
● Scenario 2: In addition, shot accuracy sa2018(Messi) = 0.8, sa2018(Ronaldo) = 0.5
○ Probability that X scores most in 2018: P2018(X) ∝ ngoals,2018(X) · sa2018(X)
● Scenario 3: But what if san(X) ∝ Pn-1(X)?
Scenario 1 -> Frequentist Model
Scenario 2 -> Bayesian Model
Scenario 3 -> Hierarchical Bayesian Model
18. Hierarchical Bayesian Models
● A statistical model written in multiple levels (hierarchical form), whose parameters
are estimated from their posterior distribution using Bayesian methods
● The sub-models combine to form the hierarchical model
● Bayes' theorem is used to integrate them with the observed data
19. Bayesian Language Models
● Let D be the data, Q the parameters (the n-gram probabilities here), and H the model
assumptions
● In maximum-likelihood models, we find the Q that maximizes P(D | Q, H)
● Bayes' theorem instead gives the posterior probability of the parameters Q:
P(Q | D, H) ∝ P(D | Q, H) · P(Q | H)
Posterior ∝ Likelihood × Prior
● Since the likelihood is multinomial, the conjugate prior is a Dirichlet distribution
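A minimal sketch (toy counts, assumed β = 0.5) of the Dirichlet-multinomial posterior predictive compared with the MLE; the Dirichlet prior keeps unseen words from getting probability 0:

```python
# With a multinomial likelihood and a symmetric Dirichlet(beta) prior, the posterior
# predictive probability of word w is (c_w + beta) / (N + beta * V), versus MLE c_w / N.
from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog"]   # "dog" is unseen
counts = Counter(corpus)
N, V, beta = len(corpus), len(vocab), 0.5

for w in vocab:
    mle = counts[w] / N
    bayes = (counts[w] + beta) / (N + beta * V)
    print(f"{w:>4s}  MLE={mle:.3f}  posterior-predictive={bayes:.3f}")
```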
20. Dirichlet, Hierarchical Dirichlet, Pitman-Yor
● A Dirichlet Process (DP) is a ‘distribution over distributions’
● Draws from a DP are random distributions with certain properties
○ Its ‘base distribution’ parameter plays a role analogous to the mean of a Normal
distribution
○ The DP is non-parametric
● Hierarchical Dirichlet Process : Similar to Hierarchical Models
21. Pitman Yor Process
● It is also a distribution over distributions.
● But a draw G1 = PY(d, θ, G0) from the process is an infinite discrete probability
distribution, consisting of
○ an infinite set of atoms drawn from G0
○ weights drawn from a two-parameter Poisson-Dirichlet distribution
■ a probability distribution on the set of decreasing positive sequences that
sum to 1
● A draw from G1 looks like G1 = Σk πk δφk, with atoms φk ~ G0 and weights
(π1, π2, …) from the two-parameter Poisson-Dirichlet distribution
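A minimal sketch of an approximate (truncated) draw from G1 = PY(d, θ, G0) via stick-breaking; the toy vocabulary standing in for G0 and the parameter values are assumptions:

```python
# Stick-breaking draw: weights pi_k = V_k * prod_{j<k}(1 - V_j) with
# V_k ~ Beta(1 - d, theta + k*d), and atoms phi_k drawn i.i.d. from G0 (uniform here).
import random

def draw_py(d, theta, vocab, n_atoms=1000, seed=0):
    rng = random.Random(seed)
    remaining, weights, atoms = 1.0, [], []
    for k in range(1, n_atoms + 1):
        v = rng.betavariate(1 - d, theta + k * d)
        weights.append(remaining * v)
        atoms.append(rng.choice(vocab))     # atom phi_k ~ G0
        remaining *= 1 - v
    return atoms, weights                   # truncated draw; weights sum to ~1

atoms, weights = draw_py(d=0.5, theta=1.0, vocab=["a", "b", "c", "d", "e"])
print(sum(weights))                                   # close to 1
print(atoms[:5], [round(w, 3) for w in weights[:5]])  # first few atoms and weights
```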
22. Chinese Restaurant Process
● Let x1, x2, … be drawn from G1. The conditional distribution of the next draw after a
sequence of c draws is
P(xc+1 | x1, …, xc) = Σk (ck − d) / (θ + c) · δφk + (θ + d·t) / (θ + c) · G0
○ ck is the number of customers sitting at table k
○ θ is the strength parameter
○ t is the number of occupied tables
○ δφk is the point mass located at φk
● De Finetti's theorem
○ Exchangeability of the sequence implies that there must be a distribution over
distributions G1 such that x1, x2, … are conditionally independent and identically
distributed draws from G1
● The Pitman-Yor process is one such distribution over G1
23. Pitman Yor Process in Word Distributions
● Consider
○ the Pitman-Yor process as a prior for unigram word distributions
○ base distribution G0 = uniform over the vocabulary
● We model the desired unigram distribution over words as a draw G1 from the
Pitman-Yor process
24. Analogy with CRP
● For each of the cw occurrences of a word w ∈ W, a customer is eating dish w in the
Chinese restaurant representation
● But the CRP has infinitely many tables, each with infinite capacity, so the same dish
may be served at several tables
○ Let tw be the number of tables serving dish w
● The predictive probability of a new word w given the seating arrangement is
P(w | seating) = (cw − d·tw) / (θ + c·) + (θ + d·t·) / (θ + c·) · G0(w)
where c· and t· are the total customer and table counts
● With tw = 1, the first term is exactly an absolute discount of the count cw
● The additive term can be understood as interpolation with the uniform distribution
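A minimal sketch of this unigram predictive probability; the function name, counts, and parameter values are all made up for illustration:

```python
# Predictive probability of word w under the unigram Pitman-Yor model, interpolating
# the discounted count with the uniform base distribution G0.
def p_word(c_w, t_w, c_total, t_total, d, theta, vocab_size):
    """(c_w - d*t_w)/(theta + c_total) + (theta + d*t_total)/(theta + c_total) * 1/V."""
    g0 = 1.0 / vocab_size
    return ((c_w - d * t_w) / (theta + c_total)
            + (theta + d * t_total) / (theta + c_total) * g0)

# With t_w = 1, the first term is an absolute discount of the raw count, as on the slide.
print(p_word(c_w=5, t_w=1, c_total=100, t_total=30, d=0.5, theta=1.0, vocab_size=1000))
```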
25. Language Modelling using Hierarchical PY
● Given
○ A context u consisting of a sequence of up to n − 1 words
● Let Gu(w) be the distribution over the current word w
● Use a Pitman-Yor process as the prior for Gu(w): Gu ~ PY(d|u|, θ|u|, Gπ(u))
● π(u) is the suffix of u consisting of all but the first word (the back-off context)
26. Similarities with Interpolated KN
● The predictive probability of the next word w after context u, given the seating
arrangement, is
P(w | u) = (cuw − d|u|·tuw) / (θ|u| + cu·) + (θ|u| + d|u|·tu·) / (θ|u| + cu·) · P(w | π(u))
● Suppose that tuw = min(1, cuw), i.e. each word type occupies at most one table in
every restaurant
● Surprise! The equation directly reduces to the predictive probabilities given by
interpolated Kneser-Ney
● The strength and discount parameters depend on the length of the context
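A minimal sketch (toy counts, hypothetical function name) of the recursive predictive rule with tuw = min(1, cuw); setting θ = 0 gives exactly the interpolated Kneser-Ney form. In the full model the lower-order counts arise as table counts passed down from longer contexts; here they are simply given:

```python
def predictive(w, u, counts, d, theta, vocab_size):
    """P(w | u) with back-off to the suffix of u; bottoms out at the uniform base G0."""
    ctx = counts.get(u, {})
    c_uw = ctx.get(w, 0)
    c_u = sum(ctx.values())
    t_uw = min(1, c_uw)                           # simplifying assumption
    t_u = sum(min(1, c) for c in ctx.values())
    # lower-order (back-off) distribution: suffix of u, or uniform base at the root
    lower = (1.0 / vocab_size if len(u) == 0
             else predictive(w, u[1:], counts, d, theta, vocab_size))
    if c_u == 0:
        return lower
    return (max(c_uw - d * t_uw, 0.0) / (theta + c_u)
            + (theta + d * t_u) / (theta + c_u) * lower)

# Hypothetical bigram and unigram counts.
counts = {
    ("san",): {"francisco": 20},
    ("reading",): {"glasses": 2, "books": 3},
    (): {"francisco": 20, "glasses": 2, "books": 3, "san": 20, "reading": 5},
}
# With theta = 0 this matches the interpolated Kneser-Ney formula.
print(predictive("glasses", ("reading",), counts, d=0.75, theta=0.0, vocab_size=10))
print(predictive("francisco", ("reading",), counts, d=0.75, theta=0.0, vocab_size=10))
```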
28. Details
● Data
○ 16 million word corpus from APNews
○ 1 million words from WSJ dataset
● n-gram order
○ trigram for the APNews data
○ bigram for the WSJ data
● Models
○ HPY
○ M-KN (cmax = 2, 3)
○ I-KN
29. Experiment 1
● IKN is just an approximate inference scheme in the HPY language model