UNIVERSITY OF CHICAGO
Topic Modeling and its relation to Word
Embedding
by
Sihan Chen
A thesis submitted in partial fulfillment for the
degree of Master of Statistics
in the
Department of Statistics
July 2016
Declaration of Authorship
I, Sihan Chen, declare that this thesis titled, ‘Topic Modeling and its relation to
Word Embedding’ and the work presented in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for a research degree
at this University.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made
clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
UNIVERSITY OF CHICAGO
Abstract
Department of Statistics
Master of Statistics
by Sihan Chen
Topic Modeling has been a popular area of research for more than a decade. Each word in a document is assigned to a topic, which is a distribution over words, and thus each article becomes a mixture of topics. As a result, we can group similar articles based on the topic assignments of their words. Even though a large number of models have been applied to topic modeling, I mainly focus on two methods: Latent Dirichlet Allocation (LDA) and Von Mises-Fisher (vMF) Clustering. The LDA model is based on Variational Bayes; vMF Clustering, on the other hand, needs extra information gained from Word Embedding. In this paper, I compare the two methods through their topic words, topic cooccurrence words, and their Pointwise Mutual Information (PMI).
Acknowledgements
Thanks to my advisor, Professor John Lafferty, and his student Mahtiyar Bonakdarpour.
Contents

Declaration of Authorship
Abstract
Acknowledgements
Abbreviations
Symbols

1 Introduction and Motivation
2 Development of Topic Models
2.1 Mixture of Unigrams
2.2 Probabilistic Latent Semantic Indexing
3 Latent Dirichlet Allocation
3.1 Introduction
3.2 Posterior Distribution
3.3 Variational Inference
4 Von Mises-Fisher Clustering
4.1 Introduction
4.2 Parameter Inference
5 Experiment
5.1 Topic Assignment of Words
5.2 Word Cooccurrence
5.3 PMI
6 Conclusion
Bibliography
Abbreviations
LDA Latent Dirichlet Allocation
vMF Von Mises-Fisher
Symbols
β topic
z topic assignment
θ topic proportion
K the number of topics
k the topic subscript
D the number of documents in the corpus
d the document subscript
N the number of words in a document
w the specific observed word
Chapter 1
Introduction and Motivation
Topic Modeling has been a popular area of research for more than a decade. It can also be applied to many areas of industry, such as search engines and document classification. A topic is now defined as a "distribution over the vocabulary". Generally, such a distribution is discrete, since the number of words in the vocabulary is finite. In other cases, however, topics can be continuous: when words are represented by word embedding vectors, topics can be defined as distributions over these embedding vectors. The topic is a global variable, defined at the corpus level.
Two further concepts, Topic Proportions and Topic Assignment, have therefore been defined and parameterized in topic models. Topic Proportions are the proportions of each topic in a specific document. Topic Assignment refers to the topic assigned to each word (early topic models assigned topics at the document level). After a word has been assigned to a topic, the probability of that word appearing within that topic is maximized. Articles can be classified according to the topic proportions of their key words. What's more, new words that have never appeared before can be assigned to a topic according to the topic proportions of the document.
This paper will introduce some basic topic models. It mainly focuses on the comparison of two topic models: Latent Dirichlet Allocation (LDA) and Von Mises-Fisher (vMF) Clustering.
Chapter 2
Development of Topic Models
The concept of a "topic" depends on the topic model, and its meaning has kept changing as topic models have matured over the decades. This can be seen in the early examples of topic models below:
2.1 Mixture of Unigrams
This is an early version of the topic model. Each article is assigned to exactly one topic, and the distribution of each word depends on the topic assigned to the article. This is a big improvement over the unigram model, where the distribution of each word is fixed, independent of topics. The probability of a document is:

p(w) = Σ_z p(z) Π_{n=1}^{N} p(w_n | z)

This model suffers from overfitting, in the sense that it does not account for words that never appear in the training set. The probability of a new word will be exceedingly low, and thus the perplexity of this model will explode.
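To make the formula concrete, here is a minimal sketch (not code from the thesis) of computing log p(w) for one document under a mixture of unigrams; the toy arrays below are purely illustrative.

import numpy as np

def mixture_of_unigrams_log_likelihood(doc, topic_prior, word_dists):
    # doc: list of word ids; topic_prior: p(z), shape (K,); word_dists: p(w|z), shape (K, V).
    # Work per topic in log space to avoid underflow on long documents.
    log_per_topic = np.log(topic_prior) + np.log(word_dists[:, doc]).sum(axis=1)
    # log-sum-exp over topics gives log p(w) = log Σ_z p(z) Π_n p(w_n | z).
    m = log_per_topic.max()
    return m + np.log(np.exp(log_per_topic - m).sum())

# Toy usage: 2 topics, a vocabulary of 4 words, and a 3-word document.
p_z = np.array([0.5, 0.5])
p_w_given_z = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.1, 0.1, 0.1, 0.7]])
print(mixture_of_unigrams_log_likelihood([0, 0, 3], p_z, p_w_given_z))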
2.2 Probabilistic Latent Semantic Indexing
Probabilistic Latent Semantic Indexing (pLSI) is a topic model that assigns topics at the level of each word. As a result, each article contains multiple topics. The probability of a word w_n in document d is defined to be:

p(w_n | d) = Σ_z p(w_n | z) p(z | d)

This model can assign a positive probability to an unseen word based on the proportion of topics within each document. However, it does not solve the problem of overfitting either. Since d ranges over the documents of the training set, the topic assignment procedure cannot learn anything that generalizes beyond the training set, and there is no standard way of assigning topics to the words of unseen documents. Moreover, the number of parameters that needs to be estimated grows linearly with the number of documents.
Chapter 3
Latent Dirichlet Allocation
3.1 Introduction
Latent Dirichlet Allocation (LDA) is a Bayesian hierarchical model that treats words as observed variables and contains multiple latent variables: z, the topic assignment; β, the topic; and θ, the topic proportions. β is a discrete distribution over a fixed vocabulary; z is a multinomial variable over the K topics; θ serves as the parameter of z, giving the probability of assigning each specific topic to z. Compared with pLSI, LDA introduces a new latent variable for the topic assignment: by adding another K − 1 parameters for the topic proportions, it can be applied to any new document outside the training set without introducing new parameters.
In addition, to make the posterior inference work, LDA places a Dirichlet prior with parameter η on the latent variable β, and one with parameter α on the latent variable θ. Let V be the size of the vocabulary and K the number of topics. Then the prior of β is V-dimensional with parameter η, denoted Dir_V(η); the prior of θ is K-dimensional with parameter α, denoted Dir_K(α). The generative process is the following:
1. Choose values for the V-dimensional and K-dimensional prior parameters η and α.
2. For each topic k = 1, ..., K, draw β_k ∼ Dir_V(η).
3. For each document d = 1, ..., D, draw θ_d ∼ Dir_K(α).
4. For each word n = 1, ..., N in document d:
(a) Draw z_{d,n} ∼ Mult_K(θ_d), with z_{d,n} ∈ {1, ..., K}.
(b) Draw w_{d,n} ∼ Mult_V(β_{z_{d,n}}), with w_{d,n} ∈ {1, ..., V}.
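As a small illustration of the process just listed, the following sketch simulates it with numpy; the sizes V, K, D, N and the hyperparameters are toy values, not those used in the experiments.

import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 1000, 10, 5, 50
eta, alpha = 0.01, 0.1

# Step 2: one V-dimensional topic beta_k per topic, drawn from Dir_V(eta).
beta = rng.dirichlet(np.full(V, eta), size=K)          # shape (K, V)

docs = []
for d in range(D):
    # Step 3: document-level topic proportions theta_d ~ Dir_K(alpha).
    theta_d = rng.dirichlet(np.full(K, alpha))
    words = []
    for n in range(N):
        # Step 4a: topic assignment z_{d,n} ~ Mult_K(theta_d).
        z_dn = rng.choice(K, p=theta_d)
        # Step 4b: word w_{d,n} ~ Mult_V(beta_{z_{d,n}}).
        words.append(rng.choice(V, p=beta[z_dn]))
    docs.append(words)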
From this process, we can see that there are three levels of variables: the corpus level, the document level, and the word level. β is the corpus-level variable, θ is the document-level variable, and w and z are the word-level variables. After this generative process, we need to find the posterior distribution level by level.
3.2 Posterior Distribution
Finding the posterior distribution of the latent variables is the essential step in drawing any useful and practical conclusion from the documents.
Once we know the distribution of θ_d for each article, we can group similar articles based on the differences between their θ values. We can use the formula below to measure document similarity (Blei and Lafferty 2009):

document-similarity_{d,f} = Σ_{k=1}^{K} (θ̂_{d,k} − θ̂_{f,k})²
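A minimal sketch of this score (illustrative code, not the thesis implementation): the squared Euclidean distance between two estimated topic-proportion vectors, where smaller values mean the documents use topics in more similar proportions.

import numpy as np

def document_similarity(theta_d, theta_f):
    # Sum over topics of the squared difference in estimated proportions.
    theta_d, theta_f = np.asarray(theta_d), np.asarray(theta_f)
    return np.sum((theta_d - theta_f) ** 2)

print(document_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # approximately 0.02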
Besides, knowing the value of β_k for each topic helps us understand what each topic is about. From the distribution of a specific β_k, we know the probability of each word appearing in that topic, and thus the most frequent words within the topic. This notion is very helpful for search engines: they can surface articles and websites related to a query based on how frequently the key words appear across topics, not only the articles and websites that contain those key words verbatim.
Most importantly, once we know the values of θ and β, we can obtain the probability of the topic assignment z_{d,n} for a specific word within a document. Intuitively, the probability of assigning topic k to z_{d,n} is proportional to the topic proportion θ_d for k, times the probability of word w_{d,n} appearing within topic k, which is the entry for word w_{d,n} in β_k. The relationship is the following:

p(z_n | z_{−n}, w) = p(w_n | β_{1:K}) · ∫_θ p(z_n | θ) p(θ | z_{−n}) dθ

However, computing z_n this way is very expensive, since θ is K-dimensional and the integral must be evaluated over all K dimensions. If we instead consider the joint posterior of β_k, θ_d, and z_n, which is

p(β_{1:K}, θ_{1:D}, z_{1:D,1:N} | w_{1:D,1:N}) = p(β_{1:K}, θ_{1:D}, z_{1:D,1:N}, w_{1:D,1:N}) / ∫_{β_{1:K}} ∫_{θ_{1:D}} Σ_{z_{1:D,1:N}} p(β_{1:K}, θ_{1:D}, z_{1:D,1:N}, w_{1:D,1:N}),

it is intractable to compute. Therefore, we need some other method to estimate the latent variables z, β, and θ.
3.3 Variational Inference
To solve the computational problem of the standard approach to posterior inference, we need some other way to approximate the distribution of the latent variables. In this project, I choose the variational inference method.
First of all, variational parameters have to be introduced for the target variables β, θ, and z. Let λ, γ, and φ be the parameters of β, θ, and z, respectively. Then we can denote the variational distributions of β, θ, and z by q(β|λ), q(θ|γ), and q(z|φ), with:

q(β|λ) = Dir_V(λ);  q(θ|γ) = Dir_K(γ);  q(z|φ) = Mult(φ)

As a result, the three target variables β, θ, and z become conditionally independent, and their joint variational distribution factorizes as:

q(β_{1:K}, θ_{1:D}, z_{1:D,1:N}) = Π_{k=1}^{K} q(β_k | λ_k) Π_{d=1}^{D} ( q(θ_d | γ_d) Π_{n=1}^{N} q(z_{d,n} | φ_{d,n}) )
In order to make the variational distribution as close as possible to the true posterior, we minimize the Kullback-Leibler (KL) divergence by finding the optimal variational parameters (Blei and Lafferty 2009):

arg min_{γ_{1:D}, λ_{1:K}, φ_{1:D,1:N}} KL( q(θ_{1:D}, β_{1:K}, z_{1:D,1:N}) || p(θ_{1:D}, β_{1:K}, z_{1:D,1:N} | w) )
Now we define the Evidence Lower Bound (ELBO) to be:

L(w, φ, γ, λ) := E_q[log p(w, z, θ, β)] − E_q[log q(z, θ, β)]

Since we have the relation

log p(w | α, η) = log ∫_θ ∫_β Σ_z p(θ, β, z, w | α, η) dθ dβ
              = log ∫_θ ∫_β Σ_z p(θ, β, z, w | α, η) ( q(θ, β, z) / q(θ, β, z) ) dθ dβ
              ≥ ∫_θ ∫_β Σ_z q(θ, β, z) log p(θ, β, z, w | α, η) dθ dβ − ∫_θ ∫_β Σ_z q(θ, β, z) log q(θ, β, z) dθ dβ
              = E_q[log p(θ, β, z, w | α, η)] − E_q[log q(θ, β, z)]
              = L(w, φ, γ, λ)

(the inequality is Jensen's inequality), and

KL( q(θ, β, z) || p(θ, β, z | w) ) = ∫_θ ∫_β Σ_z q(θ, β, z) log ( q(θ, β, z) / p(θ, β, z | w) ) dθ dβ
                                = E_q[log q(θ, β, z)] − E_q[log p(θ, β, z | w)],

we see that:

L(w, φ, γ, λ) + KL( q(θ, β, z) || p(θ, β, z | w) ) = E_q[log p(θ, β, z, w | α, η)] − E_q[log p(θ, β, z | w)] = log p(w | α, η).

Thus, maximizing L(w, φ, γ, λ) is equivalent to minimizing the KL divergence.
Now let us factorize L(w, φ, γ, λ) following Hoffman et al. (2010), where W denotes the size of the vocabulary:

L(w, φ, γ, λ) = Σ_d { E_q[log p(w_d | θ_d, z_d, β)] + E_q[log p(z_d | θ_d)] − E_q[log q(z_d)]
              + E_q[log p(θ_d | α)] − E_q[log q(θ_d)] + ( E_q[log p(β | η)] − E_q[log q(β)] ) / D }

            = Σ_d { Σ_w N_dw Σ_k φ_dwk ( E_q[log θ_dk] + E_q[log β_kw] − log φ_dwk )
              − log Γ(Σ_k γ_dk) + Σ_k ( (α − γ_dk) E_q[log θ_dk] + log Γ(γ_dk) )
              + ( Σ_k ( −log Γ(Σ_w λ_kw) + Σ_w ( (η − λ_kw) E_q[log β_kw] + log Γ(λ_kw) ) ) ) / D
              + log Γ(Kα) − K log Γ(α) + ( log Γ(Wη) − W log Γ(η) ) / D }

            := Σ_d ℓ(N_d, φ_d, γ_d, λ)
Taking the derivatives with respect to φ_dwk, γ_dk, and λ_kw and setting them to zero, we get the following update equations for these parameters:

φ_dwk ∝ exp{ E_q[log θ_dk] + E_q[log β_kw] }
γ_dk = α + Σ_w N_dw φ_dwk
λ_kw = η + Σ_d N_dw φ_dwk
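A minimal sketch of these coordinate updates for a single document is given below. It uses the standard Dirichlet expectations E_q[log θ_dk] = ψ(γ_dk) − ψ(Σ_k γ_dk) and E_q[log β_kw] = ψ(λ_kw) − ψ(Σ_w λ_kw), where ψ is the digamma function; the variable names are illustrative rather than taken from the thesis code.

import numpy as np
from scipy.special import digamma

def e_step_single_doc(counts, lam, alpha, n_iter=100, tol=1e-5):
    # counts: (V,) word counts N_dw for one document; lam: (K, V) topic parameters lambda_kw.
    K, V = lam.shape
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # E_q[log beta_kw]
    gamma = np.ones(K)                                                   # gamma_dk = 1 to start
    for _ in range(n_iter):
        Elog_theta = digamma(gamma) - digamma(gamma.sum())               # E_q[log theta_dk]
        log_phi = Elog_theta[:, None] + Elog_beta                        # unnormalized, shape (K, V)
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                                           # normalize over topics k
        new_gamma = alpha + phi @ counts                                 # gamma_dk = alpha + sum_w N_dw phi_dwk
        converged = np.mean(np.abs(new_gamma - gamma)) < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, phi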
It is tempting to update φ_dwk, γ_dk, and λ_kw over every document d at once. However, since the number of documents available online is huge, this batch update is still computationally expensive. Another algorithm addresses this problem by treating all documents as a single combined document and then simply repeating the update D times, where each time the word counts are the same. The algorithm is the following (Hoffman et al. 2010):

OnlineLDA:
Define ρ_t = (τ_0 + t)^{−κ}
Initialize λ randomly.
for t = 0 to D do:
    E step:
    Initialize γ_tk = 1. (The constant 1 is arbitrary.)
    repeat
        Set φ_twk ∝ exp{ E_q[log θ_tk] + E_q[log β_kw] }
        Set γ_tk = α + Σ_w φ_twk N_tw
    until (1/K) Σ_k |change in γ_tk| < 0.00001
    M step:
    Compute λ̃_kw = η + D N_tw φ_twk
    Set λ = (1 − ρ_t) λ + ρ_t λ̃
end for

Because all documents are combined, the number of times word w appears in the combined document, N_tw, is the same for all t.
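A minimal sketch of the M step above, assuming the E step has already produced φ for the current combined document; the learning-rate parameters τ_0 and κ below are illustrative values, not those used in the experiments.

import numpy as np

def online_m_step(lam, counts, phi, eta, D, t, tau0=1.0, kappa=0.7):
    # lam: (K, V) current lambda; counts: (V,) counts N_tw; phi: (K, V) from the E step.
    rho_t = (tau0 + t) ** (-kappa)                   # rho_t = (tau0 + t)^(-kappa)
    lam_tilde = eta + D * counts[None, :] * phi      # lambda~_kw = eta + D * N_tw * phi_twk
    return (1.0 - rho_t) * lam + rho_t * lam_tilde   # blend the candidate into lambda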
After we obtain the topic parameters λ_kw, we find the remaining document-specific parameters γ_dk and φ_dwk on the test set, using the following algorithm (Hoffman et al. 2010):

Batch LDA:
Choose the λ obtained from OnlineLDA.
while relative improvement in L(w, φ, γ, λ) > 0.00001 do
    for d = 0 to D do:
        Initialize γ_dk = 1. (The constant 1 is arbitrary.)
        repeat
            Set φ_dwk ∝ exp{ E_q[log θ_dk] + E_q[log β_kw] }
            Set γ_dk = α + Σ_w φ_dwk N_dw
        until (1/K) Σ_k |change in γ_dk| < 0.00001
    end for
end while
After we obtain these variational parameters, we can approximate the latent variables β, θ, and z, and assign topics to the words in each document.
Chapter 4
Von Mises-Fisher Clustering
4.1 Introduction
As a topic model, Latent Dirichlet Allocation does not take into account the semantic meaning of each word. To address this, a model called Von Mises-Fisher (vMF) Clustering has been used for topic modeling. It is effectively a generalized LDA model in which each topic becomes a continuous distribution, while the topic proportions are still drawn as in LDA.
Each topic is defined by a point on a generalized unit sphere. Topic k has a center µ_k and a dispersion parameter κ_k, so that each word (embedding vector) assigned to topic k is generated by a Von Mises-Fisher (vMF) distribution with concentration κ_k and mean direction µ_k. The µ_k themselves have a vMF prior with parameters µ_0 and C_0. The distribution of x_dn given µ_k and κ_k is:

f(x_dn; µ_k, κ_k) = C_M(κ_k) exp(κ_k µ_k^T x_dn)

where M is the dimension of the word embedding vectors and C_M(κ_k) is the normalizing constant,

C_M(κ_k) = κ_k^{M/2−1} / ( (2π)^{M/2} I_{M/2−1}(κ_k) ),

where I_v(·) is the modified Bessel function of the first kind.
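As a sanity check on the density above, here is a minimal sketch of its log using scipy's exponentially scaled Bessel function ive (so that I_v(κ) = ive(v, κ) e^κ); the inputs are assumed to be unit vectors, and the example values are purely illustrative.

import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function, for stability

def vmf_log_density(x, mu, kappa):
    M = x.shape[0]
    v = 0.5 * M - 1.0
    # log C_M(kappa) = (M/2 - 1) log kappa - (M/2) log 2pi - log I_{M/2-1}(kappa)
    log_norm = v * np.log(kappa) - 0.5 * M * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)
    return log_norm + kappa * mu @ x

x = np.array([1.0, 0.0, 0.0])
mu = np.array([1.0, 0.0, 0.0])
print(vmf_log_density(x, mu, kappa=5.0))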
The topic proportions parameter θ is still drawn from the same process as in LDA. We get the following generative process for vMF clustering (Gopal et al. 2014):

θ_d ∼ Dirichlet(α)                      d = 1, ..., D
κ_k ∼ log-Normal(m, σ²)                 k = 1, ..., K
µ_k ∼ vMF(µ_0, C_0)                     k = 1, ..., K
z_dn ∼ Mult(θ_d)                        d = 1, ..., D;  n = 1, ..., N
x_dn ∼ vMF(µ_{z_dn}, κ_{z_dn})          d = 1, ..., D;  n = 1, ..., N
4.2 Parameter Inference
As with LDA, it is impossible to directly compute the posterior distribution of the latent variables from the formula

p(z_{1:D,1:N}, θ_{1:D}, µ_{1:K}, κ_{1:K} | w_{1:D}) = p(z_{1:D,1:N}, θ_{1:D}, µ_{1:K}, κ_{1:K}, w_{1:D}) / ∫_{κ_{1:K}} ∫_{µ_{1:K}} ∫_{θ_{1:D}} Σ_{z_{1:D,1:N}} p(z_{1:D,1:N}, θ_{1:D}, µ_{1:K}, κ_{1:K}, w_{1:D}).

One way to approximate the posterior distribution is variational inference. Again, we denote the variational distribution by q. First, we introduce variational parameters for each latent variable in order to break the dependencies between them:

q(z, θ, µ, κ) = q(z) q(θ) q(µ) q(κ)

For the variables z, θ, and µ, we introduce the variational parameters λ, ρ, ψ, and γ, such that (Gopal et al. 2014):

q(z_dn) = Mult(z | λ_dn)       d = 1, ..., D;  n = 1, ..., N
q(θ_d) = Dir(θ_d | ρ)          d = 1, ..., D
q(µ_k) = vMF(µ_k | ψ_k, γ_k)   k = 1, ..., K
The update equations for these variational parameters are the following (Gopal et al. 2014):

R_k = E_q[κ_k] Σ_{n=1}^{N} E_q[z_nk] x_n + C_0 µ_0
ψ_k = R_k / ||R_k||
γ_k = ||R_k||
λ_dnk ∝ exp{ E_q[log vMF(x_dn | µ_k, κ_k)] + E_q[log θ_dk] }

where

E_q[log vMF(x_n | µ_k, κ_k)] = E_q[log C_M(κ_k)] + E_q[κ_k] x_n^T E_q[µ_k]
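A minimal sketch of the µ_k update, with illustrative array names (X for the word embedding matrix, resp for the responsibilities E_q[z_nk], E_kappa for E_q[κ_k]); it follows the equations above and is not the thesis code.

import numpy as np

def update_mu_variational(X, resp, E_kappa, mu0, C0):
    # X: (N, M) unit word vectors; resp: (N, K) responsibilities E_q[z_nk];
    # E_kappa: (K,) expected concentrations; mu0: (M,) prior direction; C0: prior concentration.
    # R_k = E_q[kappa_k] * sum_n E_q[z_nk] x_n + C0 * mu0, computed for every topic k at once.
    R = E_kappa[:, None] * (resp.T @ X) + C0 * mu0[None, :]   # shape (K, M)
    gamma = np.linalg.norm(R, axis=1)                         # gamma_k = ||R_k||
    psi = R / gamma[:, None]                                  # psi_k  = R_k / ||R_k||
    return psi, gamma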
For the concentration parameter κ_k, variational inference is not possible because there is no conjugate prior paired with the log-Normal distribution. Thus, we approximate its distribution by a sampling method.
However, sampling requires draws from the true conditional distribution, while we only have the posterior approximation given by the variational parameters. We can, however, approximate the conditional distribution through the expectation under the variational posterior (Gopal et al. 2014):

P(κ_k | X, m, σ², µ_0, C_0, α) ∝ P(κ_k, X | m, σ², µ_0, C_0, α)
    ≈ E_q[ P(κ_k, X, Z, µ | m, σ², µ_0, C_0, α) ]
    ≥ exp{ E_q[ log P(κ_k, X, Z, µ | m, σ², µ_0, C_0, α) ] }
    ∝ exp{ Σ_{n=1}^{N} E_q[z_nk] log C_M(κ_k) + κ_k Σ_{n=1}^{N} E_q[z_nk] x_n^T E_q[µ_k] } × log-Normal(κ_k | m, σ²)        (4.1)

Therefore, we can use Metropolis-Hastings sampling with the log-Normal distribution as the proposal distribution.
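A minimal sketch of such a Metropolis-Hastings step for a single κ_k, assuming the summary statistics n_k = Σ_n E_q[z_nk] and s_k = Σ_n E_q[z_nk] x_n^T E_q[µ_k] have already been computed; the multiplicative random-walk proposal and its Hastings correction are one standard choice, not necessarily the exact scheme used in the thesis.

import numpy as np
from scipy.special import ive

def vmf_log_normalizer(kappa, M):
    # log C_M(kappa) = (M/2 - 1) log kappa - (M/2) log 2pi - log I_{M/2-1}(kappa)
    v = 0.5 * M - 1.0
    return v * np.log(kappa) - 0.5 * M * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def log_target(kappa, n_k, s_k, m, sigma, M):
    # Expected log joint from Eq. (4.1) plus the log-Normal prior (up to an additive constant).
    log_prior = -np.log(kappa) - 0.5 * ((np.log(kappa) - m) / sigma) ** 2
    return n_k * vmf_log_normalizer(kappa, M) + kappa * s_k + log_prior

def mh_sample_kappa(kappa0, n_k, s_k, m, sigma, M, n_steps=500, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    kappa = kappa0
    for _ in range(n_steps):
        # Multiplicative log-normal random-walk proposal: kappa' = kappa * exp(step * N(0, 1)).
        kappa_new = kappa * np.exp(step * rng.standard_normal())
        # Hastings correction for the asymmetric proposal: q(k|k') / q(k'|k) = k_new / k.
        log_accept = (log_target(kappa_new, n_k, s_k, m, sigma, M)
                      - log_target(kappa, n_k, s_k, m, sigma, M)
                      + np.log(kappa_new) - np.log(kappa))
        if np.log(rng.uniform()) < log_accept:
            kappa = kappa_new
    return kappa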
Chapter 5
Experiment
Before we do anything on the test set, we train on an online Wikipedia corpus to obtain the values of the global topic variables. For the LDA method, we update λ, the variational parameter of β, through the training process. For the vMF Clustering method, we obtain the word embedding vectors from the online Wikipedia corpus. However, the variables µ_k and κ_k, which determine the distribution of topic k, are not trained online in advance; instead, we estimate them on the test set.
Combined with the topic distributions, the topic assignment for each test document can then be obtained using the topic proportions of each article. All results below are computed after the topic assignment has been completed.
5.1 Topic Assignment of Words
For each topic, we count the number of times each word has been assigned to that topic. We choose the number of topics to be ten and obtain the following lists of words assigned to each topic. Among all documents in the test set, the words appearing in each list have the highest counts for that topic. Notice that such a list is not a topic. Instead, it is just the topic assignment over all articles in the test set, and these articles contain many "junk words" with no semantic meaning, e.g., "www", "http", and "ref".
LDA
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
http ref http ref http
ref www bank france date
www date ref code ref
date jordan world date www
german time case language october
july key category game november
band work goals work band
bank american language record page
letter people london england ing
case programming date national year
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
ref http ref ref ref
www ref http http www
http www www code world
june date germany date http
bank world date world states
year swimming year swimming america
file year world letter james
music key work language year
link national book time key
july case band key case
vMF Clustering
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
letter high date history swimming
books system world page record
author power year text game
book data time article player
james object states press final
david convert june journal top
award support april version tour
letter high july law university
english model march life gold
john energy october science major
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
case key code bank ref
category work language national http
left band list main www
number music programming file pdf
type group including international retrived
single style small link index
set led company global net
results song languages center ing
individual made class financial external
called records format source em
Comparing the two tables, we see that with the LDA method some "junk words" show up many times across different columns, because there are many "junk words" in the test articles. In the vMF Clustering table, however, each word appears in at most one column. We can conclude that with vMF Clustering each word tends to be assigned to a single topic regardless of context, while with LDA a word can be assigned to different topics depending on the document it belongs to. In other words, for the vMF method the topic assignment of a word depends on the semantics of the word itself.
From these two tables, we can also say that the vMF Clustering method performs better in some sense, because it deals with the large proportion of "junk words" by assigning them all to a single topic.
5.2 Word Cooccurrence
In this context, word cooccurrence is defined as two adjacent words in a document that share the same topic assignment. Word cooccurrence characterizes a topic better than single words, because each word can have different meanings in different topics, and a pair of words occurring together takes the context into account. In the following two tables, I show the most frequent cooccurring word pairs within each topic using the LDA and vMF methods, respectively.
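A minimal sketch of how such adjacent same-topic pairs could be counted; the per-document list of (word, topic) tokens is an assumed input format, not the thesis data structure.

from collections import Counter

def topic_cooccurrence_counts(documents, num_topics):
    # One counter of (word, next_word) pairs per topic.
    counts = [Counter() for _ in range(num_topics)]
    for tokens in documents:
        for (w1, z1), (w2, z2) in zip(tokens, tokens[1:]):
            if z1 == z2:                      # neighbours assigned to the same topic
                counts[z1][(w1, w2)] += 1
    return counts

docs = [[("programming", 8), ("language", 8), ("history", 3)]]
print(topic_cooccurrence_counts(docs, 10)[8].most_common(1))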
Here is the result from the LDA:
Topic 1 Topic 2 Topic 3 Topic 4
(language,german) (michael,jordan) (http,www) (date,november)
(main,page) (iso,iso) (player,award) (letter,case)
(research,institute) (years,years) (scoring,goals) (state,single)
(mail,date) (science,recursion) (score,goals) (top,level)
(single,games) (work,time) (key,case) (num,date)
(record,high) (santa,cruz) (system,cross) (flash,flash)
(institute,date) (european,descent) (english,language) (holding,company)
(stock,exchange) (john,hughes) (mathematics,mathematics) (page,base)
(short,forms) (link,date) (iso,alpha) (mixed,types)
Topic 5 Topic 6 Topic 7
(world,economic) (retrieved,june) (http,www)
(australia,years) (function,computing) (literature,literature)
(credit,card) (science,data) (research,info)
(return,return) (appearance,record) (run,time)
(volume,issue) (classical,music) (land,area)
(object,oriented) (power,forward) (page,base)
(fat,cat) (hit,single) (string,format)
(deal,location) (history,formation) (line,comments)
(variables,variables) (million,copies) (typically,occurs)
Topic 8 Topic 9 Topic 10
(date,janurary) (programming,language) (year,population)
(july,date) (key,case) (http,www)
(june,area) (letter,case) (programming,language)
(year,swimming) (time,list) (key,case)
(york,stock) (function,parameter) (time,top)
(east,germany) (times,guide) (august,image)
(january,established) (prototype,based) (date,release)
(computer,graphics) (based,programming) (population,density)
(computer,science) (computer,science) (letter,case)
Here is the result from vMF Clustering:
Topic 1 Topic 2 Topic 3 Topic 4
(author,stephen) (power,parity) (date,january) (volume,issue)
(stephen,thomas) (array,data) (date,april) (issue,pages)
(bill,russell) (system,static) (date,october) (science,nature)
(chris,martin) (increase,decrease) (date,november) (review,vol)
(van,der) (rate,spikes) (date,august) (volume,page)
(literature,literature) (growth,rate) (largest,city) (history,background)
(harris,john) (air,force) (year,population) (journal,page)
(author,fox) (high,rate) (date,december) (issue,page)
(author,chan) (high,voltage) (london,england) (detail,article)
Topic 5 Topic 6 Topic 7
(credit,card) (left,upright) (music,heavy)
(university,students) (color,black) (song,records)
(final,game) (function,parameter) (recording,sessions)
(scoring,goals) (upright,left) (classical,music)
(score,goals) (symbol,type) (pop,music)
(player,diego) (upper,lip) (making,music)
(width,diego) (considered,integral) (deal,work)
(training,session) (lowest,point) (performance,style)
(semi,final) (lower,alpha) (electronic,music)
Topic 8 Topic 9 Topic 10
(programming,language) (development,report) (http,www)
(class,file) (human,development) (index,index)
(language,languages) (bank,cooperate) (em,external)
(class,computer) (economic,community) (proc,arg)
(stock,exchange) (research,institute) (r,pdf)
(computer,programming) (location,map) (pdf,pdf)
(iso,iso) (trade,organization) (uni,index)
(based,programming) (management,financial) (dec,ai)
(prototype,based) (national,bank) (comp,index)
Though the two tables show different word pairs, both capture pairs that are representative of a topic. For the LDA method, the "junk words" do not show up as frequently as in the earlier single-word tables. Therefore, we can conclude that the cooccurrence tables represent topics better than the single-word tables.
5.3 PMI
Our final criterion for evaluating a topic modeling method is Pointwise Mutual Information (PMI), defined by the formula

PMI_k(w_i, w_j) = log( p_k(w_i, w_j) / ( p_k(w_i) p_k(w_j) ) )

where p_k(w_i, w_j) is the probability that the two words cooccur within topic k, and p_k(w_i) and p_k(w_j) are the probabilities of the individual words occurring in topic k. The higher the PMI, the better the topic model performs: the more likely two adjacent words are to be assigned to the same topic, the better the topic model. Here are the ten highest PMI values for each topic for the LDA and vMF Clustering methods:
LDA
Topic 1: 9.14 8.74 8.66 8.04 7.94 7.53 7.31 7.10 7.06 7.00
Topic 2: 9.87 9.87 9.87 9.17 8.48 8.48 8.48 8.22 8.07 8.07
Topic 3: 9.20 9.20 9.01 8.79 8.28 7.94 7.92 7.81 7.81 7.73
Topic 4: 9.59 9.59 8.49 8.49 7.98 7.98 7.80 7.80 7.51 7.47
Topic 5: 9.16 8.88 8.84 8.47 8.25 8.25 7.46 7.29 7.22 7.07
Topic 6: 8.48 8.48 8.37 8.19 7.90 7.78 7.78 7.60 7.56 7.50
Topic 7: 9.62 9.03 9.03 8.74 8.63 8.11 8.11 7.93 7.83 7.79
Topic 8: 8.84 8.62 8.44 8.25 8.15 8.07 7.93 7.63 7.34 7.31
Topic 9: 9.66 9.01 8.60 7.95 7.65 7.56 7.46 7.30 7.09 7.08
Topic 10: 10.5 9.81 9.00 8.47 7.76 7.68 7.61 7.61 7.47 7.00
vMF Clustering
Topic 1: 7.44 7.06 5.43 5.10 4.67 4.42 4.40 4.16 4.16 4.05
Topic 2: 6.92 6.71 6.64 5.87 5.82 5.80 5.78 5.72 5.72 5.54
Topic 3: 5.62 4.57 4.47 4.45 4.31 3.58 3.55 3.47 3.31 3.22
Topic 4: 5.68 5.47 5.10 5.06 4.20 4.15 3.87 3.36 3.36 3.35
Topic 5: 5.27 4.35 4.30 4.27 4.13 4.06 3.86 3.79 3.33 3.23
Topic 6: 8.86 7.08 6.25 6.18 6.15 6.03 5.96 5.82 5.51 5.30
Topic 7: 6.61 5.45 5.32 4.62 4.56 4.52 4.36 4.24 4.11 4.07
Topic 8: 5.29 5.06 4.46 4.46 4.43 4.42 4.36 4.08 4.04 3.94
Topic 9: 5.58 5.10 4.63 4.19 4.03 3.94 3.68 3.62 3.59 3.48
Topic 10: 9.02 8.45 7.75 7.58 5.33 4.99 4.83 4.15 3.91 3.68
From these two tables, we see that the PMI values of LDA are higher than those of vMF Clustering. This shows that LDA performs better than vMF Clustering in terms of the PMI score.
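For reference, a minimal sketch of how the per-topic PMI could be computed from within-topic pair counts and word counts; the inputs below are hypothetical, not the experimental data.

import numpy as np

def topic_pmi(pair_counts, word_counts):
    # pair_counts: {(wi, wj): n} within one topic; word_counts: {w: n} within the same topic.
    total_pairs = sum(pair_counts.values())
    total_words = sum(word_counts.values())
    pmi = {}
    for (wi, wj), n in pair_counts.items():
        p_ij = n / total_pairs
        p_i = word_counts[wi] / total_words
        p_j = word_counts[wj] / total_words
        pmi[(wi, wj)] = np.log(p_ij / (p_i * p_j))
    return pmi

pairs = {("programming", "language"): 30, ("key", "case"): 10}
words = {"programming": 40, "language": 50, "key": 60, "case": 70}
print(sorted(topic_pmi(pairs, words).items(), key=lambda kv: -kv[1]))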
Chapter 6
Conclusion
LDA and vMF Clustering are two representative topic modeling methods. In assigning a topic to a word, both methods take into account two factors: the topic distribution for a word and the document-specific topic proportions. Both use variational inference to approximate the distributions of the latent variables through online training. Compared with LDA, vMF Clustering considers the semantic meaning of each word, and thus relies more on the word itself than on the context of a specific document when making the topic assignment. One thing we can say for sure is that vMF Clustering is better at dealing with "junk words".
Bibliography
David M. Blei, Andrew Y. Ng, and Michael I. Jordan (John Lafferty, editor). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009.
Matthew D. Hoffman, David M. Blei, and Francis Bach. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems 23, 2010.
Siddharth Gopal and Yiming Yang. Von Mises-Fisher Clustering Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, Vol. 32, 2014.

More Related Content

What's hot

Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentationSoojung Hong
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet AllocationMarco Righini
 
Topic models
Topic modelsTopic models
Topic modelsAjay Ohri
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all thatZhibo Xiao
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsVincenzo Lomonaco
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
Information Retrieval using Semantic Similarity
Information Retrieval using Semantic SimilarityInformation Retrieval using Semantic Similarity
Information Retrieval using Semantic SimilaritySaswat Padhi
 
data_mining_Projectreport
data_mining_Projectreportdata_mining_Projectreport
data_mining_ProjectreportSampath Velaga
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Word2Vec on Italian language
Word2Vec on Italian languageWord2Vec on Italian language
Word2Vec on Italian languageFrancesco Cucari
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 

What's hot (20)

Canini09a
Canini09aCanini09a
Canini09a
 
Latent dirichletallocation presentation
Latent dirichletallocation presentationLatent dirichletallocation presentation
Latent dirichletallocation presentation
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
 
Topic models
Topic modelsTopic models
Topic models
 
Topic Models
Topic ModelsTopic Models
Topic Models
 
Parekh dfa
Parekh dfaParekh dfa
Parekh dfa
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Topic model, LDA and all that
Topic model, LDA and all thatTopic model, LDA and all that
Topic model, LDA and all that
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Information Retrieval using Semantic Similarity
Information Retrieval using Semantic SimilarityInformation Retrieval using Semantic Similarity
Information Retrieval using Semantic Similarity
 
data_mining_Projectreport
data_mining_Projectreportdata_mining_Projectreport
data_mining_Projectreport
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Word2Vec on Italian language
Word2Vec on Italian languageWord2Vec on Italian language
Word2Vec on Italian language
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 

Similar to graduate_thesis (1)

Mini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingMini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingTomonari Masada
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networksconnectbeubax
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
A scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linkingA scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linkingSunny Kr
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingTomonari Masada
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain OntologyKeerti Bhogaraju
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnRwanEnan
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesBryan Gummibearehausen
 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...TELKOMNIKA JOURNAL
 
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...ijsc
 

Similar to graduate_thesis (1) (20)

Topic modelling
Topic modellingTopic modelling
Topic modelling
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Mini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic ModelingMini-batch Variational Inference for Time-Aware Topic Modeling
Mini-batch Variational Inference for Time-Aware Topic Modeling
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linkingA scalable gibbs sampler for probabilistic entity linking
A scalable gibbs sampler for probabilistic entity linking
 
Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic ModelingContext-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Lec1
Lec1Lec1
Lec1
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Diversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News StoriesDiversified Social Media Retrieval for News Stories
Diversified Social Media Retrieval for News Stories
 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
 
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
TEXTS CLASSIFICATION WITH THE USAGE OF NEURAL NETWORK BASED ON THE WORD2VEC’S...
 

graduate_thesis (1)

  • 1. UNIVERSITY OF CHICAGO Topic Modeling and its relation to Word Embedding by Sihan Chen A thesis submitted in partial fulfillment for the degree of Master of Statistics in the Department of Statistics July 2016
  • 2. Declaration of Authorship I, Sihan Chen, declare that this thesis titled, ‘Topic Modeling and its relation to Word Embedding’ and the work presented in it are my own. I confirm that: This work was done wholly or mainly while in candidature for a research degree at this University. Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated. Where I have consulted the published work of others, this is always clearly at- tributed. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. I have acknowledged all main sources of help. Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself. Signed: Date: i
  • 3. UNIVERSITY OF CHICAGO Abstract Department of Statistics Master of Statistics by Sihan Chen Topic Modeling has been a popular area of research for more than a decade. Each vocabulary in a document is assigned to a topic, which is a distribution of words, and thus, each article becomes a mixture of different kinds of topics. As a result, we can group similar articles based on the topic assignment of their words. Even though a large number of models has been applied to topic modeling, I mainly focus on two methods of topic modeling: the Latent Dirichlet Allocation(LDA), and Von Mises-Fisher(vMF) Clustering. The LDA model is based on the method Variational Bayes; the vMF Cluster- ing, on the other hand, needs some extra information gained from Word Embedding. In this paper, I will compare these two methods through the topic words, topic coocurence words, and computing their Pointwise Mutual Information(PMI).
  • 4. Acknowledgements Thanks for advisor Professor John Lafferty, and his student Mahtiyar Bonakdarpour iii
  • 5. Contents Declaration of Authorship i Abstract ii Acknowledgements iii Abbreviations v Symbols vi 1 Introduction and Moltivation 1 2 Development of Topic Models 2 2.1 Mixture of Unigrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Probabilistic Latent Semantic Indexing . . . . . . . . . . . . . . . . . . . . 3 3 Latent Dirichlet Allocation 4 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Posterior Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.3 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 4 Von Mises-Fisher Clustering 10 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.2 Parameter Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 5 Experiment 13 5.1 Topic Assignment of Words . . . . . . . . . . . . . . . . . . . . . . . . . . 13 5.2 Word Cooccurence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5.3 PMI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 6 Conclusion 21 Bibliography 22 iv
  • 6. Abbreviations LDA Latent Dirichlet Allocation vMF Von Mises Fisher v
  • 7. Symbols β topic z topic assignment θ topic proportion K the number of topics k the topic subscript D the number of documents in the corpus d the document subscript N the number of words in the documents w the specific observed word vi
  • 8. Chapter 1 Introduction and Moltivation Topic Modeling has been a popular area of research for more than a decade. It can also be applied to many areas of industry, such as search engine, document classification, and etc. The topic has now been defined as the ”distribution of vocabulary”. Generally,such a distribution is discrete since the number of words in the vocabulary is finite. In some other case, however, topics can be continuous. When the words have been reprensented by word embedding vectors, topics can be defined as the distribution of these embedding vectors. Topic is the global variables that is defined on the corpus level Therefore, some extended concept, Topic Proportions, and Topic Assignment have been defined and parameterized in the topic model. Topic Proportions is the proportion of each topic in a specific document. Topic Assignment refers to the topic assigned to each word(Early topic model assigns topics to document level). After a word has been assigned to a topic, the probability of that word appearing within that topic will be maximized. Article can be classified according to the topic proportions of key words. What’s more, some new words that have never shown up before can be assigned to a topic according to the topic proportions of the document. This paper will introduce some basic topic models. It is mostly about the comparison of two topic models: Latent Dirichlet Allocation(LDA), and Von Mises-Fisher(vMF) Clustering. 1
  • 9. Chapter 2 Development of Topic Models The concept of ”topic” depends on the topic models. Its meaning keeps changing as topic models has been growing mature for decades. This claim can be proved by some early examples of topic models from below: 2.1 Mixture of Unigrams This is an early version of topic model. Each article has been assigned to only one topic. The distribution of each vocabulary depends on the topic assigned to the article. This is a huge progress from the unigram model, where the distribution of each vocabulary is fixed, independent of topics. The probability of the document is: p(w) = Σzp(z)ΠN n=1p(wn|z) This model suffers the problem of overfitting, in the sense that it does not account for the words never appear in the training set. The probability of the new word will be exceedingly low, and thus, the perplexity of of this model will explode. 2
  • 10. Symbols 3 2.2 Probabilistic Latent Semantic Indexing Probabilistic Latent Semantic Indexing(pLSI) is a topic model that assigns the topic to the level of each word. As a result, each article contains multiple topics. The probability of a words is defined to be: p(wn) = Σzp(wn|z)p(z|d) This model can assign a positive probability to an unseen word based on the proportion of topics within each document. However, this model does not solve the problem of overfitting, either. Since d can be as many as the number of document in the training set, the topic assignment procedure cannot learn anything that can be generalized to the set other than the training one. Thus, there is no standard way of assigning topics to the words among unseen documents. As a result, the number of parameters that needs to be estimated will increase linearly as the size of the document.
  • 11. Chapter 3 Latent Dirichlet Allocation 3.1 Introduction Latent Dirichlet Allocation(LDA) is a Bayesian Hierarchical Model that treats observed words as observed variables, and contains multiple latent variables: z-topic assignment, β-topic, and θ-topic proportion. β is a discrete distribution over fixed vocabulary; z is a multinomial distribution over the K topics; θ works as the latent variables of z, whose value serves as the probability of assigning a specific topic on z. Compared with the pLSI, LDA introduces a new latent variable for the topic assignment: by adding another K −1 parameters for the topic proportions, it can be applied to any new documents out of training set without introducing new parameters. In addition, to make the posterior inference work, LDA introduces the Dirichlet prior η for the latent variables β, and α for the latent variable θ. Let V be the size of vocabulary and K be the size of the topics. Then, the prior of β is V dimensional with variable η, denoted by DirV (η); the prior of θ is K dimensional with variable α, denoted by DirK(α). The generative process is the following: 1. Choose an value for the V-dim and K-dim prior parameter η, α. 2. For each topic k, Draw βk ∼ DirV (η) k = 1, · · · , K 3. For each document d, Draw θd ∼ DirK(α) d = 1, · · · , D 4
  • 12. Symbols 5 4. For each word n, (a) Draw zd,n ∼ MultK(θd) zd,n = 1, · · · , K n = 1, · · · , N (b) Draw wd,n ∼ MultV (βzd,n ) wd,n = 1, · · · , V n = 1, · · · , N From this process, we can see that there are three levels of variables: the corpus level, the document level, and the word level. The β is the corpus level variable, θ is the document level variable, and w, z are the word level variable. After this generating process, we need to find the posterior distribution level by level. 3.2 Posterior Distribution Finding the posterior distibution of latent variables is the essential step to draw any useful and practical conclusion from the document. After we know the distribution of θd for each article, we can classify the similar article based on the difference of θ. We can use the formula below to measure the document similarity(Blei and Lafferty 2009): document − similarityd,f = K k=1 ( ˆθd,k − ˆθf,k)2 Besides, knowing the value of βk for each topics can help us understand what each topic is about. By knowing the distribution of a specific βk, we are able to know the prob- ability of each word appearing in this topic, and thus, the most frequent words within a topic. This notion is very helpful for all kinds of search engine: it will show up the articles and website that are related to the key words, based on the frequency of this key words appearing in different kinds of topics, not only the articles and website that contain such key words. Most importantly, after we know the value of θ, and β, we are able to gain the prob- ability of topic assignment zd,n for a specific word within a document. Intuitively, the probability of assigning topic k to zd,n is proportional to the topic proportion θd for k, times probability of word wd,n appearing within topic k, which is the value for word wd,n
  • 13. Symbols 6 in βk. The relationship is the following: p(zn|z−n, w) = p(wn|β1:K) · θ p(zn|θ)p(θ|z−n)dθ However, we can see that it takes huge computation to get zn, since θ is K dimensional and we need to integrate K times to get the result. If we consider the joint distribution of βk, θd, zn, which is: p(β1:K, θ1:D, z1:N |w1:D,1:N ) = p(β1:K, θ1:D, z1:D, w1:D) β1:K θ1:D z1:D p(β1:K, θ1:D, z1:D, w1:D) It becomes impossible to compute. Therefore, we need to come up with some other method to estimate the latent variables z, β, and θ. 3.3 Variational Inference To solve the computational problem of standard method of posterior inference, we need to find some other way to approximate the distribution of latent variables. In this project, I choose the variational inference method. First of all, the variational parameters has to be introduced for the object variables β, θ, and z. Let λ, γ, and φ to be the parameters of β, θ, and z, respectively. Then, we can denote the variational distributin of β, θ, and z to be q(β|λ), q(θ|γ), and q(z|φ). Therefore, we have the following relationship: q(β|λ) = DirV (λ); q(θ|γ) = DirK(γ); q(z|φ) = Mult(φ) As a result, the object three variables β, θ, and z become conditionally independent. Its joint distribution just becomes: q(β1:K, θ1:D, z1:D,1:N ) = K k=1 q(βk|λk) D d=1 (q(θd|γd) N n=1 q(zd,n|φd,n))
  • 14. Symbols 7 In order to make the joint distribution as close to the true distribution, we need to minimize the Kullback-Leibler(KL) distance by finding the optimal variational parame- ters(Blei and Lafferty 2009): argγ1:D,λ1:K ,φ1:D,1:N minKL(q(θ1:D, β1:K, z1:D,1:N )||p(θ1:D, β1:K, z1:D,1:N )) Now, we define the Evidence Lower Bound to be: L(w, φ, γ, λ) := Eq[logp(w, z, θ, β)] − Eq[logq(z, θ, β)] Since we have this relation: logp(w|α, η) = log θ β z p(θ, β, z, w|α, η)dθdβ = log θ β z p(θ, β, z, w|α, η)q(θ, β, z) q(θ, β, z) dθdβ ≥ θ β z q(θ, β, z)logp(θ, β, z, w|α, η)dθdβ − θ β z q(θ, β, z)logq(θ, β, z)dθdβ = Eqlogp(θ, β, z, w|α, η) − Eqlogq(θ, β, z) = L(w, φ, γ, λ) and: KL(q(θ.β, z)||p(θ.β, z)) = θ β z q(θ, β, z)log q(θ, β, z) p(θ, β, z) dθdβ = θ β z q(θ, β, z)logq(θ, β, z)dθdβ − θ β z q(θ, β, z)logp(θ, β, z)dθdβ = Eqlogq(θ, β, z) − Eqlogp(θ, β, z) Therefore, we see that: L(w, φ, γ, λ)+KL(q(θ.β, z)||p(θ.β, z)) = Eqlogp(θ, β, z, w|α, η)−Eqlogp(θ, β, z) = logp(w|α, η). Thus, maximizing L(w, φ, γ, λ) is equivalent to minimizing the KL distance.
  • 15. Symbols 8 Now, let us factorize the equation L(w, φ, γ, λ) in the form (Hoffman et all 2010): L(w, φ, γ, λ) = d {Eq[logp(wd|θd, zd, β)] + Eq[logp(zd|θd)] − Eq[logq(zd)] + Eq[logp(θd|α)] − Eq[logq(θd)] − (Eq[logp(β|η)] + Eq[logq(β)])/D} = d w Ndw k φdwk(Eq[logθdk] + Eq[logβkw] − logφdwk) − logΓ( k γdk) + k (α − γdk)Eq[logθdk] + logΓ(λdk) + ( k −logΓ( w λkw) + w (η − λkw)Eq[logβkw] + logΓ(λkw))/D = logΓ(Kα) − KlogΓ(α) + (logΓ(Wη) − WlogΓ(η))/D := d l(Nd, φd, λd, λ) Taking the derivative with respect to φdwk, γdk, and λkw, and set them to 0, we get the following update relation for these parameters: φdwk ∝ exp{Eq[log(θdk)]+Eq[logβkw]} γdk = α+ w Ndwφdwk λkw = η+ d Ndwφwdk It is tempting to think about update φdwk, γdk, and λkw for each document d. However, since the number of ducument is huge online, this update still takes much computations. There is another algorithm solving this problem by considering all documents as a single document. Then, it simply repeats D times, where each time, the number of counts of words are the same. Here is the following algorithms(Hoffman et all 2010): OnlineLDA: Define ρt = (τ0 + t)−κ Initialize λ randomly. for t= 0 to D do: E step: initialize γtk = 1. (The constant 1 is arbitrary). repeat Set φtwk ∝ exp{Eq[logθtk] + Eq[logβkw]} Set γtk = α + w φtwkNtw
  • 16. Symbols 9 until 1 K k change inγtk < 0.00001 M step: Compute ˜λkw = η + DNtwφtwk Set λ = (1 − ρt)λ + ρt ˜λ end for Therefore, the number of times word w appearing in the combined document, Ntw, is the same for all t. After we obtain the topic parameters λkw, we start to find the other document-specific parameters γdk, and φdwk on the test set. We use the following algorithm(Hoffman et all 2010): Batch LDA: Choose the λ we obtain from OnlineLDA while relative improvement in L(w, φ, γ, λ) > 0.00001 do for d= 0 to D do: Initialize γtk = 1. (The constant 1 is arbitrary). repeat Set φdwk ∝ exp{Eq[logθdk] + Eq[logβdw]} Set γdk = α + w φdwkNdw until 1 K k change inγdk < 0.00001 end for end while After we get these latent variational parameters, we are able to approximate the la- tent variables β, θ, and z. Now we are able to assign topics on words in the document.
  • 17. Chapter 4 Von Mises-Fisher Clustering 4.1 Introduction As a topic model, Latent Dirichlet Allocation model does not take into account the semantic meaning of each vocabulary. To solve this problem, a new model called Von Mises-Fisher(vMF) Clustering has been used to do topic modeling. This model is actu- ally a generalized LDA model, where the topic becomes a continuous distribution, while the topic proportions are still drawn from the Von Mises-Fisher(vMF) Clustering. Each topic is defined to be a point on a generalized unit sphere. For topic k, it chooses µk as a center; then it chooses a variance parameter so that each word , assigned topic k, is generated by a Von Mises-Fisher distribution(vMF) with dispersion parameters κk, and center µk. The µk also has a vMF prior distribution with parameters µ0, and C0. The distribution of xdn given µk, and κk is given by: f(xdn; µk, κk) = exp(κkµk t xdn)CM (κk) where M is the dimension of the word embedding vector, and CM (κk) is the normalizing constant, which equals to: κ0.5D−1 k (2π)0.5DI0.5D−1(κk) , and Iv(α) is the modified Bessel function. The Topic proportions parameter θ is still drawn from the same process as LDA. We 10
We get the following generative process for vMF Clustering (Gopal et al., 2014):

    θ_d  ~ Dirichlet(α)              d = 1, ..., D
    κ_k  ~ log-Normal(m, σ²)         k = 1, ..., K
    µ_k  ~ vMF(µ_0, C_0)             k = 1, ..., K
    z_dn ~ Mult(θ_d)                 d = 1, ..., D,  n = 1, ..., N
    x_dn ~ vMF(µ_{z_dn}, κ_{z_dn})   d = 1, ..., D,  n = 1, ..., N

4.2 Parameter Inference

Just as in the LDA method, it is impossible to directly compute the posterior distribution of the latent variables from the formula

    p(z_1:D,1:N, θ_1:D, µ_1:K, κ_1:K | w_1:D) = p(z_1:D,1:N, θ_1:D, µ_1:K, κ_1:K, w_1:D) / ∫_κ ∫_µ ∫_θ Σ_z p(z_1:D,1:N, θ_1:D, µ_1:K, κ_1:K, w_1:D),

because the normalizing constant is intractable. One way to approximate the posterior distribution is variational inference. Again, we denote the variational distribution by q. First, we introduce variational parameters for each latent variable in order to break the dependencies between these latent variables:

    q(z, θ, µ, κ) = q(z) q(θ) q(µ) q(κ)

For the variables z, θ, and µ, we introduce the variational parameters λ, ρ, ψ, and γ such that (Gopal et al., 2014):

    q(z_dn) = Mult(z_dn | λ_dn)     d = 1, ..., D,  n = 1, ..., N
    q(θ_d)  = Dir(θ_d | ρ_d)        d = 1, ..., D
    q(µ_k)  = vMF(µ_k | ψ_k, γ_k)   k = 1, ..., K
The update equations for these variational parameters are the following (Gopal et al., 2014):

    R_k = E_q[κ_k] Σ_{n=1}^{N} E_q[z_nk] x_n + C_0 µ_0
    ψ_k = R_k / ||R_k||
    γ_k = ||R_k||
    λ_dnk ∝ exp{ E_q[log vMF(x_dn | µ_k, κ_k)] + E_q[log θ_dk] }
    E_q[log vMF(x_n | µ_k, κ_k)] = E_q[log C_M(κ_k)] + E_q[κ_k] x_n^T E_q[µ_k]

For the concentration parameter κ_k, a closed-form variational update is not available because there is no conjugate prior for the log-Normal distribution. Thus, we approximate its posterior by a sampling method. However, a sampling method needs samples generated from the true conditional distribution, while only the variational posteriors of the other variables are available. We can approximate this conditional distribution by taking the expectation under the variational posterior (Gopal et al., 2014):

    P(κ_k | X, m, σ², µ_0, C_0, α) ∝ P(κ_k, X | m, σ², µ_0, C_0, α)
                                   ≈ E_q[ P(κ_k, X, Z, µ | m, σ², µ_0, C_0, α) ]
                                   ≥ exp{ E_q[ log P(κ_k, X, Z, µ | m, σ², µ_0, C_0, α) ] }
                                   ∝ exp{ Σ_{n=1}^{N} E_q[z_nk] log C_M(κ_k) + κ_k Σ_{n=1}^{N} E_q[z_nk] x_n^T E_q[µ_k] } × log-Normal(κ_k | m, σ²)    (4.1)

Therefore, we can use Metropolis-Hastings sampling with the log-Normal distribution as the proposal distribution.
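For concreteness, here is a hedged sketch of such a Metropolis-Hastings step for a single κ_k, using a random walk on log κ (equivalently, a log-Normal proposal centered at the current value). It is not the thesis implementation; the inputs n_k = Σ_n E_q[z_nk] and r_k = Σ_n E_q[z_nk] x_n^T E_q[µ_k] are assumed to come from the surrounding variational updates, and the numbers in the example are made up.

    import numpy as np
    from scipy.special import ive

    def log_c_m(kappa, M):
        """Log of the vMF normalizing constant C_M(kappa)."""
        nu = M / 2.0 - 1.0
        return nu * np.log(kappa) - (M / 2.0) * np.log(2.0 * np.pi) - (np.log(ive(nu, kappa)) + kappa)

    def log_target(kappa, n_k, r_k, M, m, sigma2):
        # Unnormalized log of Eq. (4.1): n_k * log C_M(kappa) + kappa * r_k + log-Normal prior (up to a constant).
        log_prior = -np.log(kappa) - (np.log(kappa) - m) ** 2 / (2.0 * sigma2)
        return n_k * log_c_m(kappa, M) + kappa * r_k + log_prior

    def sample_kappa(kappa0, n_k, r_k, M, m, sigma2, n_steps=200, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        kappa = kappa0
        for _ in range(n_steps):
            kappa_new = np.exp(np.log(kappa) + step * rng.normal())
            # Random walk on log(kappa); the change of variables contributes log(kappa_new / kappa).
            log_accept = (log_target(kappa_new, n_k, r_k, M, m, sigma2)
                          - log_target(kappa, n_k, r_k, M, m, sigma2)
                          + np.log(kappa_new) - np.log(kappa))
            if np.log(rng.uniform()) < log_accept:
                kappa = kappa_new
        return kappa

    # Made-up inputs for illustration only.
    print(sample_kappa(kappa0=10.0, n_k=200.0, r_k=150.0, M=50, m=2.0, sigma2=1.0))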
Chapter 5

Experiment

Before we do anything on the test set, we train on an online Wikipedia corpus to obtain the values of the global topic variables. For the LDA method, we update λ, the variational parameter of β, through the training process. For the vMF Clustering method, we obtain the word embedding vectors from the online Wikipedia corpus. However, the variables µ_k and κ_k, which determine the distribution of topic k, are not trained online beforehand. Instead, we estimate µ_k and κ_k on the test set. Combined with the topic distributions, the topic assignment of each test document can then be obtained using the topic proportions of each article. All of the results shown below were obtained after the topic assignment was completed.

5.1 Topic Assignment of Words

For each topic, we count the number of times each word has been assigned to that topic. We choose the number of topics to be ten and obtain the following lists of the words assigned to each topic. Among all documents in the test set, the words appearing in the lists have the highest counts in each topic. Notice that such a list is not a topic itself. Instead, it is just the topic assignment over all articles in the test set, and these articles contain a lot of "junk words" with no semantic meaning, e.g., "www", "http", and "ref".
LDA

Topic 1       Topic 2       Topic 3       Topic 4       Topic 5
http          ref           http          ref           http
ref           www           bank          france        date
www           date          ref           code          ref
date          jordan        world         date          www
german        time          case          language      october
july          key           category      game          november
band          work          goals         work          band
bank          american      language      record        page
letter        people        london        england       ing
case          programming   date          national      year

Topic 6       Topic 7       Topic 8       Topic 9       Topic 10
ref           http          ref           ref           ref
www           ref           http          http          www
http          www           www           code          world
june          date          germany       date          http
bank          world         date          world         states
year          swimming      year          swimming      america
file          year          world         letter        james
music         key           work          language      year
link          national      book          time          key
july          case          band          key           case
vMF Clustering

Topic 1        Topic 2        Topic 3        Topic 4        Topic 5
letter         high           date           history        swimming
record         author         power          year           text
game           book           data           time           article
player         james          object         states         press
final          david          convert        june           journal
top            award          support        april          version
tour           letter         high           july           law
university     english        model          march          life
gold           john           energy         october        science
major

Topic 6        Topic 7        Topic 8        Topic 9        Topic 10
case           key            code           bank           ref
category       work           language       national       http
left           band           list           main           www
number         music          programming    file           pdf
type           group          including      international  retrived
single         style          small          link           index
set            led            company        global         net
results        song           languages      center         ing
individual     made           class          financial      external
called         records        format         source         em

Comparing the tables from the two methods, we see that some "junk words" show up many times across different columns for the LDA method. This is because there are many "junk words" in the test articles. In the vMF Clustering method, by contrast, a word rarely shows up in more than one column. We can conclude that in the vMF Clustering method, each word tends to be assigned to a single topic regardless of context, while in LDA, each word can be assigned to different topics depending on the document it belongs to. Therefore, we can say that for the vMF method, the topic assignment of a word depends on the semantics of the word itself.
From these two tables, we see that the vMF Clustering method performs better in one respect: it handles the large proportion of "junk words" by assigning almost all of them to a single topic.

5.2 Word Cooccurrence

In this context, a word cooccurrence is defined as two adjacent words in a document that are assigned the same topic. Word cooccurrences characterize a topic better than single words, because a word can have different meanings under different topics, and two words occurring together take the context into account. The following two tables list the most frequent cooccurring word pairs within each topic for the LDA and vMF methods, respectively. Here is the result from the LDA method:
Topic 1:  (language,german), (main,page), (research,institute), (mail,date), (single,games), (record,high), (institute,date), (stock,exchange), (short,forms)
Topic 2:  (michael,jordan), (iso,iso), (years,years), (science,recursion), (work,time), (santa,cruz), (european,descent), (john,hughes), (link,date)
Topic 3:  (http,www), (player,award), (scoring,goals), (score,goals), (key,case), (system,cross), (english,language), (mathematics,mathematics), (iso,alpha)
Topic 4:  (date,november), (letter,case), (state,single), (top,level), (num,date), (flash,flash), (holding,company), (page,base), (mixed,types)
Topic 5:  (world,economic), (australia,years), (credit,card), (return,return), (volume,issue), (object,oriented), (fat,cat), (deal,location), (variables,variables)
Topic 6:  (retrieved,june), (function,computing), (science,data), (appearance,record), (classical,music), (power,forward), (hit,single), (history,formation), (million,copies)
Topic 7:  (http,www), (literature,literature), (research,info), (run,time), (land,area), (page,base), (string,format), (line,comments), (typically,occurs)
Topic 8:  (date,janurary), (july,date), (june,area), (year,swimming), (york,stock), (east,germany), (january,established), (computer,graphics), (computer,science)
Topic 9:  (programming,language), (key,case), (letter,case), (time,list), (function,parameter), (times,guide), (prototype,based), (based,programming), (computer,science)
Topic 10: (year,population), (http,www), (programming,language), (key,case), (time,top), (august,image), (date,release), (population,density), (letter,case)
Here is the result from vMF Clustering:

Topic 1:  (author,stephen), (stephen,thomas), (bill,russell), (chris,martin), (van,der), (literature,literature), (harris,john), (author,fox), (author,chan)
Topic 2:  (power,parity), (array,data), (system,static), (increase,decrease), (rate,spikes), (growth,rate), (air,force), (high,rate), (high,voltage)
Topic 3:  (date,january), (date,april), (date,october), (date,november), (date,august), (largest,city), (year,population), (date,december), (london,england)
Topic 4:  (volume,issue), (issue,pages), (science,nature), (review,vol), (volume,page), (history,background), (journal,page), (issue,page), (detail,article)
Topic 5:  (credit,card), (university,students), (final,game), (scoring,goals), (score,goals), (player,diego), (width,diego), (training,session), (semi,final)
Topic 6:  (left,upright), (color,black), (function,parameter), (upright,left), (symbol,type), (upper,lip), (considered,integral), (lowest,point), (lower,alpha)
Topic 7:  (music,heavy), (song,records), (recording,sessions), (classical,music), (pop,music), (making,music), (deal,work), (performance,style), (electronic,music)
Topic 8:  (programming,language), (class,file), (language,languages), (class,computer), (stock,exchange), (computer,programming), (iso,iso), (based,programming), (prototype,based)
Topic 9:  (development,report), (human,development), (bank,cooperate), (economic,community), (research,institute), (location,map), (trade,organization), (management,financial), (national,bank)
Topic 10: (http,www), (index,index), (em,external), (proc,arg), (r,pdf), (pdf,pdf), (uni,index), (dec,ai), (comp,index)
Though the two tables show different word pairs, we see that both methods capture pairs that are representative of a topic. For the LDA method, the "junk words" do not show up as frequently as in the earlier single-word tables. Therefore, we can conclude that the cooccurrence tables represent the topics better than the single-word tables.

5.3 PMI

Our final criterion for evaluating a topic modeling method is Pointwise Mutual Information (PMI), defined by the following formula:

    PMI_k(w_i, w_j) = log [ p_k(w_i, w_j) / ( p_k(w_i) p_k(w_j) ) ]

Here p_k(w_i, w_j) is the probability that the two words cooccur within topic k, and p_k(w_i) and p_k(w_j) are the probabilities that the words w_i and w_j occur in topic k. The higher the PMI, the better the topic model performs, because a higher PMI means that two adjacent words are more likely to be assigned to the same topic. Here are the ten highest PMI values for each topic under the LDA and vMF Clustering methods:

LDA
Topic 1:  9.14  8.74  8.66  8.04  7.94  7.53  7.31  7.10  7.06  7.00
Topic 2:  9.87  9.87  9.87  9.17  8.48  8.48  8.48  8.22  8.07  8.07
Topic 3:  9.20  9.20  9.01  8.79  8.28  7.94  7.92  7.81  7.81  7.73
Topic 4:  9.59  9.59  8.49  8.49  7.98  7.98  7.80  7.80  7.51  7.47
Topic 5:  9.16  8.88  8.84  8.47  8.25  8.25  7.46  7.29  7.22  7.07
Topic 6:  8.48  8.48  8.37  8.19  7.90  7.78  7.78  7.60  7.56  7.50
Topic 7:  9.62  9.03  9.03  8.74  8.63  8.11  8.11  7.93  7.83  7.79
Topic 8:  8.84  8.62  8.44  8.25  8.15  8.07  7.93  7.63  7.34  7.31
Topic 9:  9.66  9.01  8.60  7.95  7.65  7.56  7.46  7.30  7.09  7.08
Topic 10: 10.5  9.81  9.00  8.47  7.76  7.68  7.61  7.61  7.47  7.00

vMF Clustering

Topic 1:  7.44  7.06  5.43  5.10  4.67  4.42  4.40  4.16  4.16  4.05
Topic 2:  6.92  6.71  6.64  5.87  5.82  5.80  5.78  5.72  5.72  5.54
Topic 3:  5.62  4.57  4.47  4.45  4.31  3.58  3.55  3.47  3.31  3.22
Topic 4:  5.68  5.47  5.10  5.06  4.20  4.15  3.87  3.36  3.36  3.35
Topic 5:  5.27  4.35  4.30  4.27  4.13  4.06  3.86  3.79  3.33  3.23
Topic 6:  8.86  7.08  6.25  6.18  6.15  6.03  5.96  5.82  5.51  5.30
Topic 7:  6.61  5.45  5.32  4.62  4.56  4.52  4.36  4.24  4.11  4.07
Topic 8:  5.29  5.06  4.46  4.46  4.43  4.42  4.36  4.08  4.04  3.94
Topic 9:  5.58  5.10  4.63  4.19  4.03  3.94  3.68  3.62  3.59  3.48
Topic 10: 9.02  8.45  7.75  7.58  5.33  4.99  4.83  4.15  3.91  3.68

From these two tables, we see that the PMI values of LDA are higher than those of vMF Clustering. This shows that LDA performs better than vMF Clustering in terms of the PMI score.
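For reference, here is a minimal sketch of how per-topic PMI scores of this kind could be computed from a topic assignment. It is not the thesis pipeline; the input format (a token sequence with one topic id per token) and the toy example are assumptions for illustration.

    from collections import Counter
    from math import log

    def topic_pmi(tokens, assignments, k):
        """PMI_k(w_i, w_j) for adjacent token pairs whose tokens are both assigned to topic k."""
        words = [w for w, z in zip(tokens, assignments) if z == k]
        pairs = [(w1, w2)
                 for (w1, z1), (w2, z2) in zip(zip(tokens, assignments),
                                               zip(tokens[1:], assignments[1:]))
                 if z1 == k and z2 == k]
        word_counts, pair_counts = Counter(words), Counter(pairs)
        if not pair_counts:
            return {}
        n_w, n_p = sum(word_counts.values()), sum(pair_counts.values())
        # PMI_k(w_i, w_j) = log( p_k(w_i, w_j) / (p_k(w_i) * p_k(w_j)) )
        return {(wi, wj): log((c / n_p) / ((word_counts[wi] / n_w) * (word_counts[wj] / n_w)))
                for (wi, wj), c in pair_counts.items()}

    # Toy example: top PMI pairs for topic 0 of a made-up assignment.
    tokens = ["credit", "card", "bank", "loan", "credit", "card", "music", "band"]
    topics = [0, 0, 0, 0, 0, 0, 1, 1]
    scores = topic_pmi(tokens, topics, k=0)
    print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])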
Chapter 6

Conclusion

LDA and vMF Clustering are two representative topic modeling methods. In terms of assigning a topic to a word, both methods take into account two factors: the topic distribution for a word and the document-specific topic proportions. Both use variational inference to approximate the distributions of the latent variables through online training. Compared with the LDA method, vMF Clustering takes the semantic meaning of each word into account, and therefore relies more on the word itself than on the context of a specific document when making a topic assignment. One thing we can say for sure is that vMF Clustering handles "junk words" better.
Bibliography

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.

D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009.

Siddharth Gopal and Yiming Yang. Von Mises-Fisher Clustering Models. Journal of Machine Learning Research, Workshop and Conference Proceedings, Vol. 32, 2014.

Matthew D. Hoffman, David M. Blei, and Francis Bach. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, 2010.