UNIVERSITY OF CHICAGO
Topic Modeling and its relation to Word
Embedding
by
Sihan Chen
A thesis submitted in partial fulfillment for the
degree of Master of Statistics
in the
Department of Statistics
July 2016
Declaration of Authorship
I, Sihan Chen, declare that this thesis titled, ‘Topic Modeling and its relation to
Word Embedding’ and the work presented in it are my own. I confirm that:
This work was done wholly or mainly while in candidature for a research degree
at this University.
Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
Where I have consulted the published work of others, this is always clearly attributed.
Where I have quoted from the work of others, the source is always given. With
the exception of such quotations, this thesis is entirely my own work.
I have acknowledged all main sources of help.
Where the thesis is based on work done by myself jointly with others, I have made
clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Abstract
Department of Statistics
Master of Statistics
by Sihan Chen
Topic Modeling has been a popular area of research for more than a decade. Each word in a document is assigned to a topic, which is a distribution over words, and thus each article becomes a mixture of different topics. As a result, we can group similar articles based on the topic assignments of their words. Even though a large number of models have been applied to topic modeling, I mainly focus on two methods: Latent Dirichlet Allocation (LDA) and Von Mises-Fisher (vMF) Clustering. The LDA model is based on the Variational Bayes method; vMF Clustering, on the other hand, needs extra information gained from Word Embedding. In this paper, I compare these two methods through their topic words and topic cooccurrence words, and by computing their Pointwise Mutual Information (PMI).
Symbols
β topic
z topic assignment
θ topic proportion
K the number of topics
k the topic subscript
D the number of documents in the corpus
d the document subscript
N the number of words in a document
w the specific observed word
Chapter 1
Introduction and Motivation
Topic Modeling has been a popular area of research for more than a decade. It can also be applied to many areas of industry, such as search engines and document classification. A topic is now defined as a "distribution over the vocabulary". Generally, such a distribution is discrete, since the number of words in the vocabulary is finite. In some other cases, however, topics can be continuous: when the words are represented by word embedding vectors, topics can be defined as distributions over these embedding vectors. A topic is a global variable defined at the corpus level.
Accordingly, two extended concepts, Topic Proportions and Topic Assignment, have been defined and parameterized in the topic model. Topic Proportions are the proportions of each topic in a specific document. Topic Assignment refers to the topic assigned to each word (early topic models assigned topics at the document level). After a word has been assigned to a topic, the probability of that word appearing within that topic is maximized. Articles can be classified according to the topic proportions of their key words. What's more, new words that have never appeared before can be assigned to a topic according to the topic proportions of the document.
This paper introduces some basic topic models. It is mostly a comparison of two topic models: Latent Dirichlet Allocation (LDA) and Von Mises-Fisher (vMF) Clustering.
Chapter 2
Development of Topic Models
The concept of "topic" depends on the topic model. Its meaning has kept changing as topic models have matured over the decades, as the following early examples of topic models illustrate:
2.1 Mixture of Unigrams
This is an early version of the topic model, in which each article is assigned to exactly one topic. The distribution of each word depends on the topic assigned to the article. This is a large improvement over the unigram model, where the distribution of each word is fixed, independent of topics. The probability of a document is:

p(w) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)

This model suffers from overfitting, in the sense that it does not account for words that never appear in the training set. The probability of a new word will be exceedingly low, and thus the perplexity of this model explodes.
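The document probability above can be evaluated directly for a toy model. A minimal sketch in Python (the function name and the toy numbers are illustrative, not from the experiments in this thesis), working in log space to avoid underflow on long documents:

```python
import math

def mixture_of_unigrams_log_prob(doc, topic_probs, word_probs):
    """Log-probability of a document under a mixture of unigrams.

    doc         -- list of word ids
    topic_probs -- p(z): prior over topics, length K
    word_probs  -- p(w|z): K lists, each a distribution over the vocabulary
    """
    # p(w) = sum_z p(z) * prod_n p(w_n | z), summed over topics in log space
    per_topic = []
    for pz, pw in zip(topic_probs, word_probs):
        per_topic.append(math.log(pz) + sum(math.log(pw[w]) for w in doc))
    m = max(per_topic)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in per_topic))
```

For a two-topic model with p(z) = (0.5, 0.5) and per-topic word distributions (0.9, 0.1) and (0.2, 0.8), the document consisting of word 0 twice has probability 0.5 · 0.81 + 0.5 · 0.04 = 0.425, which the function reproduces.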
2.2 Probabilistic Latent Semantic Indexing
Probabilistic Latent Semantic Indexing (pLSI) is a topic model that assigns topics at the level of each word. As a result, each article contains multiple topics. The probability of a word is defined to be:

p(w_n) = \sum_{z} p(w_n \mid z)\, p(z \mid d)

This model can assign a positive probability to an unseen word based on the proportion of topics within each document. However, it does not solve the overfitting problem either. Since the document index d ranges over every document in the training set, the topic assignment procedure learns nothing that generalizes beyond the training set, so there is no standard way of assigning topics to the words of unseen documents. Moreover, the number of parameters to estimate grows linearly with the size of the corpus.
Chapter 3
Latent Dirichlet Allocation
3.1 Introduction
Latent Dirichlet Allocation (LDA) is a Bayesian hierarchical model that treats the observed words as observed variables and contains multiple latent variables: the topic assignment z, the topics β, and the topic proportions θ. Each β is a discrete distribution over a fixed vocabulary; z is a multinomial random variable over the K topics; θ serves as the latent variable governing z, whose value gives the probability of assigning each topic to z. Compared with pLSI, LDA introduces a new latent variable for the topic assignment: by adding only K − 1 parameters for the topic proportions, it can be applied to any new document outside the training set without introducing new parameters.
In addition, to make posterior inference work, LDA introduces Dirichlet priors: η for the latent variable β and α for the latent variable θ. Let V be the size of the vocabulary and K the number of topics. Then the prior of each β_k is V-dimensional with parameter η, denoted Dir_V(η); the prior of each θ_d is K-dimensional with parameter α, denoted Dir_K(α). The generative process is the following:
1. Choose values for the prior parameters η (V-dimensional) and α (K-dimensional).
2. For each topic k = 1, ..., K, draw β_k ∼ Dir_V(η).
3. For each document d = 1, ..., D, draw θ_d ∼ Dir_K(α).
4. For each word n = 1, ..., N in document d:
(a) Draw z_{d,n} ∼ Mult_K(θ_d), with z_{d,n} ∈ {1, ..., K}.
(b) Draw w_{d,n} ∼ Mult_V(β_{z_{d,n}}), with w_{d,n} ∈ {1, ..., V}.
From this process, we can see that there are three levels of variables: the corpus level, the document level, and the word level. β is the corpus-level variable, θ is the document-level variable, and w and z are the word-level variables. After this generative process, we need to find the posterior distribution level by level.
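The generative process above can be simulated for a toy corpus. A minimal sketch (all function names and parameter values are illustrative assumptions), using only the standard library's gamma sampler to draw from a symmetric Dirichlet:

```python
import random

def sample_dirichlet(alpha, dim):
    """Draw from a symmetric Dirichlet by normalizing Gamma(alpha, 1) draws."""
    g = [random.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def sample_categorical(probs):
    """Draw an index according to the given probability vector."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(D, N, K, V, alpha=0.5, eta=0.1, seed=0):
    """Generate a toy corpus by the LDA generative process above."""
    random.seed(seed)
    beta = [sample_dirichlet(eta, V) for _ in range(K)]   # topics (corpus level)
    docs = []
    for _ in range(D):
        theta = sample_dirichlet(alpha, K)                # topic proportions
        words = []
        for _ in range(N):
            z = sample_categorical(theta)                 # topic assignment
            words.append(sample_categorical(beta[z]))     # observed word
        docs.append(words)
    return beta, docs
```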
3.2 Posterior Distribution
Finding the posterior distribution of the latent variables is the essential step for drawing any useful and practical conclusions from the documents.
Once we know the distribution of θ_d for each article, we can group similar articles based on the differences among the θ's. We can use the formula below to measure document similarity (Blei and Lafferty 2009):

\text{document-similarity}_{d,f} = \sum_{k=1}^{K} (\hat{\theta}_{d,k} - \hat{\theta}_{f,k})^2
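This similarity measure is just the squared Euclidean distance between the estimated topic-proportion vectors; a minimal sketch:

```python
def document_similarity(theta_d, theta_f):
    """Squared Euclidean distance between two documents' estimated
    topic proportions (smaller means more similar)."""
    return sum((a - b) ** 2 for a, b in zip(theta_d, theta_f))
```

Two documents with identical topic proportions have distance 0; two documents concentrated entirely on different topics have the maximal distance of 2.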
Besides, knowing the value of β_k for each topic helps us understand what each topic is about. Knowing the distribution of a specific β_k, we know the probability of each word appearing in that topic, and thus the most frequent words within a topic. This notion is very helpful for search engines: based on how frequently a key word appears in different topics, a search engine can return articles and websites that are related to the key word, not only those that literally contain it.
Most importantly, once we know the values of θ and β, we can obtain the probability of the topic assignment z_{d,n} for a specific word within a document. Intuitively, the probability of assigning topic k to z_{d,n} is proportional to the topic proportion θ_d for k, times the probability of word w_{d,n} appearing within topic k, which is the entry for word w_{d,n} in β_k. The relationship is the following:

p(z_n \mid z_{-n}, w) = p(w_n \mid \beta_{1:K}) \cdot \int_{\theta} p(z_n \mid \theta)\, p(\theta \mid z_{-n})\, d\theta
However, computing z_n this way is expensive, since θ is K-dimensional and we need a K-dimensional integral to get the result. If we consider the joint posterior of β_k, θ_d, and z_n, which is

p(\beta_{1:K}, \theta_{1:D}, z_{1:D,1:N} \mid w_{1:D,1:N}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D,1:N}, w_{1:D,1:N})}{\int_{\beta_{1:K}} \int_{\theta_{1:D}} \sum_{z_{1:D,1:N}} p(\beta_{1:K}, \theta_{1:D}, z_{1:D,1:N}, w_{1:D,1:N})}

it becomes intractable, because of the normalizing integral in the denominator. Therefore, we need some other method to estimate the latent variables z, β, and θ.
3.3 Variational Inference
To avoid the computational problem of exact posterior inference, we need some other way to approximate the distribution of the latent variables. In this project, I choose the variational inference method.
First of all, variational parameters have to be introduced for the target variables β, θ, and z. Let λ, γ, and φ be the parameters for β, θ, and z, respectively. Then we can denote the variational distributions of β, θ, and z by q(β|λ), q(θ|γ), and q(z|φ), with:

q(\beta \mid \lambda) = \mathrm{Dir}_V(\lambda); \quad q(\theta \mid \gamma) = \mathrm{Dir}_K(\gamma); \quad q(z \mid \phi) = \mathrm{Mult}(\phi)
As a result, the three target variables β, θ, and z become mutually independent under q, and their joint distribution factorizes:

q(\beta_{1:K}, \theta_{1:D}, z_{1:D,1:N}) = \prod_{k=1}^{K} q(\beta_k \mid \lambda_k) \prod_{d=1}^{D} \left( q(\theta_d \mid \gamma_d) \prod_{n=1}^{N} q(z_{d,n} \mid \phi_{d,n}) \right)
In order to make the variational distribution as close as possible to the true posterior, we minimize the Kullback-Leibler (KL) divergence over the variational parameters (Blei and Lafferty 2009):

\arg\min_{\gamma_{1:D},\, \lambda_{1:K},\, \phi_{1:D,1:N}} \mathrm{KL}\big(q(\theta_{1:D}, \beta_{1:K}, z_{1:D,1:N}) \,\|\, p(\theta_{1:D}, \beta_{1:K}, z_{1:D,1:N} \mid w)\big)

Now, we define the Evidence Lower Bound (ELBO) to be:

L(w, \phi, \gamma, \lambda) := E_q[\log p(w, z, \theta, \beta)] - E_q[\log q(z, \theta, \beta)]
Since we have this relation (the inequality follows from Jensen's inequality):

\log p(w \mid \alpha, \eta) = \log \int_{\theta} \int_{\beta} \sum_{z} p(\theta, \beta, z, w \mid \alpha, \eta)\, d\theta\, d\beta
= \log \int_{\theta} \int_{\beta} \sum_{z} \frac{p(\theta, \beta, z, w \mid \alpha, \eta)}{q(\theta, \beta, z)}\, q(\theta, \beta, z)\, d\theta\, d\beta
\geq \int_{\theta} \int_{\beta} \sum_{z} q(\theta, \beta, z) \log p(\theta, \beta, z, w \mid \alpha, \eta)\, d\theta\, d\beta - \int_{\theta} \int_{\beta} \sum_{z} q(\theta, \beta, z) \log q(\theta, \beta, z)\, d\theta\, d\beta
= E_q[\log p(\theta, \beta, z, w \mid \alpha, \eta)] - E_q[\log q(\theta, \beta, z)]
= L(w, \phi, \gamma, \lambda)
and:

\mathrm{KL}\big(q(\theta, \beta, z) \,\|\, p(\theta, \beta, z \mid w)\big) = \int_{\theta} \int_{\beta} \sum_{z} q(\theta, \beta, z) \log \frac{q(\theta, \beta, z)}{p(\theta, \beta, z \mid w)}\, d\theta\, d\beta
= E_q[\log q(\theta, \beta, z)] - E_q[\log p(\theta, \beta, z \mid w)]

we see that:

L(w, \phi, \gamma, \lambda) + \mathrm{KL}\big(q(\theta, \beta, z) \,\|\, p(\theta, \beta, z \mid w)\big) = E_q[\log p(\theta, \beta, z, w \mid \alpha, \eta)] - E_q[\log p(\theta, \beta, z \mid w)] = \log p(w \mid \alpha, \eta)

Thus, since log p(w | α, η) does not depend on the variational parameters, maximizing L(w, φ, γ, λ) is equivalent to minimizing the KL divergence.
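The identity L + KL = log p(w) can be checked numerically on a toy model with a single discrete latent variable (the numbers below are arbitrary, chosen only for illustration):

```python
import math

# Toy model: one discrete latent z in {0, 1}, one observed outcome w.
p_z = [0.3, 0.7]                       # prior p(z)
p_w_given_z = [0.9, 0.2]               # p(w = observed value | z)
q_z = [0.6, 0.4]                       # an arbitrary variational q(z)

p_joint = [pz * pw for pz, pw in zip(p_z, p_w_given_z)]   # p(z, w)
log_evidence = math.log(sum(p_joint))                     # log p(w)
p_post = [pj / sum(p_joint) for pj in p_joint]            # posterior p(z | w)

# ELBO = E_q[log p(z, w)] - E_q[log q(z)]
elbo = sum(q * (math.log(pj) - math.log(q)) for q, pj in zip(q_z, p_joint))
# KL(q(z) || p(z | w))
kl = sum(q * (math.log(q) - math.log(pp)) for q, pp in zip(q_z, p_post))
```

Here `elbo + kl` equals `log_evidence` up to floating-point error, for any choice of q, which is exactly the relation derived above.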
Now, let us factorize L(w, φ, γ, λ) into per-document terms (Hoffman et al. 2010):

L(w, \phi, \gamma, \lambda) = \sum_d \Big\{ E_q[\log p(w_d \mid \theta_d, z_d, \beta)] + E_q[\log p(z_d \mid \theta_d)] - E_q[\log q(z_d)] + E_q[\log p(\theta_d \mid \alpha)] - E_q[\log q(\theta_d)] + \big(E_q[\log p(\beta \mid \eta)] - E_q[\log q(\beta)]\big)/D \Big\}
= \sum_d \Big\{ \sum_w N_{dw} \sum_k \phi_{dwk} \big(E_q[\log \theta_{dk}] + E_q[\log \beta_{kw}] - \log \phi_{dwk}\big)
\quad - \log \Gamma\Big(\sum_k \gamma_{dk}\Big) + \sum_k (\alpha - \gamma_{dk})\, E_q[\log \theta_{dk}] + \sum_k \log \Gamma(\gamma_{dk})
\quad + \Big(\sum_k \Big[ -\log \Gamma\Big(\sum_w \lambda_{kw}\Big) + \sum_w (\eta - \lambda_{kw})\, E_q[\log \beta_{kw}] + \sum_w \log \Gamma(\lambda_{kw}) \Big]\Big)/D
\quad + \log \Gamma(K\alpha) - K \log \Gamma(\alpha) + \big(\log \Gamma(W\eta) - W \log \Gamma(\eta)\big)/D \Big\}
:= \sum_d \ell(N_d, \phi_d, \gamma_d, \lambda)
Taking the derivatives with respect to φ_{dwk}, γ_{dk}, and λ_{kw} and setting them to zero, we get the following update equations for these parameters:

\phi_{dwk} \propto \exp\{E_q[\log \theta_{dk}] + E_q[\log \beta_{kw}]\}
\gamma_{dk} = \alpha + \sum_w N_{dw}\, \phi_{dwk}
\lambda_{kw} = \eta + \sum_d N_{dw}\, \phi_{dwk}
It is tempting to update φ_{dwk}, γ_{dk}, and λ_{kw} over every document at once. However, since the number of documents online is huge, this batch update is computationally expensive. Another algorithm solves this problem by treating each incoming document as if it were the whole corpus, repeated D times with the same word counts. The algorithm is the following (Hoffman et al. 2010):

OnlineLDA:
  Define ρ_t = (τ_0 + t)^(−κ)
  Initialize λ randomly.
  for t = 0 to D do:
    E step:
      Initialize γ_tk = 1. (The constant 1 is arbitrary.)
      repeat
        Set φ_twk ∝ exp{E_q[log θ_tk] + E_q[log β_kw]}
        Set γ_tk = α + Σ_w φ_twk N_tw
      until (1/K) Σ_k |change in γ_tk| < 0.00001
    M step:
      Compute λ̃_kw = η + D N_tw φ_twk
      Set λ = (1 − ρ_t) λ + ρ_t λ̃
  end for

In the combined-document view, the count N_tw of word w in document t stands in for the whole corpus, so N_tw is the same for each of the D imagined copies.
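A minimal Python sketch of one OnlineLDA iteration for a single document (a simplified reading of the algorithm above, not the reference implementation; the digamma approximation and all names are my own assumptions):

```python
import math

def digamma(x):
    """Asymptotic approximation of the digamma function (adequate here)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1.0/12 - f * (1.0/120 - f/252))

def online_lda_step(n_t, lam, alpha, eta, D, rho_t, tol=1e-5):
    """One OnlineLDA update for a single document t.

    n_t -- word counts for the document (dict: word id -> count)
    lam -- current variational topic parameter, K lists of length W
    Returns (gamma, lam) after the E and M steps.
    """
    K, W = len(lam), len(lam[0])
    # E[log beta_kw] under the current Dirichlet(lambda_k)
    Elogbeta = [[digamma(lam[k][w]) - digamma(sum(lam[k])) for w in range(W)]
                for k in range(K)]
    gamma = [1.0] * K                             # E step: gamma_tk = 1
    phi = {}
    while True:
        Elogtheta = [digamma(g) - digamma(sum(gamma)) for g in gamma]
        new_gamma = [alpha] * K
        for w, cnt in n_t.items():
            # phi_twk proportional to exp{E[log theta_tk] + E[log beta_kw]}
            raw = [math.exp(Elogtheta[k] + Elogbeta[k][w]) for k in range(K)]
            s = sum(raw)
            phi[w] = [r / s for r in raw]
            for k in range(K):
                new_gamma[k] += cnt * phi[w][k]   # gamma = alpha + sum_w phi*n
        mean_change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / K
        gamma = new_gamma
        if mean_change < tol:
            break
    # M step: lambda~_kw = eta + D*n_tw*phi_twk, blended with step size rho_t
    for k in range(K):
        for w in range(W):
            lam_tilde = eta + D * n_t.get(w, 0) * phi.get(w, [0.0] * K)[k]
            lam[k][w] = (1 - rho_t) * lam[k][w] + rho_t * lam_tilde
    return gamma, lam
```

The step size ρ_t decays over documents, so early noisy estimates of λ̃ are gradually averaged out.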
After obtaining the topic parameters λ_kw, we find the remaining document-specific parameters γ_dk and φ_dwk on the test set, using the following algorithm (Hoffman et al. 2010):

Batch LDA:
  Use the λ obtained from OnlineLDA.
  while relative improvement in L(w, φ, γ, λ) > 0.00001 do
    for d = 0 to D do:
      Initialize γ_dk = 1. (The constant 1 is arbitrary.)
      repeat
        Set φ_dwk ∝ exp{E_q[log θ_dk] + E_q[log β_kw]}
        Set γ_dk = α + Σ_w φ_dwk N_dw
      until (1/K) Σ_k |change in γ_dk| < 0.00001
    end for
  end while

Once we have these variational parameters, we can approximate the latent variables β, θ, and z, and assign topics to the words in each document.
Chapter 4
Von Mises-Fisher Clustering
4.1 Introduction
As a topic model, Latent Dirichlet Allocation does not take into account the semantic meaning of each word. To address this, a model called Von Mises-Fisher (vMF) Clustering has been used for topic modeling. This model is a generalization of LDA in which each topic becomes a continuous distribution, while the topic proportions are still drawn from a Dirichlet, as in LDA.
Each topic is defined by a point on a generalized unit sphere. Topic k has a center µ_k and a dispersion parameter κ_k, so that each word embedding vector assigned to topic k is generated by a Von Mises-Fisher (vMF) distribution with center µ_k and concentration κ_k. The center µ_k itself has a vMF prior with parameters µ_0 and C_0.
The distribution of x_dn given µ_k and κ_k is:

f(x_{dn}; \mu_k, \kappa_k) = C_M(\kappa_k)\, \exp(\kappa_k \mu_k^T x_{dn})

where M is the dimension of the word embedding vectors and C_M(κ_k) is the normalizing constant

C_M(\kappa_k) = \frac{\kappa_k^{M/2 - 1}}{(2\pi)^{M/2}\, I_{M/2-1}(\kappa_k)}

where I_v(·) is the modified Bessel function of the first kind.
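For M = 3 the normalizing constant has the closed form C_3(κ) = κ / (4π sinh κ), so the log-density can be evaluated without Bessel functions; a sketch for that special case (for general M one would evaluate I_{M/2−1}, e.g. with scipy.special.iv):

```python
import math

def vmf_log_density_3d(x, mu, kappa):
    """log f(x; mu, kappa) for the vMF distribution on the unit sphere in
    R^3 (M = 3), where C_3(kappa) = kappa / (4*pi*sinh(kappa)) in closed
    form.  x and mu are unit vectors of length 3; kappa > 0."""
    dot = sum(a * b for a, b in zip(x, mu))
    log_c = math.log(kappa) - math.log(4 * math.pi * math.sinh(kappa))
    return kappa * dot + log_c
```

As κ → 0 the density approaches the uniform density 1/(4π) on the sphere, and for κ > 0 the density is largest in the mean direction µ, which matches the role of κ as a concentration parameter.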
The topic proportions parameter θ is still drawn from the same process as in LDA. We get the following generative process for vMF clustering (Gopal and Yang 2014):

\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad d = 1, \dots, D
\kappa_k \sim \text{log-Normal}(m, \sigma^2), \quad k = 1, \dots, K
\mu_k \sim \mathrm{vMF}(\mu_0, C_0), \quad k = 1, \dots, K
z_{dn} \sim \mathrm{Mult}(\theta_d), \quad d = 1, \dots, D, \; n = 1, \dots, N
x_{dn} \sim \mathrm{vMF}(\mu_{z_{dn}}, \kappa_{z_{dn}}), \quad d = 1, \dots, D, \; n = 1, \dots, N
4.2 Parameter Inference
As with LDA, it is impossible to directly compute the posterior distribution of the latent variables:

p(z_{1:D,1:N}, \theta_{1:D}, \mu_{1:K}, \kappa_{1:K} \mid x_{1:D,1:N}) = \frac{p(z_{1:D,1:N}, \theta_{1:D}, \mu_{1:K}, \kappa_{1:K}, x_{1:D,1:N})}{\int_{\kappa_{1:K}} \int_{\mu_{1:K}} \int_{\theta_{1:D}} \sum_{z_{1:D,1:N}} p(z_{1:D,1:N}, \theta_{1:D}, \mu_{1:K}, \kappa_{1:K}, x_{1:D,1:N})}

One way to approximate the posterior is variational inference. Again, we denote the variational distribution by q. First of all, we introduce variational parameters for each latent variable to break the dependencies among them:

q(z, \theta, \mu, \kappa) = q(z)\, q(\theta)\, q(\mu)\, q(\kappa)

For the variables z, θ, and µ, we introduce the variational parameters λ, ρ, and (ψ, γ), such that (Gopal and Yang 2014):

q(z_{dn}) = \mathrm{Mult}(z_{dn} \mid \lambda_{dn}), \quad d = 1, \dots, D, \; n = 1, \dots, N
q(\theta_d) = \mathrm{Dir}(\theta_d \mid \rho_d), \quad d = 1, \dots, D
q(\mu_k) = \mathrm{vMF}(\mu_k \mid \psi_k, \gamma_k), \quad k = 1, \dots, K
The update equations for these variational parameters are the following (Gopal and Yang 2014):

R_k = E_q[\kappa_k] \sum_{n=1}^{N} E_q[z_{nk}]\, x_n + C_0 \mu_0
\psi_k = \frac{R_k}{\|R_k\|}, \qquad \gamma_k = \|R_k\|
\lambda_{dnk} \propto \exp\{E_q[\log \mathrm{vMF}(x_{dn} \mid \mu_k, \kappa_k)] + E_q[\log \theta_{dk}]\}
E_q[\log \mathrm{vMF}(x_n \mid \mu_k, \kappa_k)] = E_q[\log C_M(\kappa_k)] + E_q[\kappa_k]\, x_n^T E_q[\mu_k]
For the concentration parameter κ_k, variational inference is not directly possible because there is no conjugate prior for the log-Normal distribution. Thus, we approximate its posterior with a sampling method.
However, a sampling method needs the true conditional distribution, and all we have is the approximate posterior given by the variational parameters. We can approximate the conditional distribution using expectations under the variational posterior (Gopal and Yang 2014):
P(\kappa_k \mid X, m, \sigma^2, \mu_0, C_0, \alpha) \propto P(\kappa_k, X \mid m, \sigma^2, \mu_0, C_0, \alpha)
\approx E_q\big[P(\kappa_k, X, Z, \mu \mid m, \sigma^2, \mu_0, C_0, \alpha)\big]
\geq \exp\big\{E_q[\log P(\kappa_k, X, Z, \mu \mid m, \sigma^2, \mu_0, C_0, \alpha)]\big\}
\propto \exp\left\{\sum_{n=1}^{N} E_q[z_{nk}] \log C_M(\kappa_k) + \kappa_k \sum_{n=1}^{N} E_q[z_{nk}]\, x_n^T E_q[\mu_k]\right\} \times \text{logNormal}(\kappa_k \mid m, \sigma^2) \quad (4.1)

Therefore, we can use Metropolis-Hastings sampling with the log-Normal distribution as the proposal distribution.
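A minimal sketch of such a Metropolis-Hastings sampler for κ_k, specialized to M = 3 so that log C_M(κ) has a closed form (the sufficient statistics n_k and s_k, the step size, and all names are illustrative assumptions, not the authors' code):

```python
import math
import random

def log_target(kappa, n_k, s_k, m, sigma):
    """Unnormalized log posterior of kappa_k from Eq. (4.1), for M = 3
    where log C_3(kappa) = log(kappa) - log(4*pi*sinh(kappa)).
    n_k = sum_n E[z_nk];  s_k = sum_n E[z_nk] * x_n^T E[mu_k]."""
    log_c = math.log(kappa) - math.log(4 * math.pi) - math.log(math.sinh(kappa))
    # log-Normal(m, sigma^2) prior density, up to a constant
    log_prior = -math.log(kappa) - (math.log(kappa) - m) ** 2 / (2 * sigma ** 2)
    return n_k * log_c + kappa * s_k + log_prior

def mh_sample_kappa(n_k, s_k, m=1.0, sigma=1.0, steps=2000, seed=0):
    """Metropolis-Hastings with a multiplicative (log-scale random walk)
    proposal: kappa' = kappa * exp(eps), eps ~ Normal(0, 0.1)."""
    random.seed(seed)
    kappa = math.exp(m)                   # start at the prior median
    for _ in range(steps):
        prop = kappa * math.exp(random.gauss(0.0, 0.1))
        # the proposal is symmetric in log space; the extra log(prop/kappa)
        # term is the Jacobian correction for proposing on the log scale
        log_accept = (log_target(prop, n_k, s_k, m, sigma) + math.log(prop)
                      - log_target(kappa, n_k, s_k, m, sigma) - math.log(kappa))
        if random.random() < math.exp(min(0.0, log_accept)):
            kappa = prop
    return kappa
```

In practice one would keep the whole chain (after burn-in) rather than only the final state, and use its average to estimate E_q[κ_k].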
Chapter 5
Experiment
Before doing anything on the test set, we train on an online Wikipedia corpus to obtain the values of the global topic variables. For the LDA method, we update λ, the variational parameter for β, through the training process. For the vMF Clustering method, we obtain the word embedding vectors from the same Wikipedia corpus. However, the variables µ_k and κ_k, which determine the distribution of topic k, are not trained online beforehand; instead, we estimate them on the test set.
Combined with the topic distributions, the topic assignment of each test document can then be computed, using the topic proportions of each article. All the results below were obtained after topic assignment was completed.
5.1 Topic Assignment of Words
For each topic, we count the number of times each word has been assigned to that topic. We choose the number of topics to be ten, and obtain the following lists of words assigned to each topic. Among all documents in the test set, the words in each list have the highest counts for that topic. Notice that such a list is not a topic itself; it is just the topic assignment over all articles in the test set, and these articles contain many "junk words" with no semantic meaning, e.g. "www", "http", and "ref".
LDA
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
http ref http ref http
ref www bank france date
www date ref code ref
date jordan world date www
german time case language october
july key category game november
band work goals work band
bank american language record page
letter people london england ing
case programming date national year
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
ref http ref ref ref
www ref http http www
http www www code world
june date germany date http
bank world date world states
year swimming year swimming america
file year world letter james
music key work language year
link national book time key
july case band key case
vMF Clustering
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
letter high date history swimming
books system world page record
author power year text game
book data time article player
james object states press final
david convert june journal top
award support april version tour
letter high july law university
english model march life gold
john energy october science major
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
case key code bank ref
category work language national http
left band list main www
number music programming file pdf
type group including international retrived
single style small link index
set led company global net
results song languages center ing
individual made class financial external
called records format source em
Comparing the two tables from the two methods, we see that in the LDA method some "junk words" show up many times across different columns, because there are many "junk words" in the test articles. In the vMF Clustering method, however, each word shows up at most once within a single column. We can conclude that in the vMF Clustering method each word tends to be assigned to only one topic regardless of context, while in LDA each word can be assigned to different topics depending on the document it belongs to. Therefore, for the vMF method, the topic assignment of a word depends on the semantics of the word itself.
From these two tables, we see that the vMF Clustering method performs better in some sense: it deals with a large proportion of "junk words" by assigning them all to a single topic.
5.2 Word Cooccurrence
In this context, word cooccurrence is defined as two adjacent words in a document that are assigned the same topic. Word cooccurrence characterizes a topic better than single words, because a word can have different meanings in different topics, and a pair of words occurring together takes the context into account. In the following two tables, I show the most frequent cooccurring word pairs within each topic using the LDA and vMF methods, respectively.
Here is the result from the LDA:
Though the two tables show different word pairs, we see that both capture pairs that are representative of a topic. For the LDA method, the "junk words" do not show up as frequently as in the previous single-word table. Therefore, we can conclude that the cooccurrence table represents a topic better than the single-word table does.
5.3 PMI
Our final criterion for evaluating a topic modeling method is Pointwise Mutual Information (PMI). It is defined by the following formula:

\mathrm{PMI}_k(w_i, w_j) = \log \frac{p_k(w_i, w_j)}{p_k(w_i)\, p_k(w_j)}

Here p_k(w_i, w_j) is the probability of the two words cooccurring within topic k, and p_k(w_i) and p_k(w_j) are the probabilities of words w_i and w_j occurring in topic k. The higher the PMI, the better the topic model performs: the more likely it is that two adjacent words are assigned to the same topic, the better the topic model. Here are the results of the highest PMI values for each topic for the LDA and vMF Clustering methods:
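The per-topic PMI above can be computed from adjacency counts; a minimal sketch (the estimator, counting adjacent pairs among the words assigned to a topic, is one simple reading of the definitions above, and the function name is my own):

```python
import math
from collections import Counter

def topic_pmi(assigned_words):
    """PMI of adjacent word pairs within one topic.

    assigned_words -- the sequence of words assigned to topic k, in
    document order; adjacent pairs count as cooccurrences.
    Returns a dict mapping (w_i, w_j) -> PMI_k(w_i, w_j)."""
    word_counts = Counter(assigned_words)
    pair_counts = Counter(zip(assigned_words, assigned_words[1:]))
    n_words = len(assigned_words)
    n_pairs = max(len(assigned_words) - 1, 1)
    pmi = {}
    for (wi, wj), c in pair_counts.items():
        p_ij = c / n_pairs
        p_i = word_counts[wi] / n_words
        p_j = word_counts[wj] / n_words
        pmi[(wi, wj)] = math.log(p_ij / (p_i * p_j))
    return pmi
```

Sorting the resulting dictionary by value and keeping the top pairs per topic yields tables of the kind reported here.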
Chapter 6
Conclusion
LDA and vMF Clustering are two representative topic modeling methods. In assigning a topic to a word, both methods take into account two factors: the topic distribution over the vocabulary and the document-specific topic proportions. Both use variational inference to approximate the distribution of the latent variables through online training. Compared with LDA, vMF Clustering considers the semantic meaning of each word, and thus weighs the word itself more than the context of a specific document when doing topic assignment. One thing we can say for sure is that vMF Clustering is better at dealing with "junk words".
Bibliography
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003).
D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009.
M. Hoffman, D. Blei, and F. Bach. Online Learning for Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems 23, 2010.
Siddharth Gopal and Yiming Yang. Von Mises-Fisher Clustering Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, Vol. 32, 2014.