M.S. in Data Science - Writing Sample
Topic Modeling
1. Introduction
A Wikipedia article is a long, descriptive write-up, typically structured under different subheadings. A news article, on the other hand, focuses only on the critical issues that matter to its readers. Such drastic dissimilarity is just one of many challenges for Topic Models. Other small yet significant challenges include article size, depth of the topic, and intended audience. Naturally, different applications of Topic Modeling call for different levels of topic supervision. Of the many state-of-the-art approaches to Topic Modeling, the most widely used are Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). These are unsupervised techniques that do not need labeled training data. In these approaches, each document may be viewed as a mixture of various topics, and each topic as a mixture of various words. For example, an LDA model may have topics that can be classified as 'CAT_related' and 'DOG_related.' A topic has probabilities of generating various words, such as milk, meow, and kitten, which a viewer can interpret as 'CAT_related.' Naturally, the word 'cat' itself will have high probability given this topic [excerpt adapted from Wikipedia]. However, these approaches need a human (user) to label the topics: for a topic with a collection of words such as meow, milk, and kitten, a human would label the topic 'CAT_related.' For an application that requires narrow supervision of topics (for example, supervision over 'pop,' 'rock,' or 'heavy metal' subtypes rather than a single 'Music' label), manual labeling becomes a daunting task. Moreover, humans are prone to errors.
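To make the labeling burden concrete, here is a minimal sketch using gensim; the toy documents and parameter values are illustrative assumptions, not part of our pipeline. LDA returns anonymous word distributions, and a human must still read the top words of each topic to decide that it is 'CAT_related.'

```python
# A minimal sketch (illustrative only): LDA yields unlabeled topics;
# a human must still inspect the top words and name each topic.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["milk", "meow", "kitten", "cat"],         # toy, pre-tokenized documents
        ["bark", "puppy", "bone", "dog"],
        ["meow", "cat", "kitten", "purr"],
        ["dog", "puppy", "bark", "fetch"]]

dictionary = Dictionary(docs)                       # term <-> ID mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # (term_id, frequency) tuples

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id in range(2):
    # show_topic returns (word, probability) pairs; the labels
    # 'CAT_related' / 'DOG_related' would be assigned by a person.
    print(topic_id, lda.show_topic(topic_id, topn=4))
```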
For our application, we use a semi-supervised approach in which we take a set of 5000 topics defined by Prismatic. We then extract Wikipedia content for these 5000 topics to use as a training sample, and apply general rules and a multi-level scoring approach to achieve high precision in topic classification. This paper discusses the approach taken to Topic Modeling for our application.
2. Overview of TF-IDF weighting
Term Frequency–Inverse Document Frequency (TF-IDF), a numerical statistic, reflects how important a word is to a document in a collection or corpus. TF-IDF weighs the relative frequency of a term in a document against the inverse proportion of documents containing that term over the entire corpus [excerpt adapted from Juan Ramos' paper on TF-IDF].
TF-IDF = TF * IDF
TF = 1 + log(f(t, d))
IDF = 1 + log10(D / f(t, D))
where f(t, d) is the frequency of the term t in a document d, D is the size of the corpus, f(t, D) is the number of documents in which the term t occurs, log is the natural logarithm, and log10 is the base-10 logarithm. Assuming that the term 'presid unit state' has frequency 4 in the document d and occurs in 10 documents out of a total of 5000, the TF-IDF weight for 'presid unit state' is:
TF-IDF = (1 + log (4)) * (1 + log10 (5000/10)) = 8.8268
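This worked example can be reproduced directly; note that matching the 8.8268 figure requires the natural logarithm in TF and the base-10 logarithm in IDF. A minimal sketch:

```python
import math

def tf_idf(term_freq: int, docs_with_term: int, corpus_size: int) -> float:
    """TF-IDF as defined above: TF uses the natural log, IDF uses log10."""
    tf = 1 + math.log(term_freq)                         # TF = 1 + log(f(t, d))
    idf = 1 + math.log10(corpus_size / docs_with_term)   # IDF = 1 + log10(D / f(t, D))
    return tf * idf

# 'presid unit state': frequency 4 in d, present in 10 of 5000 documents
print(round(tf_idf(4, 10, 5000), 4))  # 8.8268
```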
3. Overview of Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. It is a widely used technique to
compute the similarity between two documents, where each document is a vector of TF-IDF weights.
Fig 1.0: Vector Space Model
Figure 1.0 shows a two-dimensional vector space in which d1 and d2 are documents and q is the query. The Euclidean distance between the query q and the document d2 is large compared to the distance between q and d1. However, the distribution of terms in the query q is very similar to the distribution of terms in document d2. Cosine similarity therefore measures the angle between the query and the document rather than the distance between them.
Sim(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)

where q · d is the dot product of the query q and the document d, and q/|q| and d/|d| are unit vectors.
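A minimal sketch of the computation, with the query and document represented as dense vectors of TF-IDF weights (a sparse representation would work the same way):

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Sim(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Two vectors pointing in the same direction score ~1.0
# regardless of their Euclidean distance.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0
```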
4. Extracting and storing Wikipedia contents
Wikipedia makes extraction of relevant keywords and phrases relatively easy by presenting them in bold, italic, and hyperlink styles. However, being a knowledge hub, Wikipedia articles define terms differently than news articles do. For instance, a Wikipedia article on President Barack Obama has terms such as '44th President of the United States,' whereas a news article might just have 'President of the United States.' We use different regex operations to extract the relevant terms from a Wikipedia article. As a preprocessing step, we collect n-grams for each term (especially for terms containing a preposition) and perform word-level stemming and stop-word removal. After preprocessing, the phrase '44th President of the United States' becomes a collection of terms such as '44th,' 'presid,' 'unit,' 'state,' 'presid unit state,' 'unit state,' etc. The preprocessing step results in a very large dictionary (a dictionary is the collection of all unique terms). In order to reduce the size of the dictionary, we write general rules to remove irrelevant terms; for instance, we remove all terms having more than five words, terms containing no alphabetic characters, and terms that occur only once in the dictionary. We store all the extracted Wikipedia content in a 'Topic Corpus,' where each row is a document and each document is a collection of term IDs and their frequencies of occurrence in the document. A Topic Corpus with three documents looks like:
Doc ID   Topic Corpus
Doc 1    [(0,2), (1,3), (2,4), (3,1), (4,1)]
Doc 2    [(0,1), (1,2), (4,2), (5,1), (6,2), (7,5), (8,2)]
Doc 3    [(2,2), (4,1), (6,3), (7,2), (9,2)]

Doc ID   Topic
Doc 1    Topic 1
Doc 2    Topic 2
Doc 3    Topic 3
In the above tables, (0, 2) and (1, 3) are tuples, where 0 is the term ID and 2 is the frequency of the term
with ID 0 in Document 1.
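A minimal sketch of how such (term ID, frequency) rows could be produced; the stemmer, the stop-word list, and the n-gram choices are stand-ins for the regex and rule-based pipeline described above.

```python
from collections import Counter
from nltk.stem import PorterStemmer   # assumption: NLTK-style Porter stemming

STOPWORDS = {"the", "of", "a", "is"}  # illustrative stop-word list
stemmer = PorterStemmer()

def preprocess(phrase: str) -> list[str]:
    """Stem, drop stop-words, and add word n-grams for a phrase."""
    tokens = [stemmer.stem(w) for w in phrase.lower().split()
              if w not in STOPWORDS]
    ngrams = [" ".join(tokens[i:j])              # e.g. 'presid unit state'
              for i in range(len(tokens))
              for j in range(i + 2, len(tokens) + 1)]
    return tokens + ngrams

term_ids: dict[str, int] = {}                    # the dictionary of unique terms

def to_bow(terms: list[str]) -> list[tuple[int, int]]:
    """Map terms to IDs and return (term_id, frequency) tuples, one row
    of the Topic Corpus."""
    for t in terms:
        term_ids.setdefault(t, len(term_ids))
    counts = Counter(term_ids[t] for t in terms)
    return sorted(counts.items())

print(preprocess("44th President of the United States"))
# ['44th', 'presid', 'unit', 'state', '44th presid', ..., 'presid unit state', 'unit state']
```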
5. Finding the First Set of Topics
A Topic Corpus is a collection of the term IDs and their frequency of all Wikipedia documents where each
document has a topic label. For a news article, we preprocess the article with the same steps that we used for
preprocessing the Wikipedia documents. Additionally, we implement few other preprocessing steps to
capture relevant phrases. Then, we provide TF-IDF weights to each term extracted from the news article.
For all the documents in the Topic Corpus, we provide TF-IDF weights to the terms and calculate their
similarity score with the news article using cosine similarity. We then store the top 12 topics with their
relative score for further processing.
The output for the top 12 topics looks like:
{'Topic1': score1, 'Topic2': score2, ..., 'Topic12': score12}
After determining the first set of topics, we store the news article in a News Corpus, where each row is a document and each document is a collection of term IDs and their frequencies of occurrence in the document.
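A minimal sketch of this first scoring pass, using scikit-learn's TfidfVectorizer in place of the custom weighting above (its TF-IDF formula differs slightly from ours); topic_docs and article are placeholders for the Topic Corpus texts and the preprocessed news article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def first_set_of_topics(topic_docs: dict[str, str], article: str,
                        k: int = 12) -> dict[str, float]:
    """Score every topic document against the article; keep the top k."""
    labels = list(topic_docs)
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(topic_docs[t] for t in labels)
    article_vec = vectorizer.transform([article])
    scores = cosine_similarity(article_vec, doc_matrix)[0]
    ranked = sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)
    return dict(ranked[:k])

# e.g. {'Topic1': 0.31, 'Topic2': 0.27, ..., 'Topic12': 0.05}
```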
Limitations of TF-IDF weighting: The TF-IDF model treats terms and documents independently. For example, the words 'snake' and 'python' are treated independently in TF-IDF even though they have a strong relationship in the real world.
For the query 'Python is a huge snake. I saw a Python yesterday,' the top 5 topics classified using TF-IDF weighting and their similarity scores with the query are:
Topic                      Score
Python (Prog. Language)    0.13
Django (Web Framework)     0.0943
CoffeeScript               0.0908
CUDA                       0.09019
Snake                      0.088
Table 1.0: Topics and their relative scores using TF-IDF weighting
The topics Python (programming language), Django (web framework), CoffeeScript, and CUDA have higher similarity scores with the query than the topic Snake, even though the query talks about a python as a snake. This is because of the very high frequency of the word 'python' in the top four documents and the relatively low frequency of the words 'python' and 'snake' in the topic Snake. This is a general problem with TF-IDF weighting. To solve it, we use a second scoring technique in which we redistribute weights to the terms based on their relevance in a Topic Cluster.
6. Redistribution of Weights using Topic Clusters
The two purposes of redistributing weights using Topic Clusters are:
• To determine new scores for the topics based on the relevance of each term in a Topic Cluster.
• To discover new relevant terms that are not present in the classified topic but have very high weights in the Topic Cluster. This approach can be used to augment the term list for a given topic.
6.1 Topic Cluster: A Topic Cluster is a collection of similar documents with the topic as the centroid. We use the cosine similarity measure to determine the similarity between documents. For example, for the topic 'Snake,' articles about 'Reptiles' or 'Mammals' would fall under its Topic Cluster. Similarly, for the topic Python (programming language), articles on programming languages and related topics would fall under its Topic Cluster.
Fig 1.1: Topic Cluster of the Topic Snake; Fig 1.2: Topic Cluster of the Topic Python (Prog. Language)
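A minimal sketch of how documents could be assigned to Topic Clusters; here each topic's own Wikipedia vector serves as the centroid, and every document joins the cluster of its most similar centroid. The TF-IDF matrices are assumed inputs.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_topic_clusters(centroids: np.ndarray,
                         docs: np.ndarray) -> dict[int, list[int]]:
    """Assign each document (a row of `docs`) to its nearest topic centroid
    by cosine similarity."""
    sims = cosine_similarity(docs, centroids)      # shape: (n_docs, n_topics)
    clusters: dict[int, list[int]] = {t: [] for t in range(centroids.shape[0])}
    for doc_id, row in enumerate(sims):
        clusters[int(row.argmax())].append(doc_id)
    return clusters
```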
6.2 Determining new scores: We use the formulas given below to calculate the relevance of the terms in a Topic Cluster and to determine the new scores of the topics for the given query terms.
The relevance of a term t in the Topic Cluster Tc of topic T is given as:
Relevance(t) = P(t|Tc) * log( Σ_{T' ∈ Tc'} P(t|T')^(-1) )
The new score of the topic T for the query q is given as:

New Topic Score(T) = log( Σ_{t ∈ q} Relevance(t) )
where, for a topic T, t is a query term present in that topic, Tc is the Topic Cluster formed by the topic T, and Tc' is the set of all other Topic Clusters.
P(t|Tc) = (number of documents in the Topic Cluster Tc in which the term t occurs) / (total number of documents in the Topic Cluster Tc)
The first probability measure, P(t|Tc), gives high weight to terms that occur in a large number of documents in a Topic Cluster, while the measure Σ_{T' ∈ Tc'} P(t|T')^(-1) reduces the weight of terms that occur across many Topic Clusters. For the query 'Python is a huge snake. I saw a Python yesterday,' the term 'snake' receives a higher weight because it is present only in the Topic Cluster of Snake, whereas the term 'python' receives a relatively lower weight because it is present in almost all the Topic Clusters (Snake, Python (Prog. Language), Django, CoffeeScript, CUDA, and others).
For the query 'Python is a huge snake. I saw a Python yesterday,' the top 5 topics and their scores using the second scoring measure are:

Topic                      Score
Snake                      1.91
Python (Prog. Language)    0.92
Ruby (Prog. Language)      0.836
CoffeeScript               0.759
CUDA                       0.7542
Table 1.1: Topics and their relative scores using Topic Cluster
6.3 Discovering relevant terms: After determining the weights of all the terms of a topic using the steps defined in section 6.2, we collect all the terms that have high weight (exceeding a certain threshold) but are not present in the topic. For example, if the term 'venomous' is not present in the term list of the topic 'Snake' but has a very high weight in its Topic Cluster, we say that 'venomous' has a strong relationship with the topic 'Snake,' and therefore we augment the topic with the new term. This step is done only for the top two topics, upon user verification.
The clustering process is time-consuming, so it is scheduled as a background job.
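A minimal sketch of the discovery step: collect every cluster term whose weight exceeds a threshold but which is missing from the topic's own term list. The threshold value is illustrative.

```python
def discover_terms(topic_terms: set[str],
                   cluster_term_weights: dict[str, float],
                   threshold: float = 0.5) -> list[str]:
    """Return high-weight cluster terms absent from the topic's term list,
    e.g. 'venomous' for the topic 'Snake'. Candidates still require user
    verification before the topic is augmented."""
    return sorted(t for t, w in cluster_term_weights.items()
                  if w > threshold and t not in topic_terms)
```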
7. Calculating final scores
We collect the scores obtained from both the TF-IDF scoring and the second scoring approach and simply multiply them to obtain the final scores of the topics. The final scores of the top 5 topics for the query 'Python is a huge snake. I saw a Python yesterday.' are:
Topic                      Score
Snake                      0.169
Python (Prog. Language)    0.128
Ruby (Prog. Language)      0.0716
CoffeeScript               0.069
CUDA                       0.068
Table 1.2: Topics and their final scores
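A minimal sketch of the combination: the two score dictionaries are multiplied topic by topic (topics missing from either pass default to 0) and re-ranked.

```python
def final_scores(tfidf_scores: dict[str, float],
                 cluster_scores: dict[str, float]) -> dict[str, float]:
    """Final score = TF-IDF score * Topic Cluster score, per topic."""
    topics = set(tfidf_scores) | set(cluster_scores)
    combined = {t: tfidf_scores.get(t, 0.0) * cluster_scores.get(t, 0.0)
                for t in topics}
    return dict(sorted(combined.items(), key=lambda p: p[1], reverse=True))

print(final_scores({"Snake": 0.088}, {"Snake": 1.91}))  # {'Snake': ~0.168}
```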
8. Experiment Results and Findings
We ran the whole process on more than 5000 news articles and manually evaluated a set of 100 articles belonging to four different categories.
Category 1: The first set of articles we evaluated were about Facebook, Mark Zuckerberg, and the 'Dislike' button. For almost 50% of the documents, the TF-IDF scoring ranked the topic 'Button' highest, mainly because of the very high frequency of the term 'button' in the news articles and the shorter length of the Wikipedia article on the topic 'Button.' The cumulative score, on the other hand, ranked the topics 'Mark Zuckerberg,' 'Facebook,' and 'Social Media' higher than the topic 'Button.'
For a group of articles that discussed the introduction of the 'Dislike' button and how this option could increase cyberbullying, TF-IDF performed poorly, giving the topic 'Cyberbullying' a lower score than less relevant topics. The cumulative approach, on the other hand, scored 'Cyberbullying' high and ranked it in the top 3.
Category 2: The second set of articles we evaluated were mostly related to finance and investment. For almost all articles with moderate content, the cumulative scoring approach classified the first three topics precisely. The misclassified articles mainly fell into the category of shorter articles, ranging from 50 to 120 words and lacking relevant content. The top four topics in most cases were 'Investments,' 'Finance,' 'Investment Banking,' and 'Security (finance).'
Category 3: Our third set of articles were about the new image of Pluto released by NASA. These articles abounded with relevant entities, such as Pluto, NASA, New Horizons, Dwarf planet, and others. Since the articles were rich in content, both scoring approaches classified the topics well. However, for roughly 30% of the articles, the topic NASA was ranked between positions 4 and 7, mainly because the terms in the topic NASA received both a low IDF weight and a low Topic Cluster weight.
Category 4: The fourth set consisted of randomly selected news articles, and approximately 85% of the articles were classified to the user's satisfaction.
For articles with good content and term presence, TF-IDF did a good job of classifying the topics; however, for articles mixing many topics, where the frequency of the terms pertaining to each topic was low, TF-IDF failed to capture the topics. For almost every article, the cumulative approach did better than plain TF-IDF in ranking the topics by their relevance.
The classification using the cumulative approach suffered mainly for three reasons:
1. The Wikipedia contents were not very news-friendly. We believe augmenting the Topic Corpus with relevant terms extracted from news articles might help tackle this problem.
2. Very short articles with fewer terms in common scored higher than longer articles with more terms in common.
3. The limited number of topics posed another problem.
8.1 Clustering on both the Topic Corpus and the News Corpus: Once we had a set of classified news articles in the News Corpus, we ran the clustering job, merging the Topic Corpus (Wikipedia articles) and the News Corpus, and evaluated the new clusters that formed. We found that many news articles attached to the Topic Clusters of the topics to which they were similar.
Fig 1.3: Topic Cluster of the Topic Mark Zuckerberg, clustered on Wiki articles; Fig 1.4: Topic Cluster of the Topic Mark Zuckerberg, clustered on Wiki articles and News articles
Fig 1.3 shows the Topic Cluster of the topic 'Mark Zuckerberg' before news articles were involved in clustering; there are only a few documents in the cluster. Fig 1.4 shows the Topic Cluster after news articles were involved in clustering; the number of documents in the cluster has increased.
The increase in cluster size with more news articles may help refine the process of discovering relevant terms discussed in section 6.3. As the number of news articles in a Topic Cluster increases, the chance of getting relevant terms from the news articles also increases. Augmenting the topics with these terms should improve classification precision in the future.
Our experiments on the new Topic Cluster formed by the topic Mark Zuckerberg identified the terms 'CEO Mark Zuckerberg,' 'founder,' 'chief,' 'like button,' 'dislike button,' 'idea,' 'Facebook status,' 'companies,' 'realist,' and others as relevant to the topic. We believe that adding these terms to the topic Mark Zuckerberg would increase our classification precision in the future.
9. Determining new topics
We are currently working on different sets of rules to discover new potential topics. For starters, we collect all the terms in quotes and all the named entities that have high weight in a Topic Cluster. Once a potential topic is verified by a user, we fetch the Wikipedia content for the topic and store it in the Topic Corpus.
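A minimal sketch of the candidate-collection rules, using a regex for quoted terms and spaCy's pre-trained pipeline for named entities; the model name and the weight threshold are illustrative assumptions.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")   # assumption: a small pre-trained NER model

def candidate_topics(text: str, term_weights: dict[str, float],
                     threshold: float = 0.5) -> set[str]:
    """Collect quoted terms and named entities with high cluster weight.
    Candidates still require user verification before Wikipedia content
    is fetched and stored in the Topic Corpus."""
    quoted = set(re.findall(r"['\"]([^'\"]+)['\"]", text))   # terms in quotes
    entities = {ent.text for ent in nlp(text).ents}          # named entities
    return {t for t in quoted | entities
            if term_weights.get(t.lower(), 0.0) > threshold}
```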
10. Conclusion
In this paper we present a multi-scoring approach that combines TF-IDF weighting with term weighting based on relevance in a Topic Cluster. We present the drawback of TF-IDF weighting and show how the second scoring approach assists in improving the classification. Experiments conducted with different sets of news articles show that the cumulative scoring results in better classification in almost every case. Finally, we present an intuition, and a scenario, for how the term weighting in Topic Clusters can assist in discovering new relevant terms for a topic.