M.S. in Data Science - Writing Sample
Topic Modeling
1. Introduction
A Wikipedia article is a long, descriptive write-up, typically structured under different subheadings. A news article, on the other hand, focuses only on the critical issues that matter to its readers. Such drastic dissimilarity is just one of many challenges for Topic Models. Other small yet significant challenges include article size, depth of the topic, and intended audience. Naturally, different applications of Topic Modeling call for different levels of topic supervision. Of the many state-of-the-art approaches to Topic Modeling, the most widely used are Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). These are unsupervised techniques that do not need labeled training data. In these approaches, each document may be viewed as a mixture of various topics, and each topic as a mixture of various words. For example, an LDA model may have topics that can be classified as 'CAT_related' and 'DOG_related.' A topic has probabilities of generating various words, such as milk, meow, and kitten, which a viewer can interpret as 'CAT_related.' Naturally, the word 'cat' itself will have high probability given this topic [excerpt adapted from Wikipedia]. However, these approaches need a human (user) to label the topics: for a topic with a collection of words such as meow, milk, and kitten, a human would label the topic 'CAT_related.' For an application that requires narrow supervision of topics (for example, supervision over 'pop,' 'rock,' or 'heavy metal' subtypes rather than a single 'Music' label), manual labeling becomes a daunting task. Moreover, humans are prone to errors.
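To make the labeling burden concrete, here is a minimal sketch using gensim; the toy documents and parameter values are illustrative assumptions, not part of our pipeline. LDA returns anonymous word distributions, and a human must still read the top words of each topic to decide that it is 'CAT_related.'

```python
# A minimal sketch (illustrative only): LDA yields unlabeled topics;
# a human must still inspect the top words and name each topic.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["milk", "meow", "kitten", "cat"],         # toy, pre-tokenized documents
        ["bark", "puppy", "bone", "dog"],
        ["meow", "cat", "kitten", "purr"],
        ["dog", "puppy", "bark", "fetch"]]

dictionary = Dictionary(docs)                       # term <-> ID mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # (term_id, frequency) tuples

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id in range(2):
    # show_topic returns (word, probability) pairs; the labels
    # 'CAT_related' / 'DOG_related' would be assigned by a person.
    print(topic_id, lda.show_topic(topic_id, topn=4))
```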
For our application, we use a semi-supervised approach in which we take a set of 5000 topics defined by Prismatic. We then extract Wikipedia content for these 5000 topics to use as a training sample, and apply general rules and a multi-level scoring approach to achieve high precision in topic classification. This paper discusses the approach taken to Topic Modeling for our application.
2. Overview of TF-IDF weighting
Term Frequency–Inverse Document Frequency (TF-IDF), a numerical statistic, reflects how important a word is to a document in a collection or corpus. TF-IDF weighs the relative frequency of a term in a document against the inverse proportion of documents containing that term over the entire corpus [excerpt adapted from Juan Ramos' paper on TF-IDF].
TF-IDF = TF * IDF
TF = 1 + log(f(t, d))
IDF = 1 + log10(D / f(t, D))
where f(t, d) is the frequency of the term t in a document d, D is the size of the corpus, f(t, D) is the number of documents in which the term t occurs, log is the natural logarithm, and log10 is the base-10 logarithm. Assuming that the term 'presid unit state' has frequency 4 in the document d and occurs in 10 documents out of a total of 5000, the TF-IDF weight for 'presid unit state' is:
TF-IDF = (1 + log (4)) * (1 + log10 (5000/10)) = 8.8268
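This worked example can be reproduced directly; note that matching the 8.8268 figure requires the natural logarithm in TF and the base-10 logarithm in IDF. A minimal sketch:

```python
import math

def tf_idf(term_freq: int, docs_with_term: int, corpus_size: int) -> float:
    """TF-IDF as defined above: TF uses the natural log, IDF uses log10."""
    tf = 1 + math.log(term_freq)                         # TF = 1 + log(f(t, d))
    idf = 1 + math.log10(corpus_size / docs_with_term)   # IDF = 1 + log10(D / f(t, D))
    return tf * idf

# 'presid unit state': frequency 4 in d, present in 10 of 5000 documents
print(round(tf_idf(4, 10, 5000), 4))  # 8.8268
```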
3. Overview of Cosine Similarity
Cosine similarity measures the cosine of the angle between two vectors. It is a widely used technique to
compute the similarity between two documents, where each document is a vector of TF-IDF weights.
Fig 1.0: Vector Space Model
Figure 1.0 shows a two-dimensional vector space in which d1 and d2 are documents and q is the query. The Euclidean distance between the query q and the document d2 is large compared to the distance between q and d1. However, the distribution of terms in the query q is very similar to the distribution of terms in document d2. Cosine similarity therefore measures the angle between the query and the document rather than the distance between them.
Sim(q, d) = (q · d) / (|q| |d|) = (q / |q|) · (d / |d|)

where q · d is the dot product of the query q and the document d, and q/|q| and d/|d| are unit vectors.
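A minimal sketch of the computation, with the query and document represented as dense vectors of TF-IDF weights (a sparse representation would work the same way):

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Sim(q, d) = (q . d) / (|q| |d|)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

# Two vectors pointing in the same direction score ~1.0
# regardless of their Euclidean distance.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # ~1.0
```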
4. Extracting and storing Wikipedia contents
Wikipedia makes extraction of relevant keywords and phrases relatively easy by presenting them in bold, italic, and hyperlink styles. However, being a knowledge hub, Wikipedia articles define terms differently than news articles do. For instance, a Wikipedia article on President Barack Obama has terms such as '44th President of the United States,' whereas a news article might just have 'President of the United States.' We use different regex operations to extract the relevant terms from a Wikipedia article. As a preprocessing step, we collect n-grams for each term (especially for terms containing a preposition) and perform word-level stemming and stop-word removal. After preprocessing, the phrase '44th President of the United States' becomes a collection of terms such as '44th,' 'presid,' 'unit,' 'state,' 'presid unit state,' 'unit state,' etc. The preprocessing step results in a very large dictionary (a dictionary is the collection of all unique terms). In order to reduce the size of the dictionary, we write general rules to remove irrelevant terms; for instance, we remove all terms having more than five words, terms containing no alphabetic characters, and terms that occur only once in the dictionary. We store all the extracted Wikipedia content in a 'Topic Corpus,' where each row is a document and each document is a collection of term IDs and their frequencies of occurrence in the document. A Topic Corpus with three documents looks like:
Doc ID   Topic Corpus
Doc 1    [(0,2), (1,3), (2,4), (3,1), (4,1)]
Doc 2    [(0,1), (1,2), (4,2), (5,1), (6,2), (7,5), (8,2)]
Doc 3    [(2,2), (4,1), (6,3), (7,2), (9,2)]

Doc ID   Topic
Doc 1    Topic 1
Doc 2    Topic 2
Doc 3    Topic 3
In the above tables, (0, 2) and (1, 3) are tuples, where 0 is the term ID and 2 is the frequency of the term
with ID 0 in Document 1.
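A minimal sketch of how such (term ID, frequency) rows could be produced; the stemmer, the stop-word list, and the n-gram choices are stand-ins for the regex and rule-based pipeline described above.

```python
from collections import Counter
from nltk.stem import PorterStemmer   # assumption: NLTK-style Porter stemming

STOPWORDS = {"the", "of", "a", "is"}  # illustrative stop-word list
stemmer = PorterStemmer()

def preprocess(phrase: str) -> list[str]:
    """Stem, drop stop-words, and add word n-grams for a phrase."""
    tokens = [stemmer.stem(w) for w in phrase.lower().split()
              if w not in STOPWORDS]
    ngrams = [" ".join(tokens[i:j])              # e.g. 'presid unit state'
              for i in range(len(tokens))
              for j in range(i + 2, len(tokens) + 1)]
    return tokens + ngrams

term_ids: dict[str, int] = {}                    # the dictionary of unique terms

def to_bow(terms: list[str]) -> list[tuple[int, int]]:
    """Map terms to IDs and return (term_id, frequency) tuples, one row
    of the Topic Corpus."""
    for t in terms:
        term_ids.setdefault(t, len(term_ids))
    counts = Counter(term_ids[t] for t in terms)
    return sorted(counts.items())

print(preprocess("44th President of the United States"))
# ['44th', 'presid', 'unit', 'state', '44th presid', ..., 'presid unit state', 'unit state']
```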
5. Finding the First Set of Topics
A Topic Corpus is a collection of the term IDs and their frequency of all Wikipedia documents where each
document has a topic label. For a news article, we preprocess the article with the same steps that we used for
preprocessing the Wikipedia documents. Additionally, we implement few other preprocessing steps to
capture relevant phrases. Then, we provide TF-IDF weights to each term extracted from the news article.
For all the documents in the Topic Corpus, we provide TF-IDF weights to the terms and calculate their
similarity score with the news article using cosine similarity. We then store the top 12 topics with their
relative score for further processing.
The output for the top 12 topics looks like:
{'Topic1': score1, 'Topic2': score2, ..., 'Topic12': score12}
After determining the first set of topics, we store the news article in a News Corpus, where each row is a document and each document is a collection of term IDs and their frequencies of occurrence in the document.
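A minimal sketch of this first scoring pass, using scikit-learn's TfidfVectorizer in place of the custom weighting above (its TF-IDF formula differs slightly from ours); topic_docs and article are placeholders for the Topic Corpus texts and the preprocessed news article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def first_set_of_topics(topic_docs: dict[str, str], article: str,
                        k: int = 12) -> dict[str, float]:
    """Score every topic document against the article; keep the top k."""
    labels = list(topic_docs)
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(topic_docs[t] for t in labels)
    article_vec = vectorizer.transform([article])
    scores = cosine_similarity(article_vec, doc_matrix)[0]
    ranked = sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)
    return dict(ranked[:k])

# e.g. {'Topic1': 0.31, 'Topic2': 0.27, ..., 'Topic12': 0.05}
```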
Limitations of TF-IDF weighting: The TF-IDF model treats terms and documents independently. For example, the words 'snake' and 'python' are treated independently in TF-IDF even though they have a strong relationship in the real world.
For the query 'Python is a huge snake. I saw a Python yesterday,' the top 5 topics classified using TF-IDF weighting and their similarity scores with the query are:
Topic                      Score
Python (Prog. Language)    0.13
Django (Web Framework)     0.0943
CoffeeScript               0.0908
CUDA                       0.09019
Snake                      0.088
Table 1.0: Topics and their relative scores using TF-IDF weighting
The topics Python (programming language), Django (web framework), CoffeeScript, and CUDA have higher similarity scores with the query than the topic Snake, even though the query talks about a python as a snake. This is because of the very high frequency of the word 'python' in the top four documents and the relatively low frequency of the words 'python' and 'snake' in the topic Snake. This is a general problem with TF-IDF weighting. To solve it, we use a second scoring technique in which we redistribute weights to the terms based on their relevance in a Topic Cluster.
6. Redistribution of Weights using Topic Clusters
The two purposes of redistributing weights using Topic Clusters are:
• To determine new scores for the topics based on the relevance of each term in a Topic Cluster.
• To discover new relevant terms that are not present in the classified topic but have very high weights in the Topic Cluster. This approach can be used to augment the term list for a given topic.
6.1 Topic Cluster: A Topic Cluster is a collection of similar documents with the topic as the centroid. We use the cosine similarity measure to determine the similarity between documents. For example, for the topic 'Snake,' articles about 'Reptiles' or 'Mammals' would fall under its Topic Cluster. Similarly, for the topic Python (programming language), articles on programming languages and related topics would fall under its Topic Cluster.
Fig 1.1: Topic Cluster of the Topic Snake; Fig 1.2: Topic Cluster of the Topic Python (Prog. Language)
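A minimal sketch of how documents could be assigned to Topic Clusters; here each topic's own Wikipedia vector serves as the centroid, and every document joins the cluster of its most similar centroid. The TF-IDF matrices are assumed inputs.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def build_topic_clusters(centroids: np.ndarray,
                         docs: np.ndarray) -> dict[int, list[int]]:
    """Assign each document (a row of `docs`) to its nearest topic centroid
    by cosine similarity."""
    sims = cosine_similarity(docs, centroids)      # shape: (n_docs, n_topics)
    clusters: dict[int, list[int]] = {t: [] for t in range(centroids.shape[0])}
    for doc_id, row in enumerate(sims):
        clusters[int(row.argmax())].append(doc_id)
    return clusters
```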
6.2 Determining new scores: We use the formulas given below to calculate the relevance of the terms in a Topic Cluster and to determine the new scores of the topics for the given query terms.
The relevance of a term t in the Topic Cluster Tc of topic T is given as:
Relevance(t) = P(t|Tc) * log( Σ_{T' ∈ Tc'} P(t|T')^(-1) )
The new score of the topic T for the query q is given as:

New Topic Score(T) = log( Σ_{t ∈ q} Relevance(t) )
where, for a topic T, t is a query term present in that topic, Tc is the Topic Cluster formed by the topic T, and Tc' is the set of all other Topic Clusters.
P(t|Tc) = (number of documents in the Topic Cluster Tc in which the term t occurs) / (total number of documents in the Topic Cluster Tc)
The first probability measure, P(t|Tc), gives high weight to terms that occur in a large number of documents in a Topic Cluster, while the measure Σ_{T' ∈ Tc'} P(t|T')^(-1) reduces the weight of terms that occur across many Topic Clusters. For the query 'Python is a huge snake. I saw a Python yesterday,' the term 'snake' receives a higher weight because it is present only in the Topic Cluster of Snake, whereas the term 'python' receives a relatively lower weight because it is present in almost all the Topic Clusters (Snake, Python (Prog. Language), Django, CoffeeScript, CUDA, and others).
For the query 'Python is a huge snake. I saw a Python yesterday,' the top 5 topics and their scores using the second scoring measure are:

Topic                      Score
Snake                      1.91
Python (Prog. Language)    0.92
Ruby (Prog. Language)      0.836
CoffeeScript               0.759
CUDA                       0.7542
Table 1.1: Topics and their relative scores using Topic Cluster
6.3 Discovering relevant terms: After determining the weights of all the terms of a topic using the steps defined in section 6.2, we collect all the terms that have high weight (exceeding a certain threshold) but are not present in the topic. For example, if the term 'venomous' is not present in the term list of the topic 'Snake' but has a very high weight in its Topic Cluster, we say that 'venomous' has a strong relationship with the topic 'Snake,' and therefore we augment the topic with the new term. This step is done only for the top two topics, upon user verification.
The clustering process is time-consuming, so it is scheduled as a background job.
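A minimal sketch of the discovery step: collect every cluster term whose weight exceeds a threshold but which is missing from the topic's own term list. The threshold value is illustrative.

```python
def discover_terms(topic_terms: set[str],
                   cluster_term_weights: dict[str, float],
                   threshold: float = 0.5) -> list[str]:
    """Return high-weight cluster terms absent from the topic's term list,
    e.g. 'venomous' for the topic 'Snake'. Candidates still require user
    verification before the topic is augmented."""
    return sorted(t for t, w in cluster_term_weights.items()
                  if w > threshold and t not in topic_terms)
```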
7. Calculating final scores
We collect the scores obtained from both the TF-IDF scoring and the second scoring approach and simply multiply them to obtain the final scores of the topics. The final scores of the top 5 topics for the query 'Python is a huge snake. I saw a Python yesterday.' are:
Topic                      Score
Snake                      0.169
Python (Prog. Language)    0.128
Ruby (Prog. Language)      0.0716
CoffeeScript               0.069
CUDA                       0.068
Table 1.2: Topics and their final scores
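A minimal sketch of the combination: the two score dictionaries are multiplied topic by topic (topics missing from either pass default to 0) and re-ranked.

```python
def final_scores(tfidf_scores: dict[str, float],
                 cluster_scores: dict[str, float]) -> dict[str, float]:
    """Final score = TF-IDF score * Topic Cluster score, per topic."""
    topics = set(tfidf_scores) | set(cluster_scores)
    combined = {t: tfidf_scores.get(t, 0.0) * cluster_scores.get(t, 0.0)
                for t in topics}
    return dict(sorted(combined.items(), key=lambda p: p[1], reverse=True))

print(final_scores({"Snake": 0.088}, {"Snake": 1.91}))  # {'Snake': ~0.168}
```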
8. Experiment Results and Findings
We ran the whole process on more than 5000 news articles and manually evaluated a set of 100 articles belonging to four different categories.
Category 1: The first set of articles we evaluated were about Facebook, Mark Zuckerberg, and the 'Dislike' button. For almost 50% of the documents, the TF-IDF scoring ranked the topic 'Button' highest, mainly because of the very high frequency of the term 'button' in the news articles and the shorter length of the Wikipedia article on the topic 'Button.' The cumulative score, on the other hand, ranked the topics 'Mark Zuckerberg,' 'Facebook,' and 'Social Media' higher than the topic 'Button.'
For a group of articles that discussed the introduction of the 'Dislike' button and how this option could increase cyberbullying, TF-IDF performed poorly, giving the topic 'Cyberbullying' a lower score than less relevant topics. The cumulative approach, on the other hand, scored 'Cyberbullying' high and ranked it in the top 3.
Category 2: The second set of articles we evaluated were mostly related to finance and investment. For almost all articles with moderate content, the cumulative scoring approach classified the first three topics precisely. The misclassified articles mainly fell into the category of shorter articles, ranging from 50 to 120 words and lacking relevant content. The top four topics in most cases were 'Investments,' 'Finance,' 'Investment Banking,' and 'Security (finance).'
Category 3: Our third set of articles were about the new image of Pluto released by NASA. These articles abounded with relevant entities, such as Pluto, NASA, New Horizons, Dwarf planet, and others. Since the articles were rich in content, both scoring approaches classified the topics well. However, for roughly 30% of the articles, the topic NASA was ranked between positions 4 and 7, mainly because the terms in the topic NASA received both a low IDF weight and a low Topic Cluster weight.
Category 4: The fourth set consisted of randomly selected news articles, and approximately 85% of the articles were classified to the user's satisfaction.
For articles with good content and term presence, TF-IDF did a good job of classifying the topics; however, for articles mixing many topics, where the frequency of the terms pertaining to each topic was low, TF-IDF failed to capture the topics. For almost every article, the cumulative approach did better than plain TF-IDF in ranking the topics by their relevance.
The classification using the cumulative approach suffered mainly for three reasons:
1. The Wikipedia contents were not very news-friendly. We believe augmenting the Topic Corpus with relevant terms extracted from news articles might help tackle this problem.
2. Very short articles with fewer terms in common scored higher than longer articles with more terms in common.
3. The limited number of topics posed another problem.
8.1 Clustering on both the Topic Corpus and the News Corpus: Once we had a set of classified news articles in the News Corpus, we ran the clustering job, merging the Topic Corpus (Wikipedia articles) and the News Corpus, and evaluated the new clusters that formed. We found that many news articles attached to the Topic Clusters of the topics to which they were similar.
Fig 1.3: Topic Cluster of the Topic Mark Zuckerberg, clustered on Wiki articles; Fig 1.4: Topic Cluster of the Topic Mark Zuckerberg, clustered on Wiki articles and News articles
Fig 1.3 shows the Topic Cluster of the topic 'Mark Zuckerberg' before news articles were involved in clustering; there are only a few documents in the cluster. Fig 1.4 shows the Topic Cluster after news articles were involved in clustering; the number of documents in the cluster has increased.
The increase in cluster size with more news articles may help refine the process of discovering relevant terms discussed in section 6.3. As the number of news articles in a Topic Cluster increases, the chance of getting relevant terms from the news articles also increases. Augmenting the topics with these terms should improve classification precision in the future.
Our experiments on the new Topic Cluster formed by the topic Mark Zuckerberg identified the terms 'CEO Mark Zuckerberg,' 'founder,' 'chief,' 'like button,' 'dislike button,' 'idea,' 'Facebook status,' 'companies,' 'realist,' and others as relevant to the topic. We believe that adding these terms to the topic Mark Zuckerberg would increase our classification precision in the future.
9. Determining new topics
We are currently working on different sets of rules to discover new potential topics. For starters, we collect all the terms in quotes and all the named entities that have high weight in a Topic Cluster. Once a potential topic is verified by a user, we fetch the Wikipedia content for the topic and store it in the Topic Corpus.
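A minimal sketch of the candidate-collection rules, using a regex for quoted terms and spaCy's pre-trained pipeline for named entities; the model name and the weight threshold are illustrative assumptions.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")   # assumption: a small pre-trained NER model

def candidate_topics(text: str, term_weights: dict[str, float],
                     threshold: float = 0.5) -> set[str]:
    """Collect quoted terms and named entities with high cluster weight.
    Candidates still require user verification before Wikipedia content
    is fetched and stored in the Topic Corpus."""
    quoted = set(re.findall(r"['\"]([^'\"]+)['\"]", text))   # terms in quotes
    entities = {ent.text for ent in nlp(text).ents}          # named entities
    return {t for t in quoted | entities
            if term_weights.get(t.lower(), 0.0) > threshold}
```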
10. Conclusion
In this paper we present a multi-scoring approach that combines TF-IDF weighting with term weighting based on relevance in a Topic Cluster. We present the drawback of TF-IDF weighting and show how the second scoring approach assists in improving the classification. Experiments conducted with different sets of news articles show that the cumulative scoring results in better classification in almost every case. Finally, we present an intuition, and a scenario, for how the term weighting in Topic Clusters can assist in discovering new relevant terms for a topic.