Topic Modelling
Assigning a topic to any text
Team SMM
Mohit Sharma 201505508
Hari Naga Raghavendra Manohar 201505551
K S Chandra Reddy 201505544
Aim
To identify a small number of topics/categories that best characterize a given document.
The categories we considered are:
wiki, art, reference, people, culture, books, design, politics, technology,
psychology, interesting, wikipedia, research, religion, music, math, development,
theory, philosophy, article, language, science, programming, history and software.
Dataset
1. Around 20K Wikipedia documents, supplied as a single XML file zipped in bz2 format.
2. tagData.xml, which maps each document ID to its tags.
Approach Used:
LDA (Latent Dirichlet Allocation)
We can use LDA to classify documents under different tags. LDA divides the given corpus into a fixed number of topics and can also report which topics a document contains, and with what probability. For the experiments performed here we did not need to worry about the internal implementation of LDA: we used gensim's implementation, so we only needed to know a few points about its input and output formats. See the documentation at the following link.
https://radimrehurek.com/gensim/wiki.html
During the learning phase
INPUT:
We provide all the wiki documents in a single XML file zipped in bz2 format.
LEARNT MODEL:
A word distribution for each topic, e.g.: “topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam”
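A rough sketch of what this learning phase looks like with gensim; the dump filename, output paths, and the 300-topic setting (taken from Approach 1 below) are placeholders, not our exact configuration:

```python
from gensim.corpora import WikiCorpus, MmCorpus
from gensim.models import LdaModel

# Parse the bz2-zipped XML dump into a bag-of-words corpus.
# 'wiki-dump.xml.bz2' is a placeholder filename.
wiki = WikiCorpus('wiki-dump.xml.bz2')
MmCorpus.serialize('wiki_bow.mm', wiki)   # stream the corpus to disk once
corpus = MmCorpus('wiki_bow.mm')

# Learn the topic model and save it for the testing phase.
lda = LdaModel(corpus=corpus, id2word=wiki.dictionary, num_topics=300)
lda.save('lda_wiki.model')

# Inspect the learnt word distributions, e.g. "topic #0: 0.009*river + ..."
for topic in lda.print_topics(num_topics=5, num_words=10):
    print(topic)
```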
During the testing phase
INPUT:
We provide the document to be classified, in bag-of-words form, to the learnt model.
OUTPUT:
A topic distribution for the text, e.g.: “[(34, 0.023705742561150572), (60, 0.017830310671555303), (62, 0.023999239610385081), (83, 0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625), (116, 0.022904510579689157)]” represents the probabilities that the document falls under topics 34, 60, 62, and so on.
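Continuing the sketch above, testing an unseen document amounts to converting it to bag-of-words with the same dictionary and querying the model (the sample text here is made up):

```python
from gensim.utils import simple_preprocess

text = "Parliament passed the science funding bill after a long debate."
bow = wiki.dictionary.doc2bow(simple_preprocess(text))

# Sparse topic distribution such as [(34, 0.0237), (60, 0.0178), ...]
print(lda.get_document_topics(bow))
```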
Major challenge in classification:
As the testing-phase output shows, classifying a document into topics is fairly simple. But our aim is to classify the document under tags like “politics” or “science”, not under topic numbers.
Possible Solutions
Clearly we need some way to map the topics learnt by LDA to the most suitable tags. If we can do that, we simply test the unknown text against the learnt LDA model and report the tags corresponding to the topics it outputs. We tried two different solutions for mapping topics to tags:
1. Each LDA topic is represented by a distribution over words. We can combine those words into a query and find the best-matching document for that query on a tf-idf basis. That document should be the best match for the topic, so we map the topic to the tags of the best-matched document (see the sketch after this list).
2. We can compute the topic distribution of every document and represent each document as a topic vector. Then, for each topic, we find the closest (most similar) document and map the topic to the tags of that document.
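A minimal sketch of solution 1, the query-and-retrieve step that Approach 1 below builds on. It assumes `lda`, `dictionary`, and the training `corpus` from the learning phase; the variable names are ours:

```python
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity

# tf-idf index over the training corpus
tfidf = TfidfModel(corpus)
index = SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

topic2doc = {}
for topic_id in range(lda.num_topics):
    # Build a query from the topic's top 10 words, e.g.
    # "insurance scott samurai hipster forecasting fbi imf skeptical bass hidden"
    words = [word for word, _prob in lda.show_topic(topic_id, topn=10)]
    query_bow = dictionary.doc2bow(words)
    sims = index[tfidf[query_bow]]
    topic2doc[topic_id] = int(sims.argmax())  # best-matching training document

# topic2tag then follows by looking up each document's tags in tagData.xml
```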
Approach 1
The major steps of this approach are as follows:
1. Divide the documents into training and test data, with 4000 docs in the test set.
2. On the training data, run gensim's LDA and save the learnt model; set the number of topics to 300.
3. Save all the topics in a file and convert them to queries.
Example topic:
2016-04-06 00:05:52,466 : INFO : topic #299 (0.003): 0.014*insurance + 0.009*scott + 0.007*samurai +
0.007*hipster + 0.006*forecasting + 0.006*fbi + 0.006*imf +
0.005*skeptical + 0.005*bass + 0.005*hidden
Query corresponding to topic #299 above:
299:insurance scott samurai hipster forecasting fbi imf skeptical bass hidden
4. For each query, retrieve the most relevant document in the training set on a tf-idf basis and create a topic-to-docId mapping.
Example:
299:cae3757420fbc4008bbfe492ab0d4cb5
5. Create a topic-to-tag mapping using the docId-to-tag mapping (already available in tagData.xml) and the topic-to-docId mapping created in the step above.
Example docId to tag from tagData.xml:
cae3757420fbc4008bbfe492ab0d4cb5 : ['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free', 'person', 'music', 'encyclopedia', 'the', 'biography', 'history']
Example topic to docId:
299:cae3757420fbc4008bbfe492ab0d4cb5
Example topic to tag:
299:['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free', 'person', 'music', 'encyclopedia', 'the', 'biography', 'history']
Now each topic is mapped to multiple tags.
6. For each of the 4000 test documents, find the relevant topics using the learnt LDA model. Combine the tags corresponding to those topics and match them against the target tags already available for that document (from tagData.xml).
If even one tag matches, we say the document is correctly classified (a sketch of this check follows the example below).
Example:
Topic distribution returned by LDA for a particular doc:
[(34, 0.023705742561150572), (60, 0.017830310671555303), (62, 0.023999239610385081), (83,
0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625), (116,
0.022904510579689157), (149, 0.010136256627631658), (155, 0.045428499528247894), (162,
0.014294122339773195), (192, 0.01315170635603234), (193, 0.055764500858303222), (206,
0.015174121956574787), (240, 0.052498569359746373), (243, 0.016285345117555323), (247,
0.019478047862044864), (255, 0.018193391082926114), (263, 0.030209722561452931), (287,
0.042405659613804568), (289, 0.055528896333028231), (291, 0.030064093091433357)]
Tags combined for above topics (from topic to tag mapping created in above step):
['money', 'brain', 'web', 'thinking', 'interesting', 'environment', 'teaching', 'web2.0', 'bio',
'finance', 'government', 'food', 'howto', 'geek', 'cool', 'articles', 'school', 'cognitive', 'cognition',
'energy', 'computerscience', '2read', 'culture', 'computer', 'video', 'home', 'todo', 'investment',
'depression', 'psychology', 'wikipedia', 'research', 'health', 'internet', 'medicine', 'electronics',
'tech', 'math', 'business', 'marketing', 'free', 'standard', 'interface', 'article', 'definition',
'anarchism', 'of', 'study', 'economics', 'programming', 'american', 'games', 'advertising', 'social',
'software', 'apple', 'coding', 'maths', 'learning', 'management', 'system', 'quiz', 'pc', 'music',
'memory', 'war', 'nutrition', 'comparison', 'india', 'info', 'science', 'dev', '@wikipedia', 'future',
'behavior', 'design', 'history', '@read', 'mind', 'hardware', 'webdev', 'politics', 'technology']
Target tags for this particular doc from tagData.xml:
['reference', 'economics', 'wikipedia', 'politics', 'reading', 'resources']
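The check in step 6 is just a set intersection. A sketch, where `topic2tags` is the mapping built in step 5 and the names are ours:

```python
def correctly_classified(lda, doc_bow, topic2tags, target_tags):
    # Collect the tags of every topic LDA assigns to the document...
    predicted = set()
    for topic_id, _prob in lda.get_document_topics(doc_bow):
        predicted.update(topic2tags.get(topic_id, []))
    # ...and count the doc as correct if even one tag matches.
    return bool(predicted & set(target_tags))
```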
Accuracy from this approach: 97%
Problems with this approach:
1. If there is any match between the found tags and the true tags, we call the document correctly classified. The probability of such a match is very high, since there are multiple found tags and multiple true tags; so even if we were doing something wrong, the chance of getting good accuracy would remain very high.
2. With tf-idf based matching, there is a high chance that the top-ranked document is not the best match for that particular topic. This can also happen because we do not use all the representative words of a topic to frame the query; we only take the top 10.
Approach 2
After analyzing the data, we found that only 25 tags cover around 19K of the 20K documents. This means we can eliminate the less frequent tags and the docs corresponding to them, so the corpus has to be divided among at most 25 topics. That makes approach 2 easier to implement, since each document can be represented in a 25-dimensional topic space. The major steps are as follows (a sketch of the core steps follows the list):
1. Eliminate the less frequent tags and the documents related to them, keeping only the top 25 tags; around 19K docs are left.
2. On the complete data, run gensim's LDA and save the learnt model; set the number of topics to 25.
3. Save all the topics in a file and convert them to queries, as in the previous approach.
4. Test each of the 19K documents against the learnt model and find its topic distribution, e.g.:
“42d1d305d10b4b025e01e8237c44c87e:0 0 0 0 0.0242823647949 0 0.037682372871 0 0 0 0.0988683434224 0.0113662521741 0.0157100377468 0 0 0.182273317591 0.205447648234 0 0.0524222798936 0.167240557357 0 0.178899361052 0 0 0” represents the probabilities of the doc with the given ID across the 25 topics.
5. Using the above distributions, find the most relevant document for each topic and map the topic to the tag of that document. This gives a topic-to-tag mapping similar to the previous approach.
6. Many topics will now be mapped to more than one tag. Manually check which tag best suits each topic, based on the words the topic contains. As a result, each topic is mapped to at most one tag.
7. Perform the testing as in step 6 of the previous approach, but on all 19K docs.
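A minimal sketch of steps 2, 4, and 5 under the same gensim setup; `corpus` and `dictionary` are the filtered 19K-doc corpus, and `doc_tags` (one tag list per document, from tagData.xml) is an assumed helper:

```python
import numpy as np
from gensim.models import LdaModel

K = 25
lda25 = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K)

# Step 4: a dense K-dimensional topic vector for every document.
vectors = np.zeros((len(corpus), K))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda25.get_document_topics(bow, minimum_probability=0.0):
        vectors[i, topic_id] = prob

# Step 5: the document that scores highest on each topic lends it its tag(s);
# ties to multiple tags are then resolved manually (step 6).
topic2tag = {}
for topic_id in range(K):
    best_doc = int(vectors[:, topic_id].argmax())
    topic2tag[topic_id] = doc_tags[best_doc]
```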
Accuracy from this approach: 88%
Problems with this approach:
1. Mapping topics to tags manually is an issue. We cannot always pick the best-suited tag just by looking at the topic words, and some tags reflect nothing about content; e.g. ‘wikipedia’, ‘wiki’, and ‘reference’ create problems.
Modification:
We performed the above experiment again with only meaningful tags, i.e. dropping tags like ‘wikipedia’, ‘wiki’, ‘reference’, etc. After eliminating those, 17K documents were left. But this posed another issue:
1. Some tags are near-synonyms that can represent the same topic at the same time, e.g. [research, science], [web, internet], [programming, math], [literature, language].
If we keep all such similar tags, accuracy is 80%; but if we strictly keep just one tag, accuracy drops to 65%.
The drop is probably due to the manual step: we cannot say for sure which tag should be kept when two tags mean the same thing.
Conclusion: The 2nd approach is better, as there is far less chance of a falsely good accuracy figure, and its accuracy is still reasonable considering only ~19K documents were used for learning. The time required for learning is around 10 minutes, and assigning topics to new text is practically instantaneous.
Thank You
IIIT Hyderabad