Wikipedia Document
Classification
A review on popular Text Classification Methods
Team 5
Mohit Sharma
201505508
Vijjini Anvesh Rao
201325059
Ravi Teja
201301047
Aim
Given a Wikipedia document, our aim is to predict the categories it may belong to, based on
training data in which each document is tagged with multiple categories.
The categories we considered are:
wiki, art, reference, people, culture, books, design, politics, technology,
psychology, interesting, wikipedia, research, religion, music, math, development,
theory, philosophy, article, language, science, programming, history and software.
Dataset
Data set used for all the experiments
We used the Wiki10+ data set from the following link:
http://nlp.uned.es/social-tagging/wiki10+/
The data set contains the following two files:
1. wiki10+_tag-data.tar.gz (3.6 MB): contains all the tag data for the Wikipedia articles.
2. wiki10+_documents.tar.bz2 (271 MB): content for all the Wikipedia articles in the
dataset, in HTML format. We extracted the text from the HTML to run the different
experiments.
As we only consider the top 25 categories, we removed the documents that do not have
at least one of these top 25 categories.
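The filtering step above can be sketched as follows; this is a minimal illustration, and the variable names and the truncated tag set are hypothetical, not taken from the project code:

```python
# Keep only documents carrying at least one of the top categories.
# TOP_TAGS is truncated for illustration; the real set has 25 entries.
TOP_TAGS = {"wiki", "art", "reference", "people", "culture"}

def filter_documents(doc_tags):
    """doc_tags: dict mapping document id -> list of tags."""
    return {doc_id: tags for doc_id, tags in doc_tags.items()
            if any(tag in TOP_TAGS for tag in tags)}

docs = {
    "a": ["art", "painting"],
    "b": ["obscure-tag"],        # dropped: no top-25 tag
    "c": ["reference", "people"],
}
kept = filter_documents(docs)
print(sorted(kept))  # ['a', 'c']
```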
Approach 1:
LDA
Latent Dirichlet Allocation for document classification
We can use LDA to classify documents into different tags. LDA divides the given corpus into a fixed number of
topics and can also tell which topics a document contains and with what probability. For the
experiments performed with LDA, we do not need to worry about LDA's internal implementation; we used
gensim’s implementation of LDA. To use the library, we only need to know a few points about the input and output
format. Read the documentation at the following link:
https://radimrehurek.com/gensim/wiki.html
During Learning phase
INPUT:
We provide all the wiki documents in a single XML file compressed in bz2 format.
LEARNT MODEL:
Word distribution for each topic, e.g.: “topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain +
0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam”
During Testing phase
INPUT:
We provide the document to be classified, in bag-of-words form, to the learnt model.
OUTPUT:
Topic distribution for the text, e.g.: “[(34, 0.023705742561150572), (60, 0.017830310671555303), (62,
0.023999239610385081), (83, 0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625),
(116, 0.022904510579689157)]” represents the probabilities of the doc falling under topics 34, 60, 62, and so on.
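Since the output is just a list of (topic id, probability) pairs, picking the most likely topics for a document is a simple sort; a sketch on the example values shown above:

```python
# Topic distribution as returned by gensim's LDA for one document
# (values copied from the example above).
doc_topics = [(34, 0.023705742561150572), (60, 0.017830310671555303),
              (62, 0.023999239610385081), (83, 0.029439444128473557),
              (87, 0.028172479800878891), (90, 0.1207424163376625),
              (116, 0.022904510579689157)]

# Rank topics by probability, highest first.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
top_topic, top_prob = ranked[0]
print(top_topic)  # 90
```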
Major challenge in classification:
It seems fairly simple to classify a document into different topics, as the output of the testing phase shows.
But our aim is to classify the document under tags like “politics” or “science”, not under topic numbers.
Possible Solutions
Clearly we need some way to map each topic learnt by LDA to the most suitable tag. If we can do this,
we simply test the unknown text against the model learnt by LDA and report the tag corresponding to
the topic LDA gives in its output. We tried two different solutions to map topics to tags:
1. Since each LDA topic is represented by a distribution over words, we can create a query by combining those
words and find the best-matched document for that query on a tf-idf basis. That document should be
the best match for the topic, so we can map the topic to the tag of the best-matched document.
2. We can find the probability distribution over topics for all the documents and represent each document as a topic
vector. We then find the closest (most similar) document for each topic and map the topic to the
tag of that particular document.
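Solution 2 can be sketched as follows, in a toy illustration with 3 topics and made-up topic vectors (the real setup used many more topics): each topic corresponds to an axis in topic space, and we pick the document whose topic vector is most similar to that axis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def map_topics_to_docs(doc_vectors, n_topics):
    """For each topic, return the document closest to that topic's axis."""
    mapping = {}
    for topic in range(n_topics):
        axis = [1.0 if i == topic else 0.0 for i in range(n_topics)]
        mapping[topic] = max(doc_vectors,
                             key=lambda doc: cosine(doc_vectors[doc], axis))
    return mapping

doc_vectors = {          # made-up topic distributions
    "doc_politics": [0.8, 0.1, 0.1],
    "doc_science":  [0.1, 0.7, 0.2],
    "doc_music":    [0.0, 0.2, 0.8],
}
mapping = map_topics_to_docs(doc_vectors, 3)
print(mapping)  # {0: 'doc_politics', 1: 'doc_science', 2: 'doc_music'}
```

Each topic then inherits the tags of its closest document.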
Approach 1
We can specify the major steps to implement this approach as follows:
1. Divide the documents into training and test data, with 4,000 docs in the test set.
2. On the training data, run gensim's LDA and save the learnt model. Set the number of topics to 300.
3. Save all the topics in a file and convert them to queries.
Example topic:
2016-04-06 00:05:52,466 : INFO : topic #299 (0.003): 0.014*insurance + 0.009*scott + 0.007*samurai +
0.007*hipster + 0.006*forecasting + 0.006*fbi + 0.006*imf +
0.005*skeptical + 0.005*bass + 0.005*hidden
Query corresponding to above topic#299:
299:insurance scott samurai hipster forecasting fbi imf skeptical bass hidden
4. For each query, retrieve the most relevant document in the training set on a tf-idf basis and create a topic-to-docId
mapping.
Example:
299:cae3757420fbc4008bbfe492ab0d4cb5
5. Create a topic-to-tag mapping using the docId-to-tag mapping (already available in tagData.xml) and the docId-to-topic
mapping created in the step above.
Example docId to tag from tagData.xml:
cae3757420fbc4008bbfe492ab0d4cb5 : ['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free',
'person', 'music', 'encyclopedia', 'the', 'biography', 'history']
Example topic to docId:
299:cae3757420fbc4008bbfe492ab0d4cb5
Example topic to tag:
299:['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free', 'person', 'music', 'encyclopedia', 'the',
'biography', 'history']
Now each topic is mapped to multiple tags.
6. For each of the test documents (the 4,000 docs in test data), find the relevant topics using the learnt LDA
model. Combine the tags corresponding to those topics and match them against the target tags already
available (from tagData.xml) for that particular document.
If even one tag matches, we say the document is correctly classified.
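The evaluation criterion in step 6 (a document counts as correct if at least one predicted tag appears among its target tags) can be sketched as:

```python
def correctly_classified(found_tags, target_tags):
    """A document counts as correct if any found tag is also a target tag."""
    return bool(set(found_tags) & set(target_tags))

def accuracy(all_found, all_targets):
    hits = sum(correctly_classified(f, t)
               for f, t in zip(all_found, all_targets))
    return hits / len(all_found)

# Toy example: documents 1 and 3 share a tag with their targets, 2 does not.
found   = [["politics", "economics"], ["music"], ["math"]]
targets = [["reference", "economics"], ["art"], ["math", "science"]]
acc = accuracy(found, targets)
print(acc)  # 2 of 3 documents match
```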
Example:
Topic distribution returned by LDA for a particular doc:
[(34, 0.023705742561150572), (60, 0.017830310671555303), (62, 0.023999239610385081), (83,
0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625), (116,
0.022904510579689157), (149, 0.010136256627631658), (155, 0.045428499528247894), (162,
0.014294122339773195), (192, 0.01315170635603234), (193, 0.055764500858303222), (206,
0.015174121956574787), (240, 0.052498569359746373), (243, 0.016285345117555323), (247,
0.019478047862044864), (255, 0.018193391082926114), (263, 0.030209722561452931), (287,
0.042405659613804568), (289, 0.055528896333028231), (291, 0.030064093091433357)]
Tags combined for above topics (from topic to tag mapping created in above step):
['money', 'brain', 'web', 'thinking', 'interesting', 'environment', 'teaching', 'web2.0', 'bio',
'finance', 'government', 'food', 'howto', 'geek', 'cool', 'articles', 'school', 'cognitive', 'cognition',
'energy', 'computerscience', '2read', 'culture', 'computer', 'video', 'home', 'todo', 'investment',
'depression', 'psychology', 'wikipedia', 'research', 'health', 'internet', 'medicine', 'electronics',
'tech', 'math', 'business', 'marketing', 'free', 'standard', 'interface', 'article', 'definition',
'anarchism', 'of', 'study', 'economics', 'programming', 'american', 'games', 'advertising', 'social',
'software', 'apple', 'coding', 'maths', 'learning', 'management', 'system', 'quiz', 'pc', 'music',
'memory', 'war', 'nutrition', 'comparison', 'india', 'info', 'science', 'dev', '@wikipedia', 'future',
'behavior', 'design', 'history', '@read', 'mind', 'hardware', 'webdev', 'politics', 'technology']
Target tags for this particular doc from tagData.xml:
['reference', 'economics', 'wikipedia', 'politics', 'reading', 'resources']
Accuracy from this approach: 97%
Problems with this approach:
1. If there is any match between the found tags and the true tags, we call the document correctly classified. The
probability of such a match is very high, because we have multiple found tags and multiple true tags. So even if
we are doing something wrong, the chances of getting good accuracy are very high.
2. As we are doing tf-idf based matching, there is a high chance that the top document we get is not the best
match for that particular topic. This can also happen because we do not consider all the representative
words of a topic when framing the query; we only considered the top 10.
Approach 2
After analyzing the data we found that only 25 of the tags cover around 19K documents out of 20K. This
means we can eliminate the less frequent tags and the docs corresponding to them, so we
have to divide the corpus among at most 25 topics. That makes approach 2 easier to implement, as each
document can easily be represented in a 25-dimensional topic space. The major steps to
implement this approach are as follows:
1. Eliminate the less frequent tags and the documents related to them; keep only the top 25. Around 19K
docs are left.
2. On the complete data, run gensim's LDA and save the learnt model. Set the number of topics to 25.
3. Save all the topics in a file and convert them to queries as in previous approach.
4. Test each of the 19K documents against the learnt model and find the topic distribution, e.g.:
“42d1d305d10b4b025e01e8237c44c87e:0 0 0 0 0.0242823647949 0 0.037682372871 0 0 0 0.0988683434224
0.0113662521741 0.0157100377468 0 0 0.182273317591 0.205447648234 0 0.0524222798936
0.167240557357 0 0.178899361052 0 0 0” represents the probabilities of the doc with the given id over the 25
topics.
5. Using the above distribution, find the most relevant document for each topic and map the topic to the tag of
that document. This gives a topic-to-tag mapping similar to the previous approach.
6. Many topics will have been matched to more than one tag. Manually check which tag best suits each
topic, based on the words the topic contains. As a result, each topic is mapped to at
most one tag.
7. Now perform the testing as done in step 6 of the previous approach, but on all 19K docs.
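Step 5 can be sketched as follows, in a toy example with 3 topics; the distribution strings follow the “docid:p0 p1 …” format shown in step 4:

```python
# Parse per-document topic distributions and map each topic to the
# document that gives it the highest probability.
lines = [
    "docA:0.1 0.7 0.2",
    "docB:0.6 0.2 0.2",
    "docC:0.2 0.1 0.7",
]

dist = {}
for line in lines:
    doc_id, probs = line.split(":")
    dist[doc_id] = [float(p) for p in probs.split()]

n_topics = 3
topic_to_doc = {t: max(dist, key=lambda d: dist[d][t])
                for t in range(n_topics)}
print(topic_to_doc)  # {0: 'docB', 1: 'docA', 2: 'docC'}
```

Each topic then takes the tag of its mapped document.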
Accuracy from this approach: 88%
Problem with this approach:
1. Mapping topics to tags manually is an issue. We cannot always find the best-suited tag just by looking at the
topic words. Some tags do not reflect anything, e.g. ‘wikipedia’, ‘wiki’ and ‘reference’ create problems.
Modification:
We performed the above experiment again with only meaningful tags, i.e. no tags like ‘wikipedia’, ‘wiki’ or
‘reference’. After eliminating these, around 17K documents were left. But the approach posed another issue:
1. There are similar tags which can represent a topic at the same time, e.g. [research, science], [web, internet],
[programming, math], [literature, language].
If we keep all such similar tags the accuracy is 80%, but if we strictly keep just one tag the accuracy drops
to 65%.
The reason for the drop is probably the manual work: we cannot say for sure which tag should be kept when both
tags mean the same.
Conclusion: the 2nd approach is better, as there is much less chance of falsely good accuracy, and the accuracy is
not bad considering only ~19K documents for learning.
Approach 2:
tf-idf vectors
tf-idf Feature vector
Reflects how important a word is to a document in a collection or corpus.
The tf-idf value increases proportionally with the number of times a word appears in the document.
idf weighting scheme: inverse document frequency
tf weighting scheme: log normalization
Each vector could be as large as the vocabulary; instead we only took the top
500 rarest words, restricting the maximum number of features to 500.
Time taken to build the matrix: around 5 minutes.
Generates relative vectors, i.e. the vector for one document depends on
the vectors of the other documents, so given one new document,
the vectors for all documents have to be updated.
As part of preprocessing: case folding, stemming and stop-word removal were done
(sklearn has inbuilt stop-word removal).
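A from-scratch sketch of this weighting scheme (log-normalized tf, standard idf, vocabulary restricted to the rarest max_features terms). The function and toy corpus are illustrative only; the actual experiments used sklearn's implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_features=500):
    """docs: list of tokenized documents. Returns (vocab, vectors)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    # Keep the rarest terms (lowest document frequency), as described above.
    vocab = sorted(df, key=lambda w: (df[w], w))[:max_features]
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Log-normalized tf: 1 + log(count) when the term occurs, else 0.
        vectors.append([(1 + math.log(tf[w])) * idf[w] if tf[w] else 0.0
                        for w in vocab])
    return vocab, vectors

docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "ran"]]
vocab, vecs = tfidf_vectors(docs, max_features=4)
print(vocab)  # the 4 rarest terms
```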
Approach 3:
Doc2Vec
Gensim Doc2Vec
Using the Distributed Memory algorithm.
Has the potential to overcome many weaknesses of bag-of-words models.
The vector representations are learned to predict the surrounding words in contexts
sampled from the document.
For example, for our 20,000 documents, training takes around 2 hours and
generates a 4 GB model which stores the mapping between each document and a
300-dimensional vector (size chosen arbitrarily).
In other words, training the model itself is the document feature extraction,
which makes it an even worse relative feature extraction: even for one more test
document, the entire 2 hours of training needs to be rerun.
As part of preprocessing: case folding, stemming and stop-word removal were done.
Approach 4:
Word2Vec
Gensim Word2Vec
A word embedding based on a co-occurrence matrix, i.e. an explicit representation
of the contexts in which words appear.
We used a pre-trained 4.6 GB model trained on an English news corpus.
Preprocessing was done to:
remove stop words;
remove words not present in the model, as no vector is available for them.
No stemming or case folding, as that is how the model was originally trained, i.e.
vector(“computer”) ≠ vector(“Computer”) ≠ vector(“Compute”) ≠ vector(“compute”)
For the document vector, we averaged the 100-dimensional vectors of the words left in the
document after preprocessing.
Even though training the model is already done, extracting vectors for the roughly 4,000
words in each of the 20,000 documents takes a lot of time: around 5 hours.
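The averaging step might look like this; the embedding dict is a tiny stand-in for the pretrained model, with 2-dimensional vectors standing in for the 100-dimensional ones:

```python
def document_vector(words, embeddings):
    """Average the vectors of the words found in the embedding model;
    out-of-vocabulary words are skipped, as no vector exists for them."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

embeddings = {                  # stand-in for the pretrained word2vec model
    "computer": [1.0, 0.0],
    "science":  [0.0, 1.0],
}
vec = document_vector(["computer", "science", "unknown-word"], embeddings)
print(vec)  # [0.5, 0.5]
```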
Classification, Results and Comparison
Testing and Training
Before doing any feature extraction,
documents were randomly shuffled and 5,000 documents randomly taken out as
testing data; the remaining ~15,000 were used for training.
An assertion was made that every category in testing occurs at least once in training.
For all feature extraction the same sets of documents were used, with no re-shuffling, so that we
can compare models:
all feature extraction models were trained on the same training data and tested against the same
testing data.
Classifier used: SVM with SGD training
Motivation:
A separate independent classifier for each label (one classifier per tag),
with vectors of 100 to 1000 in length.
Other classifiers take too much time, whereas Stochastic Gradient Descent is very
fast.
Sklearn’s Implementation:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
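A minimal pure-Python sketch of this setup: one independent linear classifier per label, trained with SGD on the hinge loss (an SVM-style objective, without regularization). It is a stand-in for sklearn's SGDClassifier in a one-vs-rest scheme, on a made-up 2-feature toy problem; the real experiments used the library:

```python
def train_binary_sgd(X, y, epochs=50, lr=0.1):
    """Train one linear classifier with SGD on the hinge loss.
    y contains -1/+1 labels for a single tag."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:               # hinge-loss subgradient step
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# One classifier per label (one-vs-rest) on toy 2-feature data:
# feature 0 high -> "politics", feature 1 high -> "science".
X = [[2.0, 0.1], [1.5, 0.2], [0.1, 1.8], [0.2, 2.2]]
labels = {"politics": [1, 1, -1, -1], "science": [-1, -1, 1, 1]}
models = {tag: train_binary_sgd(X, y) for tag, y in labels.items()}

pred = {tag: predict(*models[tag], [1.8, 0.0]) for tag in models}
print(pred)
```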
Accuracy Results
Review
Note that our data is quite imbalanced: the green line “Maxclass” represents the
accuracy when all vectors in testing are assigned the maximum-occurring label from
training. As is evident, its value is almost always above 80%.
1. The topic-distribution vectors generated by LDA seem to be as bad as the Maxclass
assignment when applied to the classifier (hence they overlap in the figure).
However, LDA training might be the fastest of all.
2. It is difficult to compare doc2vec and tf-idf, but doc2vec performs better than
word2vec. It is also faster than word2vec when it comes to generating the 20,000
vectors.
3. Word2vec did not perform well and also took quite some time for vector extraction
over all the documents; its only advantage is that feature extraction for one
document does not affect the other vectors, hence less time for one test document.
4. tf-idf heavily depends on the number of feature words we consider. In this
experiment we took 500 words; increasing the size may increase accuracy,
but it will also take more time and space to store.
5. LDA performs well if we look at the accuracy, but we can't guarantee good
Thank You
IIIT Hyderabad

More Related Content

Similar to Wikipedia Document Classification

Smai Project: Topic Modelling
Smai Project: Topic ModellingSmai Project: Topic Modelling
Smai Project: Topic ModellingMohit Sharma
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Zide Meng
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Gabriel Moreira
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsGabriel Moreira
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLPRupak Roy
 
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...IEEEMEMTECHSTUDENTPROJECTS
 
Objective in this milestone, we will· analyze the sample dataset
Objective in this milestone, we will· analyze the sample datasetObjective in this milestone, we will· analyze the sample dataset
Objective in this milestone, we will· analyze the sample datasetJUST36
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshopCesToronto
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Ahmed Saleh
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptxWidsoulDevil
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...Angelo Salatino
 
Research Methodology (how to choose Datasets ).pptx
Research Methodology (how to choose Datasets ).pptxResearch Methodology (how to choose Datasets ).pptx
Research Methodology (how to choose Datasets ).pptxZainab Alhassani
 
Research data management : [part of] PROOF course Finding and controlling sci...
Research data management : [part of] PROOF course Finding and controlling sci...Research data management : [part of] PROOF course Finding and controlling sci...
Research data management : [part of] PROOF course Finding and controlling sci...Leon Osinski
 

Similar to Wikipedia Document Classification (20)

Smai Project: Topic Modelling
Smai Project: Topic ModellingSmai Project: Topic Modelling
Smai Project: Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Meow Hagedorn
Meow HagedornMeow Hagedorn
Meow Hagedorn
 
Bosch, Wackerow: Linked data on the web
Bosch, Wackerow: Linked data on the web Bosch, Wackerow: Linked data on the web
Bosch, Wackerow: Linked data on the web
 
2013.05 - LDOW 2013 @ WWW 2013
2013.05 - LDOW 2013 @ WWW 20132013.05 - LDOW 2013 @ WWW 2013
2013.05 - LDOW 2013 @ WWW 2013
 
Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...Temporal and semantic analysis of richly typed social networks from user-gene...
Temporal and semantic analysis of richly typed social networks from user-gene...
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine...
 
Discovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender SystemsDiscovering User's Topics of Interest in Recommender Systems
Discovering User's Topics of Interest in Recommender Systems
 
Topic Modeling - NLP
Topic Modeling - NLPTopic Modeling - NLP
Topic Modeling - NLP
 
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...
IEEE 2014 DOTNET CLOUD COMPUTING PROJECTS A scientometric analysis of cloud c...
 
Objective in this milestone, we will· analyze the sample dataset
Objective in this milestone, we will· analyze the sample datasetObjective in this milestone, we will· analyze the sample dataset
Objective in this milestone, we will· analyze the sample dataset
 
Qda ces 2013 toronto workshop
Qda ces 2013 toronto workshopQda ces 2013 toronto workshop
Qda ces 2013 toronto workshop
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptx
 
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
 
Research Methodology (how to choose Datasets ).pptx
Research Methodology (how to choose Datasets ).pptxResearch Methodology (how to choose Datasets ).pptx
Research Methodology (how to choose Datasets ).pptx
 
Research data management : [part of] PROOF course Finding and controlling sci...
Research data management : [part of] PROOF course Finding and controlling sci...Research data management : [part of] PROOF course Finding and controlling sci...
Research data management : [part of] PROOF course Finding and controlling sci...
 

Recently uploaded

Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjStl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjMohammed Sikander
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxLimon Prince
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptNishitharanjan Rout
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppCeline George
 
An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppCeline George
 
Observing-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxObserving-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxAdelaideRefugio
 
Graduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxGraduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxneillewis46
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文中 央社
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...Nguyen Thanh Tu Collection
 
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUMDEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUMELOISARIVERA8
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project researchCaitlinCummins3
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxCeline George
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfPondicherry University
 
MOOD STABLIZERS DRUGS.pptx
MOOD     STABLIZERS           DRUGS.pptxMOOD     STABLIZERS           DRUGS.pptx
MOOD STABLIZERS DRUGS.pptxPoojaSen20
 
PSYPACT- Practicing Over State Lines May 2024.pptx
PSYPACT- Practicing Over State Lines May 2024.pptxPSYPACT- Practicing Over State Lines May 2024.pptx
PSYPACT- Practicing Over State Lines May 2024.pptxMarlene Maheu
 
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaEADTU
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17Celine George
 

Recently uploaded (20)

Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjjStl Algorithms in C++ jjjjjjjjjjjjjjjjjj
Stl Algorithms in C++ jjjjjjjjjjjjjjjjjj
 
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptxAnalyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
Analyzing and resolving a communication crisis in Dhaka textiles LTD.pptx
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
Improved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio AppImproved Approval Flow in Odoo 17 Studio App
Improved Approval Flow in Odoo 17 Studio App
 
An Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge AppAn Overview of the Odoo 17 Knowledge App
An Overview of the Odoo 17 Knowledge App
 
Observing-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptxObserving-Correct-Grammar-in-Making-Definitions.pptx
Observing-Correct-Grammar-in-Making-Definitions.pptx
 
Graduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptxGraduate Outcomes Presentation Slides - English (v3).pptx
Graduate Outcomes Presentation Slides - English (v3).pptx
 
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文會考英文
 
Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"Mattingly "AI & Prompt Design: Named Entity Recognition"
Mattingly "AI & Prompt Design: Named Entity Recognition"
 
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
TỔNG HỢP HƠN 100 ĐỀ THI THỬ TỐT NGHIỆP THPT TOÁN 2024 - TỪ CÁC TRƯỜNG, TRƯỜNG...
 
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUMDEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
DEMONSTRATION LESSON IN ENGLISH 4 MATATAG CURRICULUM
 
SURVEY I created for uni project research
SURVEY I created for uni project researchSURVEY I created for uni project research
SURVEY I created for uni project research
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 
How to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptxHow to Manage Website in Odoo 17 Studio App.pptx
How to Manage Website in Odoo 17 Studio App.pptx
 
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdfFICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
FICTIONAL SALESMAN/SALESMAN SNSW 2024.pdf
 
MOOD STABLIZERS DRUGS.pptx
MOOD     STABLIZERS           DRUGS.pptxMOOD     STABLIZERS           DRUGS.pptx
MOOD STABLIZERS DRUGS.pptx
 
PSYPACT- Practicing Over State Lines May 2024.pptx
PSYPACT- Practicing Over State Lines May 2024.pptxPSYPACT- Practicing Over State Lines May 2024.pptx
PSYPACT- Practicing Over State Lines May 2024.pptx
 
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes GuàrdiaPersonalisation of Education by AI and Big Data - Lourdes Guàrdia
Personalisation of Education by AI and Big Data - Lourdes Guàrdia
 
How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17How To Create Editable Tree View in Odoo 17
How To Create Editable Tree View in Odoo 17
 

Wikipedia Document Classification

  • 1. Wikipedia Document Classification A review on popular Text Classification Methods Team 5 Mohit Sharma 201505508 Vijjini Anvesh Rao 201325059 Ravi Teja 201301047
  • 2. Aim Given a Wikipedia Document our aim is to say the Categories it may belong to, based on a Training data in which each Document is tagged to multiple Categories, The Categories we considered are: wiki, art, reference, people, culture, books, design, politics, technology, psychology, interesting, wikipedia, research, religion, music, math, development, theory, philosophy, article, language, science, programming, history and software.
  • 4. Data set used for all the experiments We used the Wiki10+ data set from the following link: http://nlp.uned.es/social-tagging/wiki10+/ The data set contains the following two files: 1. wiki10+_tag-data.tar.gz (3.6 MB): contains all the tag data for the Wikipedia articles. 2. wiki10+_documents.tar.bz2 (271 MB): content for all the Wikipedia articles in the dataset, in HTML format. We extracted the text from the HTML to run the different experiments. As we only consider the top 25 categories, we removed those documents that don't have even one of these top 25 categories.
  • 6. Latent Dirichlet Allocation for document classification We can use LDA to classify documents under different tags. LDA divides the given corpus into a fixed number of topics and can also tell which topics a document contains and with what probability. For the experiments performed using LDA, we don't need to worry about the internal implementation of LDA; we used gensim's implementation. To use the library, we just need to know a few points about the input and output formats. Read the documentation at the following link: https://radimrehurek.com/gensim/wiki.html During the learning phase INPUT: We provide all the wiki documents in a single XML file zipped in bz2 format. LEARNT MODEL: A word distribution for each topic, e.g.: “topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam”
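The bag-of-words representation that LDA consumes can be illustrated with a minimal stdlib sketch (the function names and tiny corpus are ours, not gensim's; gensim's `Dictionary` and `doc2bow` do the equivalent job):

```python
from collections import Counter

def build_dictionary(docs):
    # Assign an integer id to every word in the corpus,
    # analogous to gensim's Dictionary.
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def to_bow(doc, dictionary):
    # Convert a tokenised document into the (word_id, count)
    # pairs that LDA takes as input.
    counts = Counter(w for w in doc if w in dictionary)
    return sorted((dictionary[w], c) for w, c in counts.items())

docs = [["river", "lake", "river"], ["island", "lake"]]
d = build_dictionary(docs)
print(to_bow(docs[0], d))  # each pair is (word id, occurrence count)
```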
  • 7. Latent Dirichlet Allocation for document classification During the testing phase INPUT: We provide the document to be classified, in bag-of-words form, to the learnt model. OUTPUT: Topic distribution for the text, e.g.: “[(34, 0.023705742561150572), (60, 0.017830310671555303), (62, 0.023999239610385081), (83, 0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625), (116, 0.022904510579689157)]” represents the probabilities that the document falls under topics 34, 60, 62 and so on. Major challenge in classification: It seems fairly simple to classify a document into different topics, as we can see in the output of the testing phase. But our aim is to classify the document under tags like “politics”, “science” etc., not under topic numbers.
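Given such a topic distribution, picking out the dominant topics is a simple sort; a minimal sketch (the cutoff `k` is our own illustrative choice):

```python
def top_topics(distribution, k=3):
    # Return the ids of the k highest-probability topics
    # from a list of (topic_id, probability) pairs.
    return [t for t, _ in sorted(distribution, key=lambda p: p[1], reverse=True)[:k]]

dist = [(34, 0.0237), (60, 0.0178), (90, 0.1207), (155, 0.0454)]
print(top_topics(dist))  # topic 90 dominates this distribution
```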
  • 8. Latent Dirichlet Allocation for document classification Possible solutions Clearly we need some way to map the topics learnt by LDA to the most suitable tags. If we can do this, then we simply test the unknown text against the model learnt by LDA and report the tag corresponding to the topic given by LDA in the output. We tried two different solutions to map topics to tags: 1. As each LDA topic is represented by a distribution of words, we can create a query by combining those words and find the best-matched document for that query on a tf-idf basis. That document should be the best match for the topic, so we can map the topic to the tags of the best-matched document. 2. We can find the probability distribution over topics for all the documents and represent each document as a topic vector. Then find the closest, i.e. most similar, document for each topic and map the topic to the tags of that document.
  • 9. Latent Dirichlet Allocation for document classification Approach 1 The major steps of this approach are as follows:
1. Divide the documents into training and test data, with 4000 docs in the test data.
2. Run gensim's LDA on the training data and save the learnt model. Set the number of topics to 300.
3. Save all the topics in a file and convert them to queries. Example topic: 2016-04-06 00:05:52,466 : INFO : topic #299 (0.003): 0.014*insurance + 0.009*scott + 0.007*samurai + 0.007*hipster + 0.006*forecasting + 0.006*fbi + 0.006*imf + 0.005*skeptical + 0.005*bass + 0.005*hidden Query corresponding to topic #299 above: 299:insurance scott samurai hipster forecasting fbi imf skeptical bass hidden
4. For each query, retrieve the most relevant document in the training set on a tf-idf basis and create a topic to doc ID mapping. Example: 299:cae3757420fbc4008bbfe492ab0d4cb5
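Step 3 above (turning a learnt topic into a query) can be sketched in plain Python; the parsing assumes the `prob*word + prob*word + ...` topic string format shown in the example:

```python
def topic_to_query(topic_str):
    # Extract just the words from a "p1*w1 + p2*w2 + ..." topic string,
    # dropping the probabilities, to form a tf-idf query.
    words = [term.split("*")[1].strip() for term in topic_str.split("+")]
    return " ".join(words)

topic = "0.014*insurance + 0.009*scott + 0.007*samurai + 0.007*hipster"
print(topic_to_query(topic))  # a space-separated query of the topic's top words
```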
  • 10. Latent Dirichlet Allocation for document classification
5. Create a topic to tag mapping using the doc ID to tag mapping (already available in tagData.xml) and the topic to doc ID mapping created in the previous step. Example doc ID to tags from tagData.xml: cae3757420fbc4008bbfe492ab0d4cb5 : ['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free', 'person', 'music', 'encyclopedia', 'the', 'biography', 'history'] Example topic to doc ID: 299:cae3757420fbc4008bbfe492ab0d4cb5 Example topic to tags: 299:['wiki', 'en', 'wikipedia,', 'activism', '-', 'political', 'poetry', 'free', 'person', 'music', 'encyclopedia', 'the', 'biography', 'history'] Now each topic is mapped to multiple tags.
6. For each of the test documents (the 4000 docs in the test data), find the relevant topics using the learnt LDA model. Combine the tags corresponding to them and match them against the already available target tags (from tagData.xml) for that particular document. If even one tag matches, we say the document is correctly classified.
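The matching criterion in step 6 (any overlap between found tags and target tags counts as correct) can be sketched as follows; the sample tags are illustrative, not from the dataset:

```python
def is_correct(predicted_tags, target_tags):
    # A document counts as correctly classified if at least
    # one predicted tag appears among the target tags.
    return bool(set(predicted_tags) & set(target_tags))

def accuracy(predictions, targets):
    # Fraction of documents with at least one matching tag.
    hits = sum(is_correct(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)

preds = [["politics", "web"], ["music"]]
truth = [["reference", "politics"], ["science"]]
print(accuracy(preds, truth))  # first doc overlaps on "politics", second doesn't
```

With many predicted tags and many target tags per document, an overlap is very likely by chance, which is exactly the weakness discussed on the next slide.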
  • 11. Latent Dirichlet Allocation for document classification Example: Topic distribution returned by LDA for a particular doc: [(34, 0.023705742561150572), (60, 0.017830310671555303), (62, 0.023999239610385081), (83, 0.029439444128473557), (87, 0.028172479800878891), (90, 0.1207424163376625), (116, 0.022904510579689157), (149, 0.010136256627631658), (155, 0.045428499528247894), (162, 0.014294122339773195), (192, 0.01315170635603234), (193, 0.055764500858303222), (206, 0.015174121956574787), (240, 0.052498569359746373), (243, 0.016285345117555323), (247, 0.019478047862044864), (255, 0.018193391082926114), (263, 0.030209722561452931), (287, 0.042405659613804568), (289, 0.055528896333028231), (291, 0.030064093091433357)] Tags combined for the above topics (from the topic to tag mapping created in the previous step): ['money', 'brain', 'web', 'thinking', 'interesting', 'environment', 'teaching', 'web2.0', 'bio', 'finance', 'government', 'food', 'howto', 'geek', 'cool', 'articles', 'school', 'cognitive', 'cognition', 'energy', 'computerscience', '2read', 'culture', 'computer', 'video', 'home', 'todo', 'investment', 'depression', 'psychology', 'wikipedia', 'research', 'health', 'internet', 'medicine', 'electronics', 'tech', 'math', 'business', 'marketing', 'free', 'standard', 'interface', 'article', 'definition', 'anarchism', 'of', 'study', 'economics', 'programming', 'american', 'games', 'advertising', 'social', 'software', 'apple', 'coding', 'maths', 'learning', 'management', 'system', 'quiz', 'pc', 'music', 'memory', 'war', 'nutrition', 'comparison', 'india', 'info', 'science', 'dev', '@wikipedia', 'future', 'behavior', 'design', 'history', '@read', 'mind', 'hardware', 'webdev', 'politics', 'technology']
  • 12. Latent Dirichlet Allocation for document classification Target tags for this particular doc from tagData.xml: ['reference', 'economics', 'wikipedia', 'politics', 'reading', 'resources'] Accuracy from this approach: 97% Problems with this approach: 1. If there is any match between the found tags and the true tags, we call the document correctly classified. The probability of such a scenario is very high, since we have multiple found tags and multiple true tags. So even if we are doing something wrong, the chance of getting good accuracy is very high. 2. As we are doing tf-idf based matching, there is a high chance that the top document we retrieve is not the best match for that particular topic. This can also happen because we are not considering all the representative words of a topic to frame the query; we considered just the top 10.
  • 13. Latent Dirichlet Allocation for document classification Approach 2 After analyzing the data we found that only 25 of the tags cover around 19K of the 20K documents. This means we can eliminate the less frequent tags and the documents corresponding to them, so we only have to divide the corpus among at most 25 topics. That makes approach 2 easier to implement, as each document can then be represented in a 25-dimensional topic space. The major steps of this approach are as follows:
1. Eliminate the less frequent tags and the documents related to them, keeping only the top 25 tags. Around 19K docs are left.
2. Run gensim's LDA on the complete data and save the learnt model. Set the number of topics to 25.
3. Save all the topics in a file and convert them to queries as in the previous approach.
4. Test each of the 19K documents against the learnt model and find the topic distribution, e.g.: “42d1d305d10b4b025e01e8237c44c87e:0 0 0 0 0.0242823647949 0 0.037682372871 0 0 0 0.0988683434224 0.0113662521741 0.0157100377468 0 0 0.182273317591 0.205447648234 0 0.0524222798936 0.167240557357 0 0.178899361052 0 0 0” represents the probabilities of the doc with the given ID for the 25 different topics.
5. Using the above distribution, find the most relevant document for each topic and map the topic to the tags of that document. This gives a topic to tag mapping similar to the previous approach.
6. Many topics will have matched more than one tag. Manually check which tag best suits each such topic, based on the words it contains, so that each topic is mapped to at most one tag.
7. Perform the testing as in step 6 of the previous approach, but on all 19K docs.
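Step 5 above (mapping each topic to its most representative document) is an argmax over the per-document topic probabilities; a minimal sketch with illustrative 3-topic data in place of the real 25-dimensional vectors:

```python
def best_doc_per_topic(doc_topic):
    # For each topic index, return the doc id whose distribution
    # puts the highest probability on that topic.
    n_topics = len(next(iter(doc_topic.values())))
    return {t: max(doc_topic, key=lambda d: doc_topic[d][t]) for t in range(n_topics)}

doc_topic = {
    "doc_a": [0.8, 0.1, 0.1],
    "doc_b": [0.2, 0.7, 0.1],
    "doc_c": [0.1, 0.2, 0.7],
}
print(best_doc_per_topic(doc_topic))  # each topic paired with its closest document
```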
  • 14. Latent Dirichlet Allocation for document classification Accuracy from this approach: 88% Problems with this approach: 1. Mapping topics to tags manually is an issue: we can't always tell the best-suited tag just by looking at the topic words, and some tags don't reflect any content, e.g. 'wikipedia', 'wiki' and 'reference' create problems. Modification: We performed the above experiment again with only meaningful tags, i.e. no tags like 'wikipedia', 'wiki', 'reference' etc. After eliminating these, around 17K documents were left. But this posed another issue: there are similar tags which can represent a topic at the same time, e.g. [research, science], [web, internet], [programming, math], [literature, language]. If we keep all such similar tags the accuracy is 80%, but if we strictly keep just one tag the accuracy drops to 65%. The drop is probably due to the manual mapping: we can't say for sure which tag should be kept when both tags mean the same thing. Conclusion: The 2nd approach is better, as there is much less chance of a falsely good accuracy, and the accuracy is also not bad considering only ~19K documents for learning.
  • 16. tf-idf Feature vector Reflects how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally with the number of times a word appears in the document. idf weighting scheme: inverse document frequency. tf weighting scheme: log normalization. Each vector could be at most the size of the vocabulary; instead we took only the top 500 words (the rarest ones), restricting the maximum number of features to 500. Time taken to build the matrix: around 5 minutes. Generates relative vectors, i.e. the vector of one document depends on the vectors of the other documents, so given one new document, the vectors for all documents have to be updated. As part of preprocessing, case folding, stemming and stopword removal were done (sklearn has inbuilt stop-word removal).
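The weighting described above (log-normalized tf times inverse document frequency) can be sketched in plain Python; note that sklearn's TfidfVectorizer differs in smoothing and normalisation details:

```python
import math
from collections import Counter

def tfidf(doc, corpus):
    # log-normalised tf * idf for each word in `doc`,
    # given the full corpus as a list of token lists.
    n = len(corpus)
    counts = Counter(doc)
    weights = {}
    for word, c in counts.items():
        tf = 1 + math.log(c)                      # log normalisation
        df = sum(1 for d in corpus if word in d)  # document frequency
        idf = math.log(n / df)                    # inverse document frequency
        weights[word] = tf * idf
    return weights

corpus = [["river", "lake"], ["river", "island"], ["music", "art"]]
w = tfidf(corpus[0], corpus)
print(w["lake"] > w["river"])  # "lake" is rarer, so it gets the higher weight
```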
  • 18. Gensim Doc2Vec Uses the Distributed Memory algorithm. Has the potential to overcome many weaknesses of bag-of-words models. The vector representations are learned to predict the surrounding words in contexts sampled from the document. For example, for our 20,000 documents, training takes around 2 hours and generates a 4 GB model which stores the mapping between each document and a 300-dimensional vector (size chosen arbitrarily); in other words, training the model itself is the document feature extraction. Because of this it is an even worse relative feature extraction: even for one additional test document, the entire 2-hour training needs to be re-run. As part of preprocessing, case folding, stemming and stopword removal were done.
  • 20. Gensim Word2Vec A word embedding based on a co-occurrence matrix, i.e. an explicit representation in terms of the contexts in which words appear. We used a pre-trained 4.6 GB model trained on an English news corpus. Preprocessing was done to remove stop words, and words not in the model were removed as no vector is available for them. No stemming or case folding, as that is how the model was originally trained, i.e. vector(“computer”) ≠ vector(“Computer”) ≠ vector(“Compute”) ≠ vector(“compute”). For the document vector, we average the 100-dimensional vectors of the words left in the document after preprocessing. Even though the model is already trained, extracting word vectors (around 4000 per document) for each of the 20,000 documents takes a lot of time: around 5 hours.
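The document-vector construction described above (averaging the vectors of in-vocabulary, non-stopword tokens) can be sketched as follows; the tiny 3-dimensional vectors stand in for the model's 100-dimensional ones:

```python
def doc_vector(tokens, model, stop_words=frozenset()):
    # Average the word vectors of tokens found in the model,
    # skipping stop words and out-of-vocabulary words.
    vecs = [model[t] for t in tokens if t in model and t not in stop_words]
    if not vecs:
        return None  # no in-vocabulary words left after preprocessing
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

model = {"computer": [1.0, 0.0, 0.0], "science": [0.0, 1.0, 0.0]}
print(doc_vector(["the", "computer", "science"], model, stop_words={"the"}))
```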
  • 22. Testing and Training Before doing any feature extraction, the documents were randomly shuffled and 5,000 documents randomly taken out as testing data, with the remaining ~15,000 as training data. An assertion was made that every category in testing occurs at least once in training. The same sets of documents were used for all feature extraction, with no re-shuffling, so that we can compare models; that is, all feature extraction models were trained on the same training data and tested on the same testing data.
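The shuffle, hold-out and coverage assertion can be sketched as below; the fixed seed is our addition to illustrate how the same split can be reused across all feature extractors, as the slide requires:

```python
import random

def split(doc_ids, doc_tags, n_test=5000, seed=42):
    # Shuffle once, hold out n_test docs as the test set, and assert
    # that every category seen in testing also occurs in training.
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    test, train = ids[:n_test], ids[n_test:]
    train_tags = {t for d in train for t in doc_tags[d]}
    assert all(t in train_tags for d in test for t in doc_tags[d])
    return train, test

# illustrative data: 100 docs alternating between two categories
doc_tags = {i: ["politics" if i % 2 else "science"] for i in range(100)}
train, test = split(range(100), doc_tags, n_test=25)
print(len(train), len(test))
```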
  • 23. Classifier used: SVM with SGD training Motivation: We train a separate independent classifier for each label (multi-label setting), and the vectors are 100 to 1000 in length. Other classifiers take too much time, whereas Stochastic Gradient Descent is very fast. Sklearn's implementation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
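The "separate independent classifier for each label" setup amounts to turning the multi-label targets into one binary problem per tag before fitting each SGD classifier (sklearn's OneVsRestClassifier does this reshaping internally); a minimal sketch of that step:

```python
def per_label_targets(tags_per_doc, labels):
    # One binary target vector per label: 1 if the doc carries
    # that tag, 0 otherwise. Each vector trains one classifier.
    return {lab: [int(lab in tags) for tags in tags_per_doc] for lab in labels}

tags_per_doc = [{"politics", "history"}, {"science"}, {"politics"}]
targets = per_label_targets(tags_per_doc, ["politics", "science"])
print(targets["politics"])  # binary targets for the "politics" classifier
```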
  • 25. Review Note that our data is quite imbalanced: the green line “Maxclass” represents the accuracy when all vectors in testing are assigned the most frequently occurring label from training. As is evident, its value is almost always above 80%. 1. The topic-distribution vectors generated by LDA seem to be as bad as the Maxclass assignment when fed to the classifier (hence they overlap in the figure). However, LDA training might be the fastest of all. 2. It's difficult to compare doc2vec and tf-idf, but doc2vec performs better than word2vec; it is also faster than word2vec when it comes to generating the 20,000 vectors. 3. Word2Vec didn't perform as well and also took quite some time for vector extraction over all documents; its only advantage is that the feature extraction of one document doesn't affect the other vectors, hence less time for one test document. 4. tf-idf heavily depends on the number of feature words we consider. In this experiment we took 500 words; increasing the size may increase accuracy but will take more time and space to store too. 5. LDA performs well if we look at the accuracy but we can’t guarantee good
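The Maxclass baseline described above can be sketched as: predict the most frequent training label for every test document, then score with the same any-overlap criterion (the sample tags are illustrative):

```python
from collections import Counter

def maxclass_accuracy(train_tags, test_tags):
    # Assign every test doc the single most common training label,
    # then count how many test docs actually carry that label.
    majority = Counter(t for tags in train_tags for t in tags).most_common(1)[0][0]
    hits = sum(majority in tags for tags in test_tags)
    return hits / len(test_tags)

train = [{"wikipedia", "politics"}, {"wikipedia"}, {"science"}]
test = [{"wikipedia"}, {"science"}, {"wikipedia", "art"}, {"music"}]
print(maxclass_accuracy(train, test))  # fraction of test docs carrying the majority tag
```

With a heavily imbalanced tag distribution like ours, this trivial baseline already scores above 80%, which is why it is the reference line in the figure.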