Topic analysis using clustering techniques
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This study explores topic discovery by applying clustering
techniques to a large dataset of documents extracted from
Wikipedia.
2. Activities
A comparison of agglomerative hierarchical clustering
(O(n^2)) with K-means and bisecting K-means (O(n)) is
given by Steinbach et al. [1]. That work also discusses the
vector space model and document clustering, providing guidance
on tf-idf (term frequency-inverse document frequency) weighting
and on cluster-quality measures such as entropy and the F measure
(described in the next section).
The differences among cosine, Euclidean, Pearson and
Jaccard similarity measures are shown in Strehl et al. [2].
The conclusion states that "metric distances such as Euclidean
are not appropriate for high dimensional, sparse domains.
Cosine, correlation and extended Jaccard are successful
in capturing the similarities implicitly indicated by
manual categorizations (...)".
Among the more recent techniques, we found LDA (latent
Dirichlet allocation), described by Grün et al. [3], as a way
to capture the semantics of words as they appear together
(rather than only counting them individually).
3. Proposed Solutions
We will use a non-hierarchical approach based on the kmeans
function in R, with the Hartigan and Wong algorithm [4]. To
improve confidence in the seeding, we will use the nstart
argument of kmeans, which tries at least 10 random starts and
keeps the best solution. In addition, to avoid memory issues,
we will take a simple random sample (without replacement) of
the texts and determine the number of topics k with reasonable
confidence.
∗Is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)  (1)

IDF(t) = ln(total number of documents / number of documents containing term t)  (2)

TF-IDF(t, d) = TF(t, d) × IDF(t)  (3)
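As an illustration only, the following R sketch computes these quantities for a small term-count matrix (the matrix counts and its contents are hypothetical, not part of the experiments):

tf_idf <- function(counts) {
  tf  <- sweep(counts, 2, colSums(counts), "/")    # Eq. (1): term frequency per document
  idf <- log(ncol(counts) / rowSums(counts > 0))   # Eq. (2): inverse document frequency (ln)
  tf * idf                                         # Eq. (3): tf-idf weight
}

counts <- matrix(c(2, 0, 1, 1, 3, 0), nrow = 3,
                 dimnames = list(c("war", "music", "citi"), c("doc1", "doc2")))
round(tf_idf(counts), 3)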
As an evaluation of cluster quality we will use the pseudo-F statistic, which combines intra-cluster and inter-cluster statistics:

Pseudo-F = (between-cluster sum of squares / (k − 1)) / (within-cluster sum of squares / (n − k))  (4)
Pseudo-F was proposed by Calinski and Harabasz [5] and
describes the ratio of between-cluster variance to
within-cluster variance. If pseudo-F decreases, either the
within-cluster variance (denominator) is increasing or staying
constant, or the between-cluster variance (numerator) is
decreasing. Within-cluster variance measures how tightly the
points of each cluster fit together: the higher the value, the
more dispersed the cluster; the lower the value, the more
compact the cluster. Between-cluster variance measures how
separated the clusters are from each other.
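The following R sketch shows how the pseudo-F of Equation (4) can be computed from a kmeans fit, using the betweenss and tot.withinss components returned by kmeans; the data and k below are toy values, not those of the experiments:

set.seed(2010)
x  <- matrix(rnorm(200 * 10), ncol = 10)           # toy data: 200 observations, 10 features
k  <- 5
km <- kmeans(x, centers = k, nstart = 10)          # nstart as discussed above
n  <- nrow(x)
(pseudo_f <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k)))   # Equation (4)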
After finding the number of topics using the elbow technique,
we will run LDA (latent Dirichlet allocation), a more recent
technique as pointed out in Roberts et al. [6], to find the
topics based on semantic context and compare the clustering
results. Another method that will be tried is PAM (Partitioning
Around Medoids), as provided by Reynolds et al. [7], using
cosine as the dissimilarity measure, as discussed in Strehl et
al. [2]. For PAM we will try two approaches, unigrams and
bigrams (such as "United States"), to analyse whether the
clustering improves when LSA (latent semantic analysis) is
applied to partition the document-term space.
A word cloud and a dendrogram were also produced to make it
easier to see the words present in the overall bag of words.
4. Experiments and Discussion
The available dataset comes from Wikipedia and comprises
154,753 documents.
4.1. Data Preprocessing
Firstly, files whose names start with a dot were removed (as
they did not have content). After that, several transformations
were applied to make it possible to compare different documents
(a minimal code sketch follows this list):
(a) converting all words to lowercase
(b) removing punctuation and numbers
(c) removing English stopwords
(d) stripping whitespace
(e) applying English stemming
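A minimal sketch of these transformations with the R tm package is given below; the corpus folder name is illustrative, while the transformation calls are the standard tm ones:

library(tm)
library(SnowballC)                                       # stemmer used by stemDocument

docs <- VCorpus(DirSource("wikipedia_txt"))              # hypothetical folder of plain-text articles
docs <- tm_map(docs, content_transformer(tolower))       # (a) lowercase
docs <- tm_map(docs, removePunctuation)                  # (b) punctuation
docs <- tm_map(docs, removeNumbers)                      # (b) numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # (c) stopwords
docs <- tm_map(docs, stripWhitespace)                    # (d) whitespace
docs <- tm_map(docs, stemDocument)                       # (e) stemming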
Although it is not the structure used by K-means, the R
TermDocumentMatrix structure made it possible to find the most
frequent terms inside the documents (using 0.2 as a filter). A
dendrogram was then created (see Figure 1), and it was observed
that the majority of the most frequent items were non-relevant
words. For this reason, we added some of these words, such as
"also", "one", "includ", "first", "two", "three", "may",
"although" and "throughout", to the list of removed words.
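A dendrogram like the one in Figure 1 can be sketched from the term-document matrix as follows (the 0.9 sparsity threshold and the clustering method are illustrative choices, not necessarily those used for the figure):

tdm <- TermDocumentMatrix(docs)
m   <- as.matrix(removeSparseTerms(tdm, 0.9))        # keep only the most frequent terms
hc  <- hclust(dist(scale(m)), method = "ward.D")     # hierarchical clustering of the terms
plot(hc, main = "Dendrogram of the most frequent words")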
Due to memory restrictions it was not possible to find the
total number of terms in the entire set of roughly 150,000
documents. So, we first found the number of terms in simple
random samples (without replacement) of the documents, keeping
the results in document-term matrix structures with normalized
tf-idf weighting (Table 1):
sample rate % Nbr documents Nbr terms
100 154753 58796
37 56799 35214
30 47260 33047
10 15475 19013
7 10492 13561
5 7737 9806
1 1530 718
Table 1. Overall view of sampling and terms
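A minimal sketch of the sampling and weighting step, assuming the preprocessed corpus docs from Section 4.1 (the 5% rate matches one of the rows of Table 1; weightTfIdf is the normalized tf-idf weighting from tm):

set.seed(2010)
idx       <- sample(length(docs), size = round(0.05 * length(docs)))  # simple random sample, no replacement
dtm_tfidf <- DocumentTermMatrix(docs[idx],
                                control = list(weighting = weightTfIdf))
dim(dtm_tfidf)   # number of sampled documents and number of terms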
After that, some tests were done to see how many terms remain
after removing sparse terms (Table 2):
Figure 1. Dendrogram of the most frequent words.
sparsity Nbr documents Total Terms Nbr filtered terms
0.999 15475 19013 12030
0.999 10492 13561 12302
0.999 7737 9806 9806 (no filter)
0.9975 47620 33047 3060
0.9975 15475 19013 3200
0.9975 10492 13561 3111
0.9975 7737 9806 3330
0.99 15475 19013 360
0.99 7737 9806 365
0.98 154753 58796 4752
0.975 15475 19013 139
0.975 7737 9806 141
0.95 154753 58796 2448
0.95 10492 13561 70
0.9 154753 58796 1305
0.9 15475 19013 23
0.9 10492 13561 22
0.9 7737 9806 23
0.8 154753 58796 503
Table 2. Removing non-relevant items based on sparsity
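A sketch of how the sparsity filtering of Table 2 can be reproduced on the sampled matrix dtm_tfidf defined above (the thresholds are those explored in the table):

for (s in c(0.999, 0.9975, 0.99, 0.975, 0.95, 0.9, 0.8)) {
  filtered <- removeSparseTerms(dtm_tfidf, s)
  cat("sparsity", s, ":", ncol(filtered), "terms kept\n")
}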
4.2. Kmeans "Elbow" technique
These numbers suggest that for sparsity below 0.9975 the number
of relevant terms remains roughly constant across sample sizes,
indicating a relatively stable set of the most relevant terms
under tf-idf. Based on the number of terms, we decided to find
k using the "elbow" technique with 3,200 terms (sparsity
0.9975, which yields almost the same result as applying 0.98 to
the entire dataset).
To overcome memory limitations, we need to call rm() as soon as
variables become unused, followed by gc(). Only then was it
possible to run the "elbow" technique (we sampled the candidate
k values at powers of 2, up to 1024 clusters), whose result is
presented in Figure 2:
Figure 2. Plotting divergence with respect to the possible number
of groups (sampling k at powers of 2 to enable execution).
From the plot, the suggested "elbow" is around 250 topics (225
to 275), assuming a 10-percent error margin.
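A sketch of the elbow search over powers of two, freeing memory between runs as described above (dtm_tfidf is the sampled tf-idf matrix from Section 4.1; the plotted statistic is the total within-cluster sum of squares):

m   <- as.matrix(removeSparseTerms(dtm_tfidf, 0.9975))
ks  <- 2^(1:10)                        # 2, 4, ..., 1024 clusters
wss <- numeric(length(ks))
for (i in seq_along(ks)) {
  km     <- kmeans(m, centers = ks[i], nstart = 10)
  wss[i] <- km$tot.withinss            # divergence measure for the elbow plot
  rm(km); gc()                         # release memory before the next run
}
plot(ks, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")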
The most frequent words found using tf-idf (sparsity filter
0.9) are listed in Table 3:
4.2.1 Kmeans: Intra and Inter-cluster statistics
The statistics found for K-means with n = 7737 documents,
3357 terms and k = 250 are given in Table 4:
4.3. LDA - Latent Dirichlet Allocation
The following code was used to discover the topics based on
latent Dirichlet allocation (the algorithm takes tf counts as
input instead of tf-idf, but it exploits co-occurrence
relationships among the words inside the documents to identify
the topics; docs_term_tf.new denotes the document-term matrix
with tf weighting):
library(topicmodels)   # provides LDA(), terms() and topics()

lda <- LDA(docs_term_tf.new, k,
           control = list(alpha = 0.1, seed = 2010))
lda@k
(terms  <- terms(lda, 3))
(topics <- topics(lda))
wordColumn1 wordColumn2 wordColumn3 wordColumn4
agricultur anglosaxon anniversari archbishop
architectur australian autobiographi background
battlecruis battleship borchgrevink california
canterburi championship characterist commission
commonwealth constantin contemporari controversi
correspond counterattack cunningham difficulti
distinguish documentari elagabalus epaminonda
experiment indonesian instrument kinetoscop
legislatur manufactur manuscript massachusett
mediterranean metropolitan millennium nevertheless
observatori palestinian parliament parliamentari
particular pennsylvania philadelphia photograph
pittsburgh profession relationship republican
settlement soundtrack springfield strengthen
temperatur thoroughbr tournament underground
understand vegetarian washington widespread
youngstown
Table 3. 65 most frequent words using tf-idf
Within-cluster SS Between-cluster SS Pseudo-F
7864.74 15607.05 59.66847
Table 4. K-means statistics
Below is a summary of the first 15 topics (clusters) found,
showing the top 3 words in each (see Table 5):
The following histogram shows the number of documents
attributed to each topic (Figure 3):
Figure 3. Distribution of documents over the topics.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
confederaci soundtrack pittsburgh archbishop parliament twickenham counterattack leadership
profession documentari throughout predecessor underground reconstruct breakthrough mainstream
particular screenplay scoreboard canterburi counsellor pedestrian reichswehr ambassador
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
metropolitan constantinopl parliamentari correspond temperatur leopardskin spacecraft
librettist particular parliament transcendent declassifi photograph temperatur
spinelloccio myriokephalon legitimaci mathematician decontamin albumdylan perihelion
Table 5. LDA first 15 topics
4.4. PAM using LSA
The last algorithm we tried was PAM (Partitioning
Around Medoids). The results are discussed below:
4.4.1 PAM using bigrams
Differently from the other algorithms, here we built a
TermDocumentMatrix structure from bigrams instead of unigrams
(a sketch of the bigram extraction is given after Figure 4).
The following dendrogram (Figure 4) summarizes the most
frequent terms:
Figure 4. Dendrogram for 20 terms and 7737 documents.
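A sketch of the bigram extraction, assuming a custom tokenizer passed through the tokenize option of TermDocumentMatrix (the helper below is illustrative, not necessarily the tokenizer used in the experiments):

BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))            # e.g. "unit states", "world war"
}
tdm_bigram <- TermDocumentMatrix(docs[idx],
                                 control = list(tokenize = BigramTokenizer))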
The result can also be viewed in a word cloud (Figure 5):
With this data, PAM was applied using latent semantic analysis
to decompose the term-document space. The dissimilarity metric
adopted was 1 minus the cosine similarity in the LSA space. To
enable comparison, the same parameters as for the other
algorithms were used (especially k = 250 topics). The
silhouette in Figure 6 shows the clusters and their widths;
several widths are negative, indicating some error (wrong
cluster placement).
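A sketch of this PAM step, assuming the lsa and cluster packages and the bigram matrix tdm_bigram defined above (the representation of documents in the LSA space is a simplified choice for illustration):

library(lsa)
library(cluster)

m_bigram <- as.matrix(tdm_bigram)                    # terms x documents
space    <- lsa(m_bigram)                            # latent semantic analysis decomposition
doc_lsa  <- t(space$dk)                              # documents as columns in the LSA space
d        <- as.dist(1 - cosine(doc_lsa))             # dissimilarity = 1 - cosine similarity
fit      <- pam(d, k = 250, diss = TRUE)
plot(silhouette(fit))                                # silhouette widths per cluster (Figure 6)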
Figure 5. Word cloud for 492 terms and 7737 documents.
5. Conclusions and Future Work
Given Kmeans statistics at Table 4, we can see that the
sums of squares inside the cluster is lower than between
cluster sums of squares as expected to achieve a good sepa-
ration among clusters.
Analysing the results in Figure 2, the number of topics found
was around 250 (225-275 topics) using the "elbow" rule and a
10-percent error margin.
Using the LDA algorithm, it was easier to identify the clusters
and their content (the 3 most relevant terms in each), and the
topics make sense: for example, topic 2 is related to
music/movies (soundtrack, documentari, screenplay), and some
topics not shown due to lack of space, such as topic 130
(tchaikovski, petersburg, conservatori), cover the same broad
subject with different content. Some topics are mainly about
war, such as topic 7 (counterattack, breakthrough, reichswehr),
and others about politics, for example topic 11 (parliamentari,
parliament, legitimaci).
The histogram in Figure 3 shows the topics with the highest
number of documents attributed by LDA.
Figure 6. Silhouette using bigrams and cosine distance for 7737
documents.
By using bigrams we discovered other terms with high relevance
("unit states", possibly corresponding to United States, "world
war", "war ii"); all of them are consistent with the topics
discovered by LDA and with the most relevant words found by the
tf-idf measure in Table 3.
Unfortunately, PAM using the cosine measure did not produce a
good separation of the clusters, as indicated by the negative
silhouette widths, possibly suggesting that unigrams are the
better approach for this problem.
As future work, we intend to investigate ways to display
several topics and their terms for LDA in a word cloud or
graph, to make it easier to visualize the output provided by
this clustering algorithm.
References
[1] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. Department of Computer Science and Engineering, University of Minnesota, Technical Report 00-034, 2000.
[2] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. In AAAI-2000: Workshop of Artificial Intelligence for Web Search, pages 58–64, 2000.
[3] Bettina Grün and Kurt Hornik. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 2011.
[4] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[5] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.
[6] Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. stm: R package for structural topic models. 2014.
[7] A. Reynolds, G. Richards, B. de la Iglesia, and V. Rayward-Smith. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5:475–504, 2006.
