Topic analysis using clustering techniques
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This study explores topic discovery by applying clustering
techniques to a large dataset of documents extracted from
Wikipedia.
2. Activities
A comparison of agglomerative hierarchical clustering
(O(n^2)) with K-means and bisecting K-means (O(n)) is
given by Steinbach et al. [1]. That work also discusses the
vector space model and document clustering, providing guidance
on tf-idf (term frequency-inverse document frequency) weighting
and on cluster-quality measures such as entropy and the F measure
(described in the next section).
The differences among cosine, Euclidean, Pearson and
Jaccard similarity measures are shown in Strehl et al. [2].
The conclusion states that "metric distances such as Euclidean
are not appropriate for high dimensional, sparse domains.
Cosine, correlation and extended Jaccard are successful
in capturing the similarities implicitly indicated by
manual categorizations (...)".
Among the more recent techniques, we found LDA (latent
Dirichlet allocation), described by Grün et al. [3], as a way
to capture the semantics of words as they appear together
(rather than only counting them individually).
3. Proposed Solutions
We will use a non-hierarchical approach based on the kmeans
function in R, with the Hartigan and Wong algorithm [4]. To
improve confidence in the seeding, we will use the nstart
argument of kmeans, which tries at least 10 random starts and
keeps the best solution. In addition, to avoid memory issues,
we will take a simple random sample (without replacement) of
the texts and determine the number of topics k with reasonable
confidence.
∗Is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)  (1)

IDF(t) = ln(total number of documents / number of documents containing term t)  (2)

TF-IDF(t, d) = TF(t, d) × IDF(t)  (3)
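As an illustration only, the following R sketch computes these quantities for a small term-count matrix (the matrix counts and its contents are hypothetical, not part of the experiments):

tf_idf <- function(counts) {
  tf  <- sweep(counts, 2, colSums(counts), "/")    # Eq. (1): term frequency per document
  idf <- log(ncol(counts) / rowSums(counts > 0))   # Eq. (2): inverse document frequency (ln)
  tf * idf                                         # Eq. (3): tf-idf weight
}

counts <- matrix(c(2, 0, 1, 1, 3, 0), nrow = 3,
                 dimnames = list(c("war", "music", "citi"), c("doc1", "doc2")))
round(tf_idf(counts), 3)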
As an evaluation of cluster quality we will use the pseudo-F statistic, which combines intra-cluster and inter-cluster statistics:

Pseudo-F = (between-cluster sum of squares / (k − 1)) / (within-cluster sum of squares / (n − k))  (4)
Pseudo-F was proposed by Calinski and Harabasz [5] and
describes the ratio of between-cluster variance to
within-cluster variance. If pseudo-F decreases, either the
within-cluster variance (denominator) is increasing or staying
constant, or the between-cluster variance (numerator) is
decreasing. Within-cluster variance measures how tightly the
points of each cluster fit together: the higher the value, the
more dispersed the cluster; the lower the value, the more
compact the cluster. Between-cluster variance measures how
separated the clusters are from each other.
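The following R sketch shows how the pseudo-F of Equation (4) can be computed from a kmeans fit, using the betweenss and tot.withinss components returned by kmeans; the data and k below are toy values, not those of the experiments:

set.seed(2010)
x  <- matrix(rnorm(200 * 10), ncol = 10)           # toy data: 200 observations, 10 features
k  <- 5
km <- kmeans(x, centers = k, nstart = 10)          # nstart as discussed above
n  <- nrow(x)
(pseudo_f <- (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k)))   # Equation (4)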
After finding the number of topics using the elbow technique,
we will run LDA (latent Dirichlet allocation), a more recent
technique as pointed out in Roberts et al. [6], to find the
topics based on semantic context and compare the clustering
results. Another method that will be tried is PAM (Partitioning
Around Medoids), as provided by Reynolds et al. [7], using
cosine as the dissimilarity measure, as discussed in Strehl et
al. [2]. For PAM we will try two approaches, unigrams and
bigrams (such as "United States"), to analyse whether the
clustering improves when LSA (latent semantic analysis) is
applied to partition the document-term space.
A word cloud and a dendrogram were also produced to make it
easier to see the words present in the overall bag of words.
4. Experiments and Discussion
The available dataset comes from Wikipedia and comprises
154,753 documents.
4.1. Data Preprocessing
Firstly, files whose names start with a dot were removed (as
they did not have content). After that, several transformations
were applied to make it possible to compare different documents
(a minimal code sketch follows this list):
(a) converting all words to lowercase
(b) removing punctuation and numbers
(c) removing English stopwords
(d) stripping whitespace
(e) applying English stemming
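A minimal sketch of these transformations with the R tm package is given below; the corpus folder name is illustrative, while the transformation calls are the standard tm ones:

library(tm)
library(SnowballC)                                       # stemmer used by stemDocument

docs <- VCorpus(DirSource("wikipedia_txt"))              # hypothetical folder of plain-text articles
docs <- tm_map(docs, content_transformer(tolower))       # (a) lowercase
docs <- tm_map(docs, removePunctuation)                  # (b) punctuation
docs <- tm_map(docs, removeNumbers)                      # (b) numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # (c) stopwords
docs <- tm_map(docs, stripWhitespace)                    # (d) whitespace
docs <- tm_map(docs, stemDocument)                       # (e) stemming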
Although it is not the structure used by K-means, the R
TermDocumentMatrix structure made it possible to find the most
frequent terms inside the documents (using 0.2 as a filter). A
dendrogram was then created (see Figure 1), and it was observed
that the majority of the most frequent items were non-relevant
words. For this reason, we added some of these words, such as
"also", "one", "includ", "first", "two", "three", "may",
"although" and "throughout", to the list of removed words.
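A dendrogram like the one in Figure 1 can be sketched from the term-document matrix as follows (the 0.9 sparsity threshold and the clustering method are illustrative choices, not necessarily those used for the figure):

tdm <- TermDocumentMatrix(docs)
m   <- as.matrix(removeSparseTerms(tdm, 0.9))        # keep only the most frequent terms
hc  <- hclust(dist(scale(m)), method = "ward.D")     # hierarchical clustering of the terms
plot(hc, main = "Dendrogram of the most frequent words")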
Due to memory restrictions it was not possible to find the
total number of terms in the entire set of roughly 150,000
documents. So, we first found the number of terms in simple
random samples (without replacement) of the documents, keeping
the results in document-term matrix structures with normalized
tf-idf weighting (Table 1):
sample rate % Nbr documents Nbr terms
100 154753 58796
37 56799 35214
30 47260 33047
10 15475 19013
7 10492 13561
5 7737 9806
1 1530 718
Table 1. Overall view of sampling and terms
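A minimal sketch of the sampling and weighting step, assuming the preprocessed corpus docs from Section 4.1 (the 5% rate matches one of the rows of Table 1; weightTfIdf is the normalized tf-idf weighting from tm):

set.seed(2010)
idx       <- sample(length(docs), size = round(0.05 * length(docs)))  # simple random sample, no replacement
dtm_tfidf <- DocumentTermMatrix(docs[idx],
                                control = list(weighting = weightTfIdf))
dim(dtm_tfidf)   # number of sampled documents and number of terms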
After that, some tests were done to see how many terms remain
after removing sparse terms (Table 2):
Figure 1. Dendrogram of the most frequent words.
sparsity Nbr documents Total Terms Nbr filtered terms
0.999 15475 19013 12030
0.999 10492 13561 12302
0.999 7737 9806 9806 (no filter)
0.9975 47620 33047 3060
0.9975 15475 19013 3200
0.9975 10492 13561 3111
0.9975 7737 9806 3330
0.99 15475 19013 360
0.99 7737 9806 365
0.98 154753 58796 4752
0.975 15475 19013 139
0.975 7737 9806 141
0.95 154753 58796 2448
0.95 10492 13561 70
0.9 154753 58796 1305
0.9 15475 19013 23
0.9 10492 13561 22
0.9 7737 9806 23
0.8 154753 58796 503
Table 2. Removing non-relevant items based on sparsity
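A sketch of how the sparsity filtering of Table 2 can be reproduced on the sampled matrix dtm_tfidf defined above (the thresholds are those explored in the table):

for (s in c(0.999, 0.9975, 0.99, 0.975, 0.95, 0.9, 0.8)) {
  filtered <- removeSparseTerms(dtm_tfidf, s)
  cat("sparsity", s, ":", ncol(filtered), "terms kept\n")
}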
4.2. Kmeans "Elbow" technique
These numbers suggest that for sparsity below 0.9975 the number
of relevant terms remains roughly constant across sample sizes,
indicating a relatively stable set of the most relevant terms
under tf-idf. Based on the number of terms, we decided to find
k using the "elbow" technique with 3,200 terms (sparsity
0.9975, which yields almost the same result as applying 0.98 to
the entire dataset).
To overcome memory limitations, we need to call rm() as soon as
variables become unused, followed by gc(). Only then was it
possible to run the "elbow" technique (we sampled the candidate
k values at powers of 2, up to 1024 clusters), whose result is
presented in Figure 2:
Figure 2. Plotting divergence with respect to the possible number
of groups (sampling k at powers of 2 to enable execution).
From the plot, the suggested "elbow" is around 250 topics (225
to 275), assuming a 10-percent error margin.
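A sketch of the elbow search over powers of two, freeing memory between runs as described above (dtm_tfidf is the sampled tf-idf matrix from Section 4.1; the plotted statistic is the total within-cluster sum of squares):

m   <- as.matrix(removeSparseTerms(dtm_tfidf, 0.9975))
ks  <- 2^(1:10)                        # 2, 4, ..., 1024 clusters
wss <- numeric(length(ks))
for (i in seq_along(ks)) {
  km     <- kmeans(m, centers = ks[i], nstart = 10)
  wss[i] <- km$tot.withinss            # divergence measure for the elbow plot
  rm(km); gc()                         # release memory before the next run
}
plot(ks, wss, type = "b", xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")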
The most frequent words found using tf-idf (sparsity filter
0.9) are listed in Table 3:
4.2.1 Kmeans: Intra and Inter-cluster statistics
The statistics found for K-means with n = 7737 documents,
3357 terms and k = 250 are given in Table 4:
4.3. LDA - Latent Dirichlet Allocation
The following code was used to discover the topics based on
latent Dirichlet allocation (the algorithm takes tf counts as
input instead of tf-idf, but it exploits co-occurrence
relationships among the words inside the documents to identify
the topics; docs_term_tf.new denotes the document-term matrix
with tf weighting):
library(topicmodels)   # provides LDA(), terms() and topics()

lda <- LDA(docs_term_tf.new, k,
           control = list(alpha = 0.1, seed = 2010))
lda@k
(terms  <- terms(lda, 3))
(topics <- topics(lda))
wordColumn1 wordColumn2 wordColumn3 wordColumn4
agricultur anglosaxon anniversari archbishop
architectur australian autobiographi background
battlecruis battleship borchgrevink california
canterburi championship characterist commission
commonwealth constantin contemporari controversi
correspond counterattack cunningham difficulti
distinguish documentari elagabalus epaminonda
experiment indonesian instrument kinetoscop
legislatur manufactur manuscript massachusett
mediterranean metropolitan millennium nevertheless
observatori palestinian parliament parliamentari
particular pennsylvania philadelphia photograph
pittsburgh profession relationship republican
settlement soundtrack springfield strengthen
temperatur thoroughbr tournament underground
understand vegetarian washington widespread
youngstown
Table 3. 65 most frequent words using tf-idf
Within-cluster SS Between-cluster SS Pseudo-F
7864.74 15607.05 59.66847
Table 4. K-means statistics
Below is a summary of the first 15 topics (clusters) found,
showing the top 3 words in each (see Table 5):
The following histogram shows the number of documents
attributed to each topic (Figure 3):
Figure 3. Distribution of documents over the topics.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
confederaci soundtrack pittsburgh archbishop parliament twickenham counterattack leadership
profession documentari throughout predecessor underground reconstruct breakthrough mainstream
particular screenplay scoreboard canterburi counsellor pedestrian reichswehr ambassador
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
metropolitan constantinopl parliamentari correspond temperatur leopardskin spacecraft
librettist particular parliament transcendent declassifi photograph temperatur
spinelloccio myriokephalon legitimaci mathematician decontamin albumdylan perihelion
Table 5. LDA first 15 topics
4.4. PAM using LSA
The last algorithm we tried was PAM (Partitioning
Around Medoids). The results are discussed below:
4.4.1 PAM using bigrams
Differently from the other algorithms, here we built a
TermDocumentMatrix structure from bigrams instead of unigrams
(a sketch of the bigram extraction is given after Figure 4).
The following dendrogram (Figure 4) summarizes the most
frequent terms:
Figure 4. Dendrogram for 20 terms and 7737 documents.
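A sketch of the bigram extraction, assuming a custom tokenizer passed through the tokenize option of TermDocumentMatrix (the helper below is illustrative, not necessarily the tokenizer used in the experiments):

BigramTokenizer <- function(x) {
  words <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))            # e.g. "unit states", "world war"
}
tdm_bigram <- TermDocumentMatrix(docs[idx],
                                 control = list(tokenize = BigramTokenizer))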
The result can also be viewed in a word cloud (Figure 5):
With this data, PAM was applied using latent semantic analysis
to decompose the term-document space. The dissimilarity metric
adopted was 1 minus the cosine similarity in the LSA space. To
enable comparison, the same parameters as for the other
algorithms were used (especially k = 250 topics). The
silhouette in Figure 6 shows the clusters and their widths;
several widths are negative, indicating some error (wrong
cluster placement).
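A sketch of this PAM step, assuming the lsa and cluster packages and the bigram matrix tdm_bigram defined above (the representation of documents in the LSA space is a simplified choice for illustration):

library(lsa)
library(cluster)

m_bigram <- as.matrix(tdm_bigram)                    # terms x documents
space    <- lsa(m_bigram)                            # latent semantic analysis decomposition
doc_lsa  <- t(space$dk)                              # documents as columns in the LSA space
d        <- as.dist(1 - cosine(doc_lsa))             # dissimilarity = 1 - cosine similarity
fit      <- pam(d, k = 250, diss = TRUE)
plot(silhouette(fit))                                # silhouette widths per cluster (Figure 6)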
Figure 5. Word cloud for 492 terms and 7737 documents.
5. Conclusions and Future Work
Given Kmeans statistics at Table 4, we can see that the
sums of squares inside the cluster is lower than between
cluster sums of squares as expected to achieve a good sepa-
ration among clusters.
Analysing the results in Figure 2, the number of topics found
was around 250 (225-275 topics) using the "elbow" rule and a
10-percent error margin.
Using the LDA algorithm, it was easier to identify the clusters
and their content (the 3 most relevant terms in each), and the
topics make sense: for example, topic 2 is related to
music/movies (soundtrack, documentari, screenplay), and some
topics not shown due to lack of space, such as topic 130
(tchaikovski, petersburg, conservatori), cover the same broad
subject with different content. Some topics are mainly about
war, such as topic 7 (counterattack, breakthrough, reichswehr),
and others about politics, for example topic 11 (parliamentari,
parliament, legitimaci).
The histogram in Figure 3 shows the topics with the highest
number of documents attributed by LDA.
Figure 6. Silhouette using bigrams and cosine distance for 7737
documents.
By using bigrams we discovered other terms with high relevance
("unit states", possibly corresponding to United States, "world
war", "war ii"); all of them are consistent with the topics
discovered by LDA and with the most relevant words found by the
tf-idf measure in Table 3.
Unfortunately, PAM using the cosine measure did not produce a
good separation of the clusters, as indicated by the negative
silhouette widths, possibly suggesting that unigrams are the
better approach for this problem.
As future work, we intend to investigate ways to display
several topics and their terms for LDA in a word cloud or
graph, to make it easier to visualize the output provided by
this clustering algorithm.
References
[1] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. Department of Computer Science and Engineering, University of Minnesota, Technical Report 00-034, 2000.
[2] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. In AAAI-2000: Workshop of Artificial Intelligence for Web Search, pages 58–64, 2000.
[3] Bettina Grün and Kurt Hornik. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 2011.
[4] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[5] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.
[6] Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. stm: R package for structural topic models. 2014.
[7] A. Reynolds, G. Richards, B. de la Iglesia, and V. Rayward-Smith. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5:475–504, 2006.
