The document discusses using clustering techniques like K-means, LDA, and PAM to analyze topics in a large dataset of Wikipedia documents. It explores preprocessing steps, compares different clustering algorithms, and analyzes the results. K-means identified around 250 topics using the elbow method. LDA was able to identify coherent topics based on word co-occurrence. PAM using bigrams found some meaningful word pairs but the clusters did not separate well. The techniques revealed topics related to music, politics, war and more.
Topic analysis using clustering techniques
Paulo Renato de Faria∗
Anderson Rocha†
1. Introduction
This study explores topic knowledge discovery by applying clustering techniques to a large dataset of documents extracted from Wikipedia.
2. Activities
A comparison of agglomerative hierarchical clustering O(n²), K-means, and bisecting K-means O(n) is given in Steinbach et al. [1]. It also discusses the vector space model and document clustering, providing guidance on tf-idf (term frequency-inverse document frequency) and on cluster-quality measures such as entropy and the F-measure (described in the next section).
The differences among cosine, Euclidean, Pearson and Jaccard similarity measures are shown in Strehl et al. [2]. The conclusion states that "metric distances such as Euclidean are not appropriate for high dimensional, sparse domains. Cosine, correlation and extended Jaccard are successful in capturing the similarities implicitly indicated by manual categorizations (...)".
Among the newer techniques, we found LDA (latent Dirichlet allocation), presented in Grün et al. [3], as a way to capture the semantics of words as they appear together (not only their individual counts).
3. Proposed Solutions
We will use a non-hierarchical approach based on the K-means implementation in R of Hartigan and Wong [4]. To improve confidence, we will use the nstart option of kmeans, which tries at least 10 random starts and keeps the best solution. In addition, to avoid memory issues, we will use simple random sampling (without replacement) to select a subset of texts and determine the number of topics k with reasonable confidence.
∗Is with the Institute of Computing, University of Campinas (Unicamp). Contact: paulo.faria@gmail.com
†Is with the Institute of Computing, University of Campinas (Unicamp). Contact: anderson.rocha@ic.unicamp.br
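A minimal sketch of this sampling and multi-start strategy follows, assuming a document-term matrix dtm has already been built (e.g., with the tm package, as in Section 4.1); the sampling rate, seed and k shown here are illustrative, not the values used later in the experiments.

set.seed(2010)                                             # illustrative seed
idx <- sample(nrow(dtm), size = round(0.05 * nrow(dtm)))   # simple random sample, no replacement
m   <- as.matrix(dtm[idx, ])
km  <- kmeans(m, centers = 250, nstart = 10)               # 10 random starts, best solution kept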
TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)   (1)

IDF(t) = ln(number of documents / number of documents containing term t)   (2)

TF-IDF(t, d) = TF(t, d) × IDF(t)   (3)
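In R, this weighting can be applied while building the document-term matrix; a short sketch assuming the tm package and a preprocessed corpus docs (see Section 4.1):

library(tm)
# Normalized tf-idf weighting applied while building the document-term matrix
dtm <- DocumentTermMatrix(docs,
         control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))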
As an evaluation of cluster quality we will use the following statistic, which combines intra-cluster and inter-cluster variation:
Pseudo-F = (between-cluster sum of squares / (k - 1)) / (within-cluster sum of squares / (n - k))   (4)
Pseudo-F was proposed by Calinski and Harabasz [5] and describes the ratio of between-cluster variance to within-cluster variance. If Pseudo-F decreases, either the within-cluster variance (the denominator) is increasing or staying static, or the between-cluster variance (the numerator) is decreasing. Within-cluster variance measures how tightly each cluster fits together: the higher the value, the more dispersed the cluster; the lower the value, the more compact. Between-cluster variance measures how well separated the clusters are from each other.
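A small sketch of how this statistic can be computed in R from a fitted K-means result; km is assumed to be an object returned by base R's kmeans(), whose betweenss and tot.withinss fields hold the sums of squares in Eq. (4).

# Pseudo-F (Eq. 4) from the sums of squares reported by kmeans()
pseudo_f <- function(km) {
  n <- length(km$cluster)   # number of clustered documents
  k <- nrow(km$centers)     # number of clusters
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}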
After finding the number of topics with the "elbow" technique, we will run LDA (latent Dirichlet allocation), a recent technique as pointed out in Roberts et al. [6], to find topics based on semantic context and compare the clustering results. Another method to be tried is PAM (Partitioning Around Medoids), as provided in Reynolds et al. [7], using cosine as the dissimilarity measure, as discussed in Strehl et al. [2]. For PAM we will try two approaches, unigrams and bigrams (such as "United States"), to analyse whether clustering improves when LSA (latent semantic analysis) is applied to partition the document-term space.
We also present a word cloud and a dendrogram to make it easier to see the words present in the overall bag of words.
4. Experiments and Discussion
The available dataset comes from Wikipedia and comprises 154,753 documents.
4.1. Data Preprocessing
First, files starting with a dot were removed, as they had no content. After that, several transformations were applied to make different documents comparable (a sketch of these steps is given after the list):
(a) convert all words to lowercase
(b) remove punctuation and numbers
(c) remove English stopwords
(d) strip whitespace
(e) apply English stemming
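A minimal sketch of these transformations using the tm package; the source directory name is illustrative, and stemming additionally requires the SnowballC package.

library(tm)
docs <- VCorpus(DirSource("wikipedia_docs", encoding = "UTF-8"))  # directory name is an assumption
docs <- tm_map(docs, content_transformer(tolower))        # (a) lowercase
docs <- tm_map(docs, removePunctuation)                   # (b) punctuation
docs <- tm_map(docs, removeNumbers)                       # (b) numbers
docs <- tm_map(docs, removeWords, stopwords("english"))   # (c) English stopwords
docs <- tm_map(docs, stripWhitespace)                     # (d) whitespace
docs <- tm_map(docs, stemDocument, language = "english")  # (e) stemming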
Although it is not the structure used by K-means, the R TermDocument structure made it possible to find the most frequent terms in the documents (using 0.2 as a filter). A dendrogram was then created (see Figure 1), and we observed that the majority of the most frequent items were non-relevant words. For this reason, we added some of these words to the stopword list, such as "also", "one", "includ", "first", "two", "three", "may", "although", "throughout".
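A sketch of this step follows; the interpretation of the 0.2 filter (a lower bound on term frequency relative to the number of documents) is an assumption, since the exact call is not given in the text.

tdm        <- TermDocumentMatrix(docs)
freq_terms <- findFreqTerms(tdm, lowfreq = round(0.2 * length(docs)))  # assumed meaning of the 0.2 filter
# Dendrogram of the frequent terms (Figure 1)
m  <- as.matrix(tdm)[freq_terms, ]
hc <- hclust(dist(scale(m)), method = "ward.D2")
plot(hc)
# Add the frequent but non-relevant (stemmed) words to the removal step
extra_stop <- c("also", "one", "includ", "first", "two", "three",
                "may", "although", "throughout")
docs <- tm_map(docs, removeWords, extra_stop)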
Due to memory restrictions it was not possible to find the total number of terms over the entire set of roughly 150,000 documents. So we first found the number of terms in simple random samples (without replacement) of the documents, kept the result in a document-term matrix structure, and applied normalized tf-idf weighting (Table 1):
Sample rate (%)   Nbr documents   Nbr terms
100 154753 58796
37 56799 35214
30 47260 33047
10 15475 19013
7 10492 13561
5 7737 9806
1 1530 718
Table 1. Overall view of sampling and terms
After that, some tests were run to see how many terms remain when sparse terms are removed at different thresholds (Table 2):
Figure 1. Dendrogram of the most frequent words.
Sparsity   Nbr documents   Total terms   Terms after filtering
0.999 15475 19013 12030
0.999 10492 13561 12302
0.999 7737 9806 9806 (no filter)
0.9975 47620 33047 3060
0.9975 15475 19013 3200
0.9975 10492 13561 3111
0.9975 7737 9806 3330
0.99 15475 19013 360
0.99 7737 9806 365
0.98 154753 58796 4752
0.975 15475 19013 139
0.975 7737 9806 141
0.95 154753 58796 2448
0.95 10492 13561 70
0.9 154753 58796 1305
0.9 15475 19013 23
0.9 10492 13561 22
0.9 7737 9806 23
0.8 154753 58796 503
Table 2. Removing non-relevant items based on sparsity
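Table 2 can be reproduced with removeSparseTerms from the tm package; a sketch for one document-term matrix, looping over the thresholds listed above (which matrix is used at each row is not restated here).

for (s in c(0.999, 0.9975, 0.99, 0.98, 0.975, 0.95, 0.9, 0.8)) {
  filtered <- removeSparseTerms(dtm, sparse = s)
  cat("sparsity", s, "-> terms kept:", ncol(filtered), "\n")
}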
4.2. Kmeans "Elbow" technique
These numbers suggest that, for sparsity < 0.9975, the number of relevant terms remains roughly constant, indicating a relatively stable set of the most relevant terms for tf-idf. Based on the number of terms, we decided to find k using the "elbow" technique with 3,200 terms (sparsity 0.9975, which gives almost the same result as applying 0.98 to the entire dataset).
To overcome memory limitations we had to call rm() as soon as variables became unused, followed by gc(). Only then was it possible to run the "elbow" technique (we used powers of 2 for the candidate values of k, up to 1024 clusters); the result is presented in Figure 2:
Figure 2. Divergence with respect to the possible number of groups (k restricted to powers of 2 to reduce the number of runs and enable execution).
From the plot, the suggested "elbow" is around 250 topics (225 to 275, with a 10-percent error margin).
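A sketch of the elbow run described above, with k taken as powers of 2 and memory released between runs; m stands for the sampled tf-idf matrix built earlier, and the exact parameters are assumptions.

ks  <- 2^(1:10)                          # candidate k: 2, 4, ..., 1024
wss <- sapply(ks, function(k) {
  fit <- kmeans(m, centers = k, nstart = 10)
  w   <- fit$tot.withinss
  rm(fit); gc()                          # free memory between runs, as noted above
  w
})
plot(ks, wss, type = "b",
     xlab = "k (number of clusters)", ylab = "total within-cluster SS")  # Figure 2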
The most frequent words found using tf-idf were the following (using a 0.9 sparsity filter), shown in Table 3:
4.2.1 Kmeans: Intra and Inter-cluster statistics
The statistics found for K-means with n = 7737 documents, 3357 terms and k = 250 are shown in Table 4:
4.3. LDA - Latent Dirichlet Allocation
The following code was used to discover the topics with latent Dirichlet allocation (the algorithm takes tf counts as input instead of tf-idf, but it exploits the semantic relationships among the words inside the documents to identify the topics):
library(topicmodels)  # provides LDA(), terms() and topics()
lda <- LDA(docstermtf.new, k,
           control = list(alpha = 0.1, seed = 2010))
lda@k
(terms <- terms(lda, 3))
(topics <- topics(lda))
wordColumn1 wordColumn2 wordColumn3 wordColumn4
agricultur anglosaxon anniversari archbishop
architectur australian autobiographi background
battlecruis battleship borchgrevink california
canterburi championship characterist commission
commonwealth constantin contemporari controversi
correspond counterattack cunningham difficulti
distinguish documentari elagabalus epaminonda
experiment indonesian instrument kinetoscop
legislatur manufactur manuscript massachusett
mediterranean metropolitan millennium nevertheless
observatori palestinian parliament parliamentari
particular pennsylvania philadelphia photograph
pittsburgh profession relationship republican
settlement soundtrack springfield strengthen
temperatur thoroughbr tournament underground
understand vegetarian washington widespread
youngstown
Table 3. 65 most frequent words using tf-idf
Within-cluster SS Between-cluster SS Pseudo-F
7864.74 15607.05 59.66847
Table 4. K-means statistics
Below is a summary of the first 15 topics (clusters) found, showing the top 3 words of each (see Table 5):
The following histogram shows the number of documents assigned to each topic (Figure 3):
Figure 3. Distribution of documents over the topics.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
confederaci soundtrack pittsburgh archbishop parliament twickenham counterattack leadership
profession documentari throughout predecessor underground reconstruct breakthrough mainstream
particular screenplay scoreboard canterburi counsellor pedestrian reichswehr ambassador
Topic 9 Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
metropolitan constantinopl parliamentari correspond temperatur leopardskin spacecraft
librettist particular parliament transcendent declassifi photograph temperatur
spinelloccio myriokephalon legitimaci mathematician decontamin albumdylan perihelion
Table 5. LDA first 15 topics
4.4. PAM using LSA
The last algorithm we tried was PAM (Partitioning
Around Medoids). The results are discussed below:
4.4.1 PAM using bigrams
Differently from the other algorithms, here a TermDocument structure was built by extracting bigrams instead of unigrams. The dendrogram in Figure 4 summarizes the most frequent terms:
Figure 4. Dendrogram for 20 terms and 7737 documents.
The result can also be viewed in a word cloud (Figure 5):
With this data, PAM was applied using latent semantic analysis to partition the TermDocument structure. The dissimilarity metric adopted was 1 - cosine in the LSA space. To enable comparison, the same parameters as for the other algorithms were used (in particular k = 250 topics). The silhouette in Figure 6 shows the clusters and their widths; several widths are negative, indicating errors (wrong cluster placements).
Figure 5. Word cloud for 492 terms and 7737 documents.
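A sketch of this PAM experiment, assuming the lsa and cluster packages; the bigram tokenizer and the way documents are projected into the LSA space are assumptions, since the exact code is not given in the text.

library(lsa)       # lsa() and cosine()
library(cluster)   # pam() and silhouette()

# Simple base-R bigram tokenizer for the TermDocumentMatrix
bigram_tokenizer <- function(x) {
  w <- unlist(strsplit(as.character(x), "\\s+"))
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
tdm_bi <- TermDocumentMatrix(docs, control = list(tokenize = bigram_tokenizer))

# LSA space, cosine dissimilarity (1 - cosine) between documents, then PAM with k = 250
space   <- lsa(as.matrix(tdm_bi))
doc_mat <- diag(space$sk) %*% t(space$dk)   # documents as columns of the reduced space
dissim  <- 1 - cosine(doc_mat)              # cosine() compares the columns
fit     <- pam(as.dist(dissim), k = 250, diss = TRUE)
plot(silhouette(fit))                       # silhouette widths (Figure 6)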
5. Conclusions and Future Work
Given the K-means statistics in Table 4, we can see that the within-cluster sum of squares is lower than the between-cluster sum of squares, as expected for a good separation among clusters.
Analysing the results in Figure 2, the number of topics found was around 250 (225-275 topics) using the "elbow" rule with a 10-percent error margin.
Using the LDA algorithm it was easier to identify the clusters and their content (the 3 most relevant terms of each), which makes sense: topic 2, for instance, is related to music/movies (soundtrack, documentari, screenplay), and some topics not shown for lack of space, such as topic 130 (tchaikovski, petersburg, conservatori), relate to the same major subject with different content. Some topics are mainly about war, such as topic 7 (counterattack, breakthrough, reichswehr); others are about politics, for example topic 11 (parliamentari, parliament, legitimaci).
The histogram in Figure 3 shows the topics with the highest number of documents assigned by LDA.
Figure 6. Silhouette using bigrams and cosine distance for 7737 documents.
By using bigrams we discovered some other terms with high relevance ("unit states", possibly to be read as United States after stemming; "world war"; "war ii"), all of which are consistent with the topics discovered by LDA and with the most relevant words found by the tf-idf measure in Table 3.
Unfortunately, PAM using cosine did not yield a good separation of the clusters, as indicated by the negative silhouette widths, possibly indicating that unigrams are the better approach for this problem.
As future work, we intend to investigate ways to present several topics and their terms for LDA in a word cloud or graph, to make it easier to visualize the output of this clustering algorithm.
References
[1] Michael Steinbach, George Karypis, and Vipin Kumar. A comparison of document clustering techniques. Department of Computer Science and Engineering, University of Minnesota, Technical Report 00-034, 2000.
[2] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney. Impact of similarity measures on web-page clustering. AAAI-2000: Workshop of Artificial Intelligence for Web Search, pages 58–64, 2000.
[3] Bettina Grün and Kurt Hornik. topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 2011.
[4] J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
[5] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.
[6] Margaret E. Roberts, Brandon M. Stewart, and Dustin Tingley. stm: R package for structural topic models. 2014.
[7] A. Reynolds, G. Richards, B. de la Iglesia, and V. Rayward-Smith. Clustering rules: A comparison of partitioning and hierarchical clustering algorithms. Journal of Mathematical Modelling and Algorithms, 5:475–504, 2006.