Paper by Mark M. Hall, Mark Stevenson and Paul D. Clough from the Information School / Department of Computer Science, University of Sheffield, UK
24-27 September 2012
TPDL 2012, Cyprus
Ensemble based Distributed K-Modes Clustering (IJERD Editor)
Clustering is the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an urgent need for effective approaches to distributed clustering. A distributed clustering algorithm clusters distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed on large datasets, but it cannot directly handle datasets with categorical attributes, which frequently occur in real-life data. Huang proposed the K-Modes clustering algorithm, which introduces a new dissimilarity measure for categorical data and replaces cluster means with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms in the literature target numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited both to handling categorical datasets and to performing the distributed clustering process asynchronously. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI Machine Learning Repository.
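The mechanics the abstract attributes to Huang's K-Modes (simple-matching dissimilarity, frequency-based mode updates) can be sketched in a few lines. This toy version, with hypothetical helper names, ignores the distributed/ensemble aspects of the paper:

```python
from collections import Counter

def matching_dissimilarity(a, b):
    # Huang's simple-matching measure: number of attributes whose values differ
    return sum(x != y for x, y in zip(a, b))

def update_mode(cluster):
    # the mode takes, per attribute, the most frequent category in the cluster
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(data, modes, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for row in data:
            # assign each record to the nearest mode under matching dissimilarity
            idx = min(range(len(modes)),
                      key=lambda i: matching_dissimilarity(row, modes[i]))
            clusters[idx].append(row)
        # frequency-based mode update; keep the old mode if a cluster empties
        modes = [update_mode(c) if c else modes[i] for i, c in enumerate(clusters)]
    return modes, clusters
```

Repeating assignment and mode update until the modes stabilize is what minimizes the cost function the abstract mentions.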
This document summarizes a research paper that evaluates cluster quality using a modified density subspace clustering approach. It discusses how density subspace clustering can be used to identify clusters in high-dimensional datasets by detecting density-connected clusters in all subspaces. The proposed approach uses a density subspace clustering algorithm to select attribute subsets and identify the best clusters. It then calculates intra-cluster and inter-cluster distances to evaluate cluster quality and compares the results to other clustering algorithms in terms of accuracy and runtime. Experimental results showed that the proposed method improves clustering quality and performs faster than existing techniques.
This document provides an overview of several clustering algorithms. It begins by defining clustering and its importance in data mining. It then categorizes clustering algorithms into four main types: partitional, hierarchical, grid-based, and density-based. For each type, some representative algorithms are described briefly. The document also reviews several popular clustering algorithms like k-means, CLARA, PAM, CLARANS, and BIRCH in more detail. It discusses aspects like the algorithms' time complexity, types of data handled, ability to detect clusters of different shapes, required input parameters, and advantages/disadvantages. Overall, the document aims to guide selection of suitable clustering algorithms for specific applications by surveying their key characteristics.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering, and robust representation. It covers two clustering algorithms, k-means and k-median. Clustering is illustrated through the unsupervised learning of patterns and clusters that may exist in a given database, and is a useful tool for Knowledge Discovery in Databases (KDD). The results of the k-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to group observations into clusters of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics, and related fields.
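Since the abstract treats k-means and k-median together, a toy 1-D sketch can show that the two differ only in the center-update step (mean vs. median). The function name is hypothetical, not from the paper:

```python
import statistics

def one_d_clusters(points, centers, iters=20, center_fn=statistics.mean):
    # Lloyd-style loop in 1-D; pass center_fn=statistics.median for k-median
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # assign each point to its nearest current center
            groups[min(range(len(centers)),
                       key=lambda i: abs(p - centers[i]))].append(p)
        # recompute centers; keep the old center if a group empties
        centers = [center_fn(g) if g else centers[i] for i, g in enumerate(groups)]
    return sorted(centers)
```

The median update makes k-median more robust to outliers, which may matter for medical data of the kind the paper analyzes.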
Bat-Cluster: A Bat Algorithm-based Automated Graph Clustering Approach (IJECEIAES)
The document presents a new approach called Bat-Cluster (BC) for automated graph clustering. BC combines the Fast Fourier Domain Positioning (FFDP) algorithm and the Bat Algorithm. FFDP positions graph nodes, then Bat Algorithm optimizes clustering by finding configurations that minimize the Davies-Bouldin Index. BC is tested on four benchmark graphs and outperforms Particle Swarm Optimization, Ant Colony Optimization, and Differential Evolution in providing higher clustering precision.
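The Davies-Bouldin Index that Bat-Cluster minimizes has a standard definition: the average, over clusters, of the worst-case ratio of within-cluster scatter to between-centroid separation. A minimal sketch for 2-D point clusters (not the paper's code):

```python
import math

def davies_bouldin(clusters):
    # clusters: list of lists of 2-D points; lower DBI = tighter, better-separated
    centroids = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    # average distance of each cluster's points to its centroid (scatter S_i)
    scatter = [sum(math.dist(p, centroids[i]) for p in pts) / len(pts)
               for i, pts in enumerate(clusters)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # worst-case similarity R_ij = (S_i + S_j) / d(c_i, c_j) over j != i
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k
```

A metaheuristic like the Bat Algorithm then searches for the clustering configuration with the smallest DBI value.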
This document summarizes a study on using the Mahout machine learning library to perform K-means clustering on large datasets using Hadoop. The study tested the performance of Mahout K-means clustering on Amazon EC2 instances using a 1.1GB network intrusion detection dataset. The results showed that Mahout was able to scale to utilize multiple nodes and significantly reduce clustering time as the dataset and number of nodes increased. Specifically, clustering was more than 3.5 times faster when using 5 nodes compared to a single node for the full 1.1GB dataset. The quality of the clustering, as measured by the sum of squared distances to the centroids, was maintained as Mahout leveraged Hadoop and multiple nodes to perform the clustering efficiently.
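The "sum of squared distances" quality metric referenced above is straightforward to state; a hypothetical 1-D sketch (not Mahout's implementation):

```python
def sse(points, centroids, assign):
    # within-cluster sum of squared distances: for each point, the squared
    # distance to the centroid it is assigned to; lower = tighter clustering
    return sum((p - centroids[assign[i]]) ** 2 for i, p in enumerate(points))
```

Comparing this value between the single-node and multi-node runs is one way to check that parallelization did not degrade cluster quality.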
Clustering Algorithm with a Novel Similarity Measure (IOSR Journals)
This document proposes a new multi-viewpoint based similarity measure for clustering text documents that aims to overcome limitations of existing measures. Existing measures use a single viewpoint to measure similarity between documents, but the proposed measure uses multiple viewpoints to ensure clusters exhibit all relationships between documents. The empirical study found that using a multi-viewpoint similarity measure forms more meaningful clusters by capturing more informative relationships between documents.
Nature Inspired Models And The Semantic Web (Stefan Ceriu)
In this paper we present a series of nature-inspired models used as alternative solutions for Semantic Web concerns. Some of the methods presented in this article outperform classic algorithms by improving response time and computational cost. Others are proofs of concept, first steps towards new techniques that will improve their respective fields. The intricate nature of the Semantic Web demands faster, more intelligent algorithms, and nature-inspired models have proven well suited to such complex tasks.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Improve the Performance of Clustering Using Combination of Multiple Clusterin... (ijdmtaiir)
The ever-increasing availability of textual documents has led to a growing challenge for information systems: effectively managing and retrieving the information contained in large collections of texts according to the user's information needs. No single clustering method can adequately handle all kinds of cluster structures and properties (e.g., shape, size, overlap, and density). Combining multiple clustering methods is one approach to overcoming the deficiencies of single algorithms and further improving their performance. A disadvantage of cluster ensembles is the high computational load of combining the clustering results, especially for large and high-dimensional datasets. In this paper we propose a multi-clustering algorithm that combines a Cooperative Hard-Fuzzy Clustering model, based on intermediate cooperation between hard k-means (KM) and fuzzy c-means (FCM) to produce better intermediate clusters, with an ant colony algorithm. The proposed method gives better results than the individual clusterings.
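One standard way to combine multiple clusterings, and a plausible source of the computational load the abstract mentions, is a co-association matrix. This sketch illustrates the general ensemble idea only, not the paper's cooperative KM/FCM/ant-colony scheme; all names are hypothetical:

```python
from itertools import combinations

def co_association(labelings, n):
    # m[i][j] = fraction of base clusterings placing points i and j together
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                m[i][j] += 1
                m[j][i] += 1
    return [[v / len(labelings) for v in row] for row in m]

def consensus_groups(labelings, n, threshold=0.5):
    # greedy merge: points co-clustered in a majority of runs share a group id
    m = co_association(labelings, n)
    group = list(range(n))
    for i, j in combinations(range(n), 2):
        if m[i][j] > threshold:
            gi, gj = group[i], group[j]
            group = [gi if g == gj else g for g in group]
    return group
```

The O(n^2) pairwise matrix is exactly why ensembles get expensive on large, high-dimensional datasets.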
A H-K Clustering Algorithm for High Dimensional Data Using Ensemble Learning (ijitcs)
The document summarizes a proposed clustering algorithm for high dimensional data that combines hierarchical (H-K) clustering, subspace clustering, and ensemble clustering. It begins with background on challenges of clustering high dimensional data and related work applying dimension reduction, subspace clustering, ensemble clustering, and H-K clustering individually. The proposed model first applies subspace clustering to identify clusters within subsets of features. It then performs H-K clustering on each subspace cluster. Finally, it applies ensemble clustering techniques to integrate the results into a single clustering. The goal is to leverage each technique's strengths to improve clustering performance for high dimensional data compared to using a single approach.
This document discusses semi-supervised text classification using unlabeled data called "Universum". Semi-supervised learning uses both labeled and unlabeled data for training to improve accuracy over supervised learning, which only uses labeled data. The document proposes using unlabeled "Universum" examples that do not belong to any class of interest along with labeled examples. Experimental results on Reuters datasets show the proposed algorithm can benefit from Universum examples, especially when the number of labeled examples is insufficient.
International Journal of Engineering Research and Development (IJERD), IJERD Editor
This document summarizes Chapter 10 of the book "Data Mining: Concepts and Techniques (3rd ed.)" which covers cluster analysis. The chapter introduces different types of clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods, density-based methods, and grid-based methods. It discusses how to evaluate the quality of clustering results and highlights considerations for cluster analysis such as similarity measures, clustering space, and challenges like scalability and high dimensionality.
Ontology Based Document Clustering Using MapReduce (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the features that represent those documents are also very numerous. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim of our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to exploit the semantic relations between words and enhance document clustering results. Our experimental results show that using lexical categories for nouns only improves internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
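The bisecting k-means at the core of the approach can be sketched sequentially in 1-D: repeatedly split the cluster with the largest squared error using a 2-means step. This illustrates the algorithm only, not the paper's MapReduce or WordNet components; the helper names are hypothetical:

```python
import statistics

def two_means_1d(points, iters=10):
    # split one group into two with a tiny 1-D 2-means, seeded at the extremes
    c = [min(points), max(points)]
    for _ in range(iters):
        halves = ([], [])
        for p in points:
            halves[abs(p - c[0]) > abs(p - c[1])].append(p)
        c = [statistics.mean(h) if h else c[i] for i, h in enumerate(halves)]
    return [h for h in halves if h]

def squared_error(points):
    mu = statistics.mean(points)
    return sum((p - mu) ** 2 for p in points)

def bisecting_kmeans(points, k):
    # repeatedly bisect the cluster with the largest squared error until k remain
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=squared_error)
        clusters.remove(worst)
        clusters.extend(two_means_1d(worst))
    return clusters
```

In the distributed version, the point-assignment step of each bisection is what MapReduce parallelizes across the document collection.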
An Empirical Study for Defect Prediction using Clustering (idescitation)
Reliably predicting defects in software is one of the holy grails of software engineering. Researchers have devised and implemented defect prediction approaches varying in accuracy, complexity, and the input data they require. An accurate prediction of the number of defects in a software product during system testing contributes not only to the management of the system testing process but also to the estimation of the product's required maintenance [1]. A prediction of the number of remaining defects in an inspected artefact can likewise be used for decision making. Defective software modules cause software failures, increase development and maintenance costs, and decrease customer satisfaction. Defect prediction strives to improve software quality and testing efficiency by constructing predictive models from code attributes to enable timely identification of fault-prone modules [2]. In this paper, we discuss clustering techniques used for software defect prediction, which help developers detect software defects and correct them. Unsupervised techniques may be used for defect prediction in software modules, especially in cases where defect labels are not available [3].
In this paper we describe NoSQL, a series of non-relational database technologies and products developed to address the current problems RDBMS systems are facing: lack of true scalability, poor performance on high data volumes, and low availability. Some of these products are already in production and perform very well: Amazon's Dynamo, Google's Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.
An Iterative Improved k-means Clustering (IDES Editor)
This document presents a new iterative improved k-means clustering algorithm.
The k-means clustering algorithm is widely used but depends on random initial starting points, which can impact the results. The new algorithm aims to provide better initial starting points to improve k-means clustering results.
The algorithm divides the data into K initial groups, calculates new cluster centers iteratively using a distance-based formula, assigns data points to clusters, and repeats until cluster centers no longer change. Experimental results on several datasets show the new algorithm converges in fewer iterations than standard k-means, demonstrating it finds better cluster solutions.
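The abstract does not give the paper's exact center-initialization formula, so as an illustration of the general idea of non-random starting points, here is a common farthest-first seeding heuristic (an assumption for illustration, not the paper's method):

```python
import math

def farthest_first_centers(points, k):
    # start from the point nearest the overall mean, then repeatedly pick the
    # point farthest from all centers chosen so far; spreads seeds over the data
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    centers = [min(points, key=lambda p: math.dist(p, mean))]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers
```

Seeds chosen this way are well separated, which tends to let the subsequent k-means loop converge in fewer iterations than random initialization, the improvement the paper reports.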
(2016) Application of Parallel Glowworm Swarm Optimization Algorithm for Data ... (Akram Pasha)
The document describes a research article that proposes applying a parallel glowworm swarm optimization algorithm for clustering large unstructured datasets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids. It employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed the algorithm scales well with increasing dataset sizes and achieves near-linear speedup while maintaining high-quality clustering, demonstrating it is more efficient than traditional algorithms for clustering large unstructured data.
Distribution Similarity based Data Partition and Nearest Neighbor Search on U... (Editor IJMTER)
Databases are built with a fixed number of fields and records, whereas an uncertain database contains a varying number of fields and records. Clustering techniques are used to group relevant records based on similarity values. Conventional similarity measures are designed to estimate the relationship between transactions with fixed attributes; similarity over uncertain data is estimated using these measures with some modifications.
Clustering uncertain data is one of the essential tasks in mining uncertain data. Existing methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data. Such methods cannot handle uncertain objects well, because probability distributions, which are essential characteristics of uncertain objects, are not considered when measuring similarity between them.
Customer purchase transaction data is analyzed using an uncertain-data clustering scheme. A density-based clustering mechanism is used for the uncertain-data clustering process, but this model produces results with low accuracy. The clustering technique is therefore improved with a distribution-based similarity model for uncertain data, and a nearest neighbor search technique is applied in the distribution-based data environment. The system is implemented with Java as the front end and Oracle as the back end.
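A distribution-based similarity of the kind described can be illustrated with symmetrised KL divergence over discrete distributions. This is an illustrative choice only; the paper's exact measure is not specified in the abstract:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(P || Q) for aligned discrete distributions; eps guards against log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def nearest_distribution(query, candidates):
    # symmetrised KL gives a crude distribution-level distance for NN search
    sym = lambda a, b: kl_divergence(a, b) + kl_divergence(b, a)
    return min(range(len(candidates)), key=lambda i: sym(query, candidates[i]))
```

Comparing whole distributions rather than single attribute values is what lets nearest neighbor search respect the uncertainty in each object.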
Feature Subset Selection for High Dimensional Data Using Clustering Techniques (IRJET Journal)
The document discusses feature subset selection for high dimensional data using clustering techniques. It proposes the FAST algorithm which has three steps: 1) remove irrelevant features, 2) divide features into clusters using DBSCAN, and 3) select the most representative feature from each cluster. DBSCAN is a density-based clustering algorithm that can identify clusters of varying densities and detect outliers. The FAST algorithm is evaluated to select a small number of discriminative features from high dimensional data in an efficient manner. It aims to remove irrelevant and redundant features to improve predictive accuracy while handling large feature sets.
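Step 2's density-based clustering can be sketched minimally. This toy DBSCAN works on any points given a distance function; it is not the FAST implementation, and the names are hypothetical:

```python
def dbscan(points, eps, min_pts, dist):
    # minimal DBSCAN: expand density-reachable neighborhoods; -1 marks outliers
    labels = {p: None for p in points}
    cid = 0
    for p in points:
        if labels[p] is not None:
            continue
        neigh = [q for q in points if dist(p, q) <= eps]
        if len(neigh) < min_pts:
            labels[p] = -1          # provisionally noise; may become a border point
            continue
        labels[p] = cid
        seeds = list(neigh)
        while seeds:
            q = seeds.pop()
            if labels[q] in (None, -1):
                labels[q] = cid
                q_neigh = [r for r in points if dist(q, r) <= eps]
                if len(q_neigh) >= min_pts:   # q is a core point: keep expanding
                    seeds.extend(r for r in q_neigh if labels[r] in (None, -1))
        cid += 1
    return labels
```

The ability to leave low-density points unclustered (label -1) is what lets the FAST approach treat isolated features as outliers rather than forcing them into a cluster.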
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel... (IJERD Editor)
This document discusses distance similarity measures that can be used for data mining classification and clustering techniques. It proposes a novel distance similarity measure called "Supervised & Unsupervised learning" that uses Euclidean distance similarity to partition training data into clusters. It then builds decision trees on each cluster to improve classification performance. The document also discusses using these measures for other applications like image processing, where k-means clustering can be used to segment images into clusters of similar pixel intensities. In conclusion, it states these similarity measures can help analyze complex datasets for business analysis purposes.
Iaetsd: A Survey on One Class Clustering (Iaetsd)
This document presents a new method for performing one-to-many data linkage called the One Class Clustering Tree (OCCT). The OCCT builds a tree structure whose inner nodes represent features of the first dataset and whose leaves represent similar features of the second dataset. It uses splitting criteria and pruning methods to perform the data linkage more accurately than existing indexing techniques. The OCCT approach induces a decision tree using a splitting criterion and applies pre-pruning to determine which branches to trim. It then compares entities to match them between the two datasets and produces a final result.
Simplicial Closure & Higher-Order Link Prediction (Austin Benson)
The document discusses higher-order link prediction in networks. It summarizes previous work representing higher-order interactions as tensors, hypergraphs, etc. It then proposes evaluating models of higher-order data using "higher-order link prediction" to predict which groups of more than two nodes will interact based on past data. The authors analyze dynamics of triadic closure in several real-world networks and propose methods to predict closure based on structural properties like edge weights.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
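The cluster-to-cluster distances computed during AGNES-style merging can be illustrated with single linkage (closest pair across clusters). A minimal 1-D sketch; the helper names are hypothetical:

```python
def single_link(a, b, dist):
    # distance between clusters = distance of the closest pair across them
    return min(dist(p, q) for p in a for q in b)

def agnes(points, k, dist, linkage=single_link):
    # AGNES: start from singletons, repeatedly merge the closest pair of clusters
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Swapping `single_link` for a complete-link or average-link function changes only the merge order, which is exactly the inter-cluster distance choice the overview describes.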
Classification of common clustering algorithms and techniques, e.g., hierarchical clustering, distance measures, k-means, squared error, SOFM, and clustering of large databases.
IND-2012-300 Mother's Pet Kindergarten Nagpur - A U-turn for Traffic Rules (designforchangechallenge)
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
Public SAN Certificate Deprecated Support in the Internal Server Name Part ... (Eyal Doron)
The document discusses changes to the standards for public SAN certificates by the CA/Browser Forum. As of November 1, 2015, certificates will no longer be able to include non-unique names, also called internal server names, that cannot be externally verified. This is because such names pose a security risk, as they are not truly unique and could be falsely represented. The changes will affect services that currently use certificates for internal server names, requiring administrators to prepare by obtaining new certificates without such names before the deadline. Failure to do so could result in issues ranging from annoyances to catastrophic consequences for their systems and services.
Improve the Performance of Clustering Using Combination of Multiple Clusterin...ijdmtaiir
The ever-increasing availability of textual
documents has lead to a growing challenge for information
systems to effectively manage and retrieve the information
comprised in large collections of texts according to the user’s
information needs. There is no clustering method that can
adequately handle all sorts of cluster structures and properties
(e.g. shape, size, overlapping, and density). Combining
multiple clustering methods is an approach to overcome the
deficiency of single algorithms and further enhance their
performances. A disadvantage of the cluster ensemble is the
highly computational load of combing the clustering results
especially for large and high dimensional datasets. In this paper
we propose a multiclustering algorithm , it is a combination of
Cooperative Hard-Fuzzy Clustering model based on
intermediate cooperation between the hard k-means (KM) and
fuzzy c-means (FCM) to produce better intermediate clusters
and ant colony algorithm. This proposed method gives better
result than individual clusters.
A h k clustering algorithm for high dimensional data using ensemble learningijitcs
The document summarizes a proposed clustering algorithm for high dimensional data that combines hierarchical (H-K) clustering, subspace clustering, and ensemble clustering. It begins with background on challenges of clustering high dimensional data and related work applying dimension reduction, subspace clustering, ensemble clustering, and H-K clustering individually. The proposed model first applies subspace clustering to identify clusters within subsets of features. It then performs H-K clustering on each subspace cluster. Finally, it applies ensemble clustering techniques to integrate the results into a single clustering. The goal is to leverage each technique's strengths to improve clustering performance for high dimensional data compared to using a single approach.
This document discusses semi-supervised text classification using unlabeled data called "Universum". Semi-supervised learning uses both labeled and unlabeled data for training to improve accuracy over supervised learning, which only uses labeled data. The document proposes using unlabeled "Universum" examples that do not belong to any class of interest along with labeled examples. Experimental results on Reuters datasets show the proposed algorithm can benefit from Universum examples, especially when the number of labeled examples is insufficient.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
This document summarizes Chapter 10 of the book "Data Mining: Concepts and Techniques (3rd ed.)" which covers cluster analysis. The chapter introduces different types of clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods, density-based methods, and grid-based methods. It discusses how to evaluate the quality of clustering results and highlights considerations for cluster analysis such as similarity measures, clustering space, and challenges like scalability and high dimensionality.
Ontology Based Document Clustering Using MapReduce (ijdms)
Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the features that represent those documents are also very numerous. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
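Setting the ontology and MapReduce machinery aside, the core bisecting k-means loop the paper builds on is small enough to sketch in pure Python. This is a serial toy version under my own assumptions, not the authors' distributed implementation:

```python
import random

def kmeans2(points, iters=20, seed=0):
    """Plain 2-means on a list of coordinate tuples; returns two sub-lists."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        if not groups[0] or not groups[1]:
            return groups  # degenerate split; stop early
        # Recompute each center as the coordinate-wise mean of its group.
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups]
    return groups

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        biggest = max(clusters, key=len)
        clusters.remove(biggest)
        left, right = kmeans2(biggest)
        clusters += [left, right]
    return clusters
```

On two well-separated blobs, e.g. `bisecting_kmeans([(0,0),(0,1),(1,0),(10,10),(10,11),(11,10)], 2)`, the loop recovers the two groups of three points each.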
An Empirical Study for Defect Prediction using Clustering (idescitation)
Reliably predicting defects in the software is one of
the holy grails of software engineering. Researchers have
devised and implemented a range of defect prediction
approaches varying in terms of accuracy, complexity, and the
input data they require. An accurate prediction of the number
of defects in a software product during system testing
contributes not only to the management of the system testing
process but also to the estimation of the product’s required
maintenance [1]. A prediction of the number of remaining
defects in an inspected artefact can be used for decision making.
Defective software modules cause software failures, increase
development and maintenance costs, and decrease customer
satisfaction. Defect prediction strives to improve software
quality and testing efficiency by constructing predictive models
from code attributes to enable a timely identification of
fault-prone modules [2]. In this paper, we discuss how clustering
techniques are used for software defect prediction, which helps
developers to detect software defects and correct them.
Unsupervised techniques may be used for defect prediction in
software modules, more so in those cases where defect labels
are not available [3].
In this paper we describe NoSQL, a series of non-relational database technologies and products developed to address the current problems RDBMS systems are facing: lack of true scalability, poor performance on high data volumes and low availability. Some of these products are already used in production and perform very well: Amazon’s Dynamo, Google’s Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.
An Iterative Improved k-means Clustering (IDES Editor)
This document presents a new iterative improved k-means clustering algorithm.
The k-means clustering algorithm is widely used but depends on random initial starting points, which can impact the results. The new algorithm aims to provide better initial starting points to improve k-means clustering results.
The algorithm divides the data into K initial groups, calculates new cluster centers iteratively using a distance-based formula, assigns data points to clusters, and repeats until cluster centers no longer change. Experimental results on several datasets show the new algorithm converges in fewer iterations than standard k-means, demonstrating it finds better cluster solutions.
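One way to read the idea of replacing random starting points is to seed k-means deterministically from the sorted data. The following 1-D sketch is my own illustration of that strategy, not the paper's exact formula:

```python
def seeded_kmeans(values, k, iters=50):
    """K-means on 1-D data, seeded by splitting the sorted data into k
    equal-size groups instead of picking random start points."""
    data = sorted(values)
    size = len(data) // k
    # Initial centers: mean of each sorted chunk (fully deterministic).
    centers = [sum(data[i * size:(i + 1) * size]) / size for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in data:
            idx = min(range(k), key=lambda i: (v - centers[i]) ** 2)
            groups[idx].append(v)
        new = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
        if new == centers:
            break  # cluster centers no longer change
        centers = new
    return centers, groups

centers, groups = seeded_kmeans([12, 1, 11, 2, 10, 3], 2)
print(centers)  # → [2.0, 11.0]
```

Because the seeds already sit near the true group means, the loop here converges in a single pass, which is the behaviour the paper's experiments report in terms of fewer iterations.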
(2016) Application of parallel glowworm swarm optimization algorithm for data ... (Akram Pasha)
The document describes a research article that proposes applying a parallel glowworm swarm optimization algorithm for clustering large unstructured data sets. The algorithm uses optimized glowworm swarms to evaluate the clustering problem and find multiple cluster centroids. It employs the MapReduce framework for parallelization, which balances the load, localizes the data, and provides fault tolerance. Experiments showed the algorithm scales well with increasing data set sizes and achieves near-linear speedup while maintaining high-quality clustering, demonstrating it is more efficient than traditional algorithms for clustering large unstructured data.
Distribution Similarity based Data Partition and Nearest Neighbor Search on U... (Editor IJMTER)
Databases are built with a fixed number of fields and records. An uncertain database contains a varying number of fields and records. Clustering techniques are used to group related records based on similarity values. Conventional similarity measures are designed to estimate the relationship between transactions with fixed attributes; the similarity of uncertain data is estimated using these measures with some modifications.
Clustering uncertain data is one of the essential tasks in mining uncertain data. Existing methods extend traditional partitioning clustering methods such as k-means and density-based clustering methods such as DBSCAN to uncertain data. Such methods cannot fully handle uncertain objects, because probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between them.
The customer purchase transaction data is analyzed using an uncertain data clustering scheme. A density-based clustering mechanism is used for the uncertain data clustering process, but this model produces results with low accuracy. The clustering technique is therefore improved with a distribution-based similarity model for uncertain data, and a nearest neighbor search technique is applied in the distribution-based data environment. The system is implemented with Java as the front end and Oracle as the back end.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Feature Subset Selection for High Dimensional Data Using Clustering Techniques (IRJET Journal)
The document discusses feature subset selection for high dimensional data using clustering techniques. It proposes the FAST algorithm which has three steps: 1) remove irrelevant features, 2) divide features into clusters using DBSCAN, and 3) select the most representative feature from each cluster. DBSCAN is a density-based clustering algorithm that can identify clusters of varying densities and detect outliers. The FAST algorithm is evaluated to select a small number of discriminative features from high dimensional data in an efficient manner. It aims to remove irrelevant and redundant features to improve predictive accuracy while handling large feature sets.
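The "keep one representative per cluster of correlated features" idea can be illustrated with a greedy correlation filter. This is a much simpler stand-in for FAST's DBSCAN step, with function names of my own choosing:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def select_features(columns, threshold=0.9):
    """Greedily keep a feature only if it is not highly correlated with any
    already-kept feature, so each cluster of redundant features contributes
    one representative."""
    kept = []
    for name, col in columns.items():
        if all(abs(pearson(col, columns[r])) < threshold for r in kept):
            kept.append(name)
    return kept
```

For example, with `{"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 7, 2]}` the redundant column `b` (perfectly correlated with `a`) is dropped and `["a", "c"]` survive.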
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel... (IJERD Editor)
This document discusses distance similarity measures that can be used for data mining classification and clustering techniques. It proposes a novel distance similarity measure called "Supervised & Unsupervised learning" that uses Euclidean distance similarity to partition training data into clusters. It then builds decision trees on each cluster to improve classification performance. The document also discusses using these measures for other applications like image processing, where k-means clustering can be used to segment images into clusters of similar pixel intensities. In conclusion, it states these similarity measures can help analyze complex datasets for business analysis purposes.
A survey on one class clustering (Iaetsd)
This document presents a new method for performing one-to-many data linkage called the One Class Clustering Tree (OCCT). The OCCT builds a tree structure with inner nodes representing features of the first dataset and leaves representing similar features of the second dataset. It uses splitting criteria and pruning methods to perform the data linkage more accurately than existing indexing techniques. The OCCT approach induces a decision tree using a splitting criteria and performs prepruning to determine which branches to trim. It then compares entities to match them between the two datasets and produces a final result.
Simplicial closure & higher-order link prediction (Austin Benson)
The document discusses higher-order link prediction in networks. It summarizes previous work representing higher-order interactions as tensors, hypergraphs, etc. It then proposes evaluating models of higher-order data using "higher-order link prediction" to predict which groups of more than two nodes will interact based on past data. The authors analyze dynamics of triadic closure in several real-world networks and propose methods to predict closure based on structural properties like edge weights.
The document discusses different clustering methods including partitioning, hierarchical, density-based, and grid-based approaches. It provides details on popular partitioning algorithms like k-means and k-medoids, describing how they work, their strengths and weaknesses. Hierarchical clustering methods like AGNES and DIANA are also covered, including how distances between clusters are calculated during the merging or splitting process.
Classification of common clustering algorithm and techniques, e.g., hierarchical clustering, distance measures, K-means, Squared error, SOFM, Clustering large databases.
IND-2012-300 Mother's Pet Kindergarten Nagpur - A U-turn for traffic rules (designforchangechallenge)
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
Public SAN certificate deprecated support in the internal server name part ... (Eyal Doron)
The document discusses changes to the standards for public SAN certificates by the CA/Browser Forum. As of November 1, 2015, certificates will no longer be able to include non-unique names, also called internal server names, that cannot be externally verified. This is because such names pose a security risk, as they are not truly unique and could be falsely represented. The changes will affect services that currently use certificates for internal server names, requiring administrators to prepare by obtaining new certificates without such names before the deadline. Failure to do so could result in issues ranging from annoyances to catastrophic consequences for their systems and services.
Mail migration to Office 365 optimizing the mail migration throughput - par... (Eyal Doron)
This document provides recommendations for optimizing the throughput of migrating email from an on-premises Exchange server to Office 365. It discusses factors like available bandwidth, network infrastructure tuning, Exchange server performance, and using multiple Exchange sites in parallel. Specific tips include using non-work hours for migration, checking for bottlenecks, adjusting the MRSProxy resource allocation, and taking advantage of hybrid migration methods to reduce migrated data size. The overall goal is to understand limitations and make adjustments to complete large-scale email migrations within tight windows.
Comparing taxonomies for organising collections of documents (pathsproject)
This document compares different taxonomies for organizing large collections of documents. It examines four existing manually created taxonomies (Library of Congress Subject Headings, WordNet Domains, Wikipedia Taxonomy, DBpedia) and two methods for automatically deriving taxonomies (WikiFreq and LDA topics) for organizing a large online cultural heritage collection from Europeana. It then presents two human evaluations of the taxonomies, measuring cohesion and analyzing concept relations, and finds that the manual taxonomies have high-quality relations while the novel automatic method generates very high cohesion.
Semantic Enrichment of Cultural Heritage content in PATHS (pathsproject)
Semantic Enrichment of Cultural Heritage content in PATHS, report by Mark Stevenson and Arantxa Otegi with Eneko Agirre, Nikos Aletras, Paul Clough, Samuel Fernando and Aitor Saroa.
The aim of the PATHS project is to enable exploration and discovery within cultural heritage collections. In order to support this the project developed a range of enrichment techniques which augmented these collections with additional information to enhance the users’ browsing experience. One of the demonstration systems developed in PATHS makes use of content from Europeana. This document summarises the semantic enrichment techniques developed in PATHS, with particular reference to their application to the Europeana data.
My E-mail appears as spam - Troubleshooting path | Part 11#17 (Eyal Doron)
http://o365info.com/my-e-mail-appears-as-spam-troubleshooting-path-part-11-17
Troubleshooting scenario of internal \ outbound spam in Office 365 and Exchange Online environment.
Verifying if our domain name is blacklisted, verifying if the problem is related to E-mail content, verifying if the problem is related to a specific organization user E-mail address, and moving the troubleshooting process to the “other side”.
Eyal Doron | o365info.com
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us... (pathsproject)
This document introduces distributional semantic similarity methods for automatically measuring the coherence of topics generated by topic models. It constructs semantic spaces to represent topic words using Wikipedia as a reference corpus. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information. Topic coherence is determined by measuring the distance between word vectors. Evaluation on three datasets shows distributional measures outperform the state-of-the-art approach, with performance improving using a reduced semantic space.
Client protocol connectivity flow in Exchange 2013/2007 coexistence | Introduction and basic concepts| 1/4 | 16#23
http://o365info.com/client-protocol-connectivity-flow-in-exchange-2013-2007-coexistence-environment-introduction-and-basic-concepts-14/
Reviewing the subject of – client protocol connectivity flow, in an Exchange 2013/2007 coexistence environment (this is the first article, in a series of four articles).
Eyal Doron | o365info.com
Exchange Public infrastructure | Public versus non-Public facing Exchange site | 5#23 (Eyal Doron)
http://o365info.com/exchange-public-infrastructure-public-versus-non-public-facing-exchange-site/
Reviewing the concept of - Public versus non Public facing Exchange.
Eyal Doron | o365info.com
This report accompanies a deliverable from the PATHS project which processed and enriched content from cultural heritage collections for use in a prototype. It describes the collections included from Europeana and a private collection. The content was analyzed using natural language processing and enriched with terms from controlled vocabularies. Links between items within and across collections were also computed based on text similarity. The deliverable contains the processed data along with documentation of the methodologies used to enrich and link the cultural heritage content.
This document describes the customer contact management services provided by Arema, including email handling, live chat support, mobile response management, social media customer support, and multi-channel customer support. Arema helps customers provide excellent support across multiple channels while typically achieving 66% savings. They use the latest technologies to increase efficiency and reduce costs without compromising on quality, being ISO 9001 certified. Arema serves customers in industries such as retail, e-commerce, financial services, healthcare, and more.
A digital id or digital certificate consists of a public and private key pair that can be used to encrypt documents and files so that only the intended recipient can read them, and to digitally sign documents to verify the sender and ensure the document hasn't been altered. The document discusses how public and private keys work, with the public key used to encrypt and the private key needed to decrypt. It states that a digital signature guarantees the identity of the sender and confirms the document hasn't been modified during transmission.
Zorg Dichtbij: an answer to sharply rising expenditure, ineffectiveness and hervo... (BMC)
This whitepaper describes BMC's vision on the topic of Zorg Dichtbij ("care close to home") and offers clients in local government practical guidance for carrying out their growing set of tasks and their renewed role in the organisation and governance of care and support. We combine current information with our practical experience and present possible solutions, in the conviction that substantial savings are possible if we are willing to redesign existing processes and invest in strengthening primary and intermediate care.
BMC provides support at more than 250 locations in the Netherlands for both the transition (the introduction of statutory tasks under decentralisation) and the transformation (an innovative, coherent approach to restructuring) of the social domain and the care domain.
For more information, please contact Jan Roose, director of Care and programme manager for "Zorg Dichtbij" and the Transition of the Social Domain, by e-mail: janroose@bmc.nl.
School songs are important for creating a sense of community and tradition among students and teachers. Songs such as the school anthem, or chants sung during sporting events, help unite everyone under the school's colours and values. These songs are passed down from generation to generation and help keep the school's history and spirit alive through the years.
Multilevel techniques for the clustering problem (csandit)
Data Mining is concerned with the discovery of interesting patterns and knowledge in data
repositories. Cluster Analysis which belongs to the core methods of data mining is the process
of discovering homogeneous groups called clusters. Given a data-set and some measure of
similarity between data objects, the goal in most clustering algorithms is maximizing both the
homogeneity within each cluster and the heterogeneity between different clusters. In this work,
two multilevel algorithms for the clustering problem are introduced. The multilevel paradigm suggests looking at the clustering problem as a hierarchical optimization process going through different levels, evolving from a coarse-grain to a fine-grain strategy. The clustering problem is solved by first reducing the problem, level by level, to a coarser problem for which an initial clustering is computed. The clustering of the coarser problem is then mapped back, level by level, to obtain a better clustering of the original problem by refining the intermediate clusterings obtained at the various levels. A benchmark using a number of data sets collected from a variety of domains is used to compare the effectiveness of the hierarchical approach against its single-level counterpart.
Recent Trends in Incremental Clustering: A Review (IOSRjournaljce)
This document provides a review of recent trends in incremental clustering algorithms. It discusses clustering methods based on both similarity measures and those not based on similarity measures. Specific incremental clustering algorithms covered include single-pass clustering, k-nearest neighbors clustering, suffix tree clustering, incremental DBSCAN, and ICIB (incremental clustering based on information bottleneck theory). The document also reviews various techniques for clustering, including particle swarm optimization, ant colony optimization, and genetic algorithms. Applications of genetic algorithm based clustering are discussed.
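Single-pass (leader) clustering, the simplest incremental scheme the review mentions, fits in a few lines. This is a generic textbook sketch, not any specific paper's code:

```python
def leader_cluster(stream, threshold):
    """Single-pass leader clustering: each arriving point joins the first
    leader within `threshold` distance, otherwise it becomes a new leader."""
    leaders = []
    members = []
    for p in stream:
        for i, l in enumerate(leaders):
            if sum((a - b) ** 2 for a, b in zip(p, l)) ** 0.5 <= threshold:
                members[i].append(p)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(p)
            members.append([p])
    return leaders, members
```

Because every point is examined exactly once, the method is naturally incremental: new data can keep arriving after the initial pass without reclustering what was seen before.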
Enhanced Clustering Algorithm for Processing Online Data (IOSR Journals)
This document proposes an enhanced incremental clustering algorithm for processing online data. It discusses existing clustering algorithms like leader clustering and hierarchical clustering, which have limitations in handling dynamic data. The proposed algorithm aims to dynamically create initial clusters and rearrange clusters based on data characteristics, allowing for more accurate clustering of online data over time. It also introduces a new frequency-based method for searching and retrieving specific data from fixed clusters.
This document discusses web document clustering using a hybrid approach in data mining. It begins with an abstract describing the huge amount of data on the internet and need to organize web documents into clusters. It then discusses requirements for document clustering like scalability, noise tolerance, and ability to present concise cluster summaries. Different existing document clustering approaches are described, including text-based and link-based approaches. The proposed approach uses a concept-based mining model along with hierarchical agglomerative clustering and link-based algorithms to cluster web documents based on both their content and hyperlinks. This hybrid approach aims to provide more relevant clustered documents to users than previous methods.
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
This document discusses multidimensional clustering methods for data mining and their industrial applications. It begins with an introduction to clustering, including definitions and goals. Popular clustering algorithms are described, such as K-means, fuzzy C-means, hierarchical clustering, and mixture of Gaussians. Distance measures and their importance in clustering are covered. The K-means and fuzzy C-means algorithms are explained in detail. Examples are provided to illustrate fuzzy C-means clustering. Finally, applications of clustering algorithms in fields such as marketing, biology, and earth sciences are mentioned.
A Density Based Clustering Technique For Large Spatial Data Using Polygon App... (IOSR Journals)
This document presents a density-based clustering technique called TDCT (Triangle-density based clustering technique) for efficiently clustering large spatial datasets. The technique uses a polygon approach where the number of data points inside each triangle of a polygon is calculated. If the ratio of point densities between two neighboring triangles exceeds a threshold, the triangles are merged into the same cluster. The technique is capable of identifying clusters of arbitrary shapes and densities. Experimental results demonstrate the technique has superior cluster quality and complexity compared to other methods.
This document discusses using machine learning clustering algorithms to analyze stock market data. It compares the K-means, COBWEB, DBSCAN, EM and OPTICS clustering algorithms in the WEKA tool on a stock market dataset containing 420 instances and 6 attributes. The K-means algorithm had the best performance with the lowest error and fastest runtime. It clustered the data into 4 groups in 0.16 seconds. The COBWEB algorithm clustered the data into 107 groups in 27.88 seconds. The DBSCAN algorithm found 21 clusters in 3.97 seconds. The paper concludes that K-means is best suited for stock market data mining applications due to its simplicity and speed compared to other algorithms.
The document provides an overview of different clustering methods including partitioning methods like k-means and k-medoids, hierarchical methods like agglomerative and divisive, and density-based methods like DBSCAN and OPTICS. It discusses the basic concepts of clustering, requirements for effective clustering like scalability and ability to handle different data types and shapes. It also summarizes clustering algorithms like BIRCH that aim to improve scalability for large datasets.
This document discusses hierarchical clustering and similarity measures for document clustering. It summarizes that hierarchical clustering creates a hierarchical decomposition of data objects through either agglomerative or divisive approaches. The success of clustering depends on the similarity measure used, with traditional measures using a single viewpoint, while multiviewpoint measures use different viewpoints to increase accuracy. The paper then focuses on applying a multiviewpoint similarity measure to hierarchical clustering of documents.
This document summarizes a research paper on applying a multiviewpoint-based similarity measure to hierarchical document clustering. It begins by introducing document clustering and hierarchical clustering. It then discusses traditional similarity measures used for clustering and introduces a new multiviewpoint-based similarity measure (MVS) that uses multiple reference points to more accurately assess similarity. The paper applies MVS to both hierarchical and k-means clustering algorithms and evaluates the accuracy, precision, and recall of the resulting clusters. It finds that hierarchical clustering with MVS achieves better performance than k-means clustering with MVS based on these evaluation metrics.
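The multiviewpoint idea is easy to state: rate the similarity of two documents di and dj from the vantage of each reference point dh outside their cluster, then average. A minimal sketch in vector form follows; the paper's exact normalization may differ:

```python
def mvs(di, dj, references):
    """Multiviewpoint similarity: average over reference points dh of the
    dot product between (di - dh) and (dj - dh)."""
    total = 0.0
    for dh in references:
        u = [a - b for a, b in zip(di, dh)]
        v = [a - b for a, b in zip(dj, dh)]
        total += sum(x * y for x, y in zip(u, v))
    return total / len(references)

# Seen from a distant reference point, two nearby documents look similar.
print(mvs((1, 0), (0, 1), [(-1, -1)]))  # → 4.0
```

With a single traditional origin as the only reference point, this reduces to the ordinary dot product; using many outside-cluster viewpoints is what makes the assessment more informative.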
This document summarizes a research paper on clustering algorithms in data mining. It begins by defining clustering as an unsupervised learning technique that organizes unlabeled data into groups of similar objects. The document then reviews different types of clustering algorithms and methods for evaluating clustering results. Key steps in clustering include feature selection, algorithm selection, and cluster validation to assess how well the derived groups represent the underlying data structure. A variety of clustering algorithms exist and must be chosen based on the problem characteristics.
Incremental learning from unbalanced data with concept class, concept drift a... (IJDKP)
Recently, stream data mining applications have drawn vital attention from several research communities. Stream data is a continuous form of data distinguished by its online nature. Traditionally, the machine learning field has developed learning algorithms that make certain assumptions about the underlying distribution of the data, such as that the data follow a predetermined distribution. Such constraints on the problem domain paved the way for the development of learning algorithms whose performance is theoretically verifiable. Real-world situations differ from this restricted model. Applications usually suffer from problems such as unbalanced data distributions. Additionally, data drawn from non-stationary environments are also common in real-world applications, resulting in the "concept drift" associated with data stream examples. These issues have been addressed separately by researchers, and the joint problem of class imbalance and concept drift has received relatively little research. If the final objective of intelligent machine learning techniques is to address a broad spectrum of real-world applications, then the necessity of a universal framework for learning from, and adapting to, environments where concept drift may occur and unbalanced data distributions are present can hardly be exaggerated. In this paper, we first present an overview of the issues observed in stream data mining scenarios, followed by a complete review of recent research dealing with each issue.
This document summarizes a research paper that proposes a new density-based clustering technique called Triangle-Density Based Clustering Technique (TDCT) to efficiently cluster large spatial datasets. TDCT uses a polygon approach where the number of data points inside each triangle of a polygon is calculated to determine triangle densities. Triangle densities are used to identify clusters based on a density confidence threshold. The technique aims to identify clusters of arbitrary shapes and densities while minimizing computational costs. Experimental results demonstrate the technique's superiority in terms of cluster quality and complexity compared to other density-based clustering algorithms.
Data mining is used to manage the huge amount of information stored in data warehouses and databases, in order to discover required information and knowledge. Numerous data mining techniques have been proposed, for example association rules, decision trees, neural networks, clustering, and so on, and the field has been a focus of attention for many years. A well-known technique among the available data mining strategies is clustering of the dataset; it is one of the most effective data mining methods. It groups the dataset into a number of clusters based on certain predefined guidelines and can reliably discover the connections between the distinct characteristics of the data.
In the k-means clustering algorithm, the function is selected on the basis of its relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed when clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.
The problem of the accuracy and redundancy of dissimilar points in the clusters remains in the improved k-means, for which a new enhanced approach is proposed that uses a similarity function to check the similarity level of a point before including it in a cluster.
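One plain reading of "check the similarity level of a point before including it in a cluster" is a distance gate on top of the usual nearest-centroid assignment. The sketch below is hypothetical; the threshold semantics are my assumption, not taken from the paper:

```python
def assign_with_similarity(points, centers, max_dist):
    """Assign each point to its nearest center only if it lies within
    max_dist; points farther away are held out as 'unassigned' instead of
    being forced into a dissimilar cluster."""
    clusters = {i: [] for i in range(len(centers))}
    unassigned = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5
                 for c in centers]
        i = dists.index(min(dists))
        if dists[i] <= max_dist:
            clusters[i].append(p)
        else:
            unassigned.append(p)
    return clusters, unassigned
```

A point equidistant from both centers but far from each, such as (5, 5) with centers (0, 0) and (10, 10) and a gate of 2.0, ends up in the unassigned list rather than diluting either cluster.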
The improved k-means with particle swarm optimization (Alexander Decker)
This document summarizes a research paper that proposes an improved K-means clustering algorithm using particle swarm optimization. It begins with an introduction to data clustering and types of clustering algorithms. It then discusses K-means clustering and some of its drawbacks. Particle swarm optimization is introduced as an optimization technique inspired by swarm behavior in nature. The proposed algorithm uses particle swarm optimization to select better initial cluster centroids for K-means clustering in order to overcome some limitations of standard K-means. The algorithm works in two phases - the first uses particle swarm optimization and the second performs K-means clustering using the outputs from the first phase.
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
The current abundance of electronic documents requires automatic techniques that support the users in understanding their content and extracting useful information. To this aim, improving the retrieval performance must necessarily go beyond simple lexical interpretation of the user queries, and pass through an understanding of their semantic content and aims. It goes without saying that any digital library would take enormous advantage from the availability of effective Information Retrieval techniques to provide to their users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results suggest to go on extending and improving the approach.
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the results returned by search engines in reply to user queries. In this paper, we provide a comprehensive survey of document clustering.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Certain Investigation on Dynamic Clustering in Dynamic Dataminingijdmtaiir
Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects: the clusters must be updated whenever new data records are added to the dataset, which may change the clustering over time. When there is a continuous stream of updates to a huge amount of dynamic data, re-scanning the database is not feasible in static data mining, but it is possible in a dynamic data mining process. Dynamic data mining arises when the derived information must be available for analysis while the environment is dynamic, i.e. many updates occur. Since this need has now been established by most researchers, current research concentrates on solving the problems of mining dynamic databases. This paper surveys existing work on dynamic clustering and incremental data clustering.
Similar to Evaluating the Use of Clustering for Automatically Organising Digital Library Collections (20)
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...pathsproject
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulting from automatic enrichment - Aitor Soroa, Eneko Agirre, Arantxa Otegi and Antoine Isaac
This document is a case study on using the Europeana Data Model (EDM) [Doerr et al., 2010] for representing annotations of Cultural Heritage Objects (CHO). One of the main goals of the PATHS project is to augment CHOs (items) with information that will enrich the user’s experience. The additional information includes links between items in cultural collections and from items to external sources like Wikipedia. With this goal, the PATHS project has applied Natural Language Processing (NLP) techniques on a subset of the items in Europeana.
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...pathsproject
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enrichment, Eneko Agirre, Ander Barrena, Kike Fernandez, Esther Miranda, Arantxa Otegi, and Aitor Soroa, paper presented at the international conference on Theory and Practice in Digital Libraries, TPDL 2013
Large amounts of cultural heritage material are nowadays available through online digital library portals. Most of these cultural items have short descriptions and lack rich contextual information. The PATHS project has developed experimental enrichment services. As a proof of concept, this paper presents a web service prototype which allows independent content providers to enrich cultural heritage items with a subset of the full functionality: links to related items in the collection and links to related Wikipedia articles. In the future we plan to provide more advanced functionality, as available offline for PATHS.
Implementing Recommendations in the PATHS system, SUEDL 2013pathsproject
Implementing Recommendations in the PATHS system, Paul Clough, Arantxa Otegi, Eneko Agirre and Mark Hall, paper presented at the Supporting Users Exploration of Digital Libraries, SUEDL 2013 workshop, during TPDL 2013 in Valetta, Malta
In this paper we describe the design and implementation of non-personalized recommendations in the PATHS system. This system allows users to explore items from Europeana in new ways. Recommendations of the type “people who viewed this item also viewed this item” are powered by pairs of viewed items mined from Europeana. However, due to limited usage data only 10.3% of items in the PATHS dataset have recommendations (4.3% of item pairs visited more than once). Therefore, “related items”, a form of content-based recommendation, are offered to users based on identifying similar items. We discuss some of the problems with implementing recommendations and highlight areas for future work in the PATHS project.
User-Centred Design to Support Exploration and Path Creation in Cultural Her...pathsproject
This document describes research on developing a prototype system to enhance user interaction with cultural heritage collections through a pathway metaphor. It involved gathering user requirements through surveys and interviews. Key findings include:
1) Existing online paths tend to be linear and static, limiting exploration, though users preferred more flexible, theme-based paths that allowed branching.
2) Interviews found the path metaphor could represent search histories, journeys of discovery, linked metadata, guides into collections, routes through collections, and more.
3) An interaction model was developed involving consuming, collecting, creating and communicating about paths to support exploration, learning and engagement.
4) The prototype aims to integrate path creation, use and sharing to better support
Generating Paths through Cultural Heritage Collections Latech2013 paperpathsproject
Generating Paths through Cultural Heritage Collections, Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre. Paper presented at Latech 2013
Cultural heritage collections usually organise sets of items into exhibitions or guided tours. These items are often accompanied by text that describes the theme and topic of the exhibition and provides background context and details of connections with other items. The PATHS project brings the idea of guided tours to digital library collections, where a tool for creating virtual paths is used to assist with navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...pathsproject
Workshop proceedings from the International workshop on Supporting Users Exploration of Digital Libraries, SUEDL 2012 which was held at TPDL 2012 (the international conference on Theory and Practice in Digital Libraries), Paphos, Cyprus, September 2012.
The aim of the workshop was to stimulate collaboration from experts and stakeholders in Digital Libraries, Cultural Heritage, Natural Language Processing and Information Retrieval in order to explore methods and strategies to support exploration of Digital Libraries, beyond the white box paradigm of search and click.
The proceedings includes:
"Browsing Europeana - Opportunities and Challenges', David Haskiya
"Query re-writing using shallow language processing effects', Anna Mastora and Sarantos Kapidakis
"Visualising Television Heritage" Johan Ooman et al,
"Providing suitable information access for new users of Digital Libraries", Rike Brecht et al
"Exploring Pelagios: a Visual Browser for Geo-tagged datasets" Rainer Simon et al
PATHS state of the art monitoring reportpathsproject
This document provides an update to an Initial State of the Art Monitoring report delivered by the project. The report covers the areas of Educational Informatics, Information Retrieval, and Semantic Similarity and Relatedness.
Recommendations for the automatic enrichment of digital library content using...pathsproject
Recommendations for the enrichment of digital library content using open source software, PATHS report by Eneko Agirre and Arantxa Otegi
The goal of this document is to present an overall set of recommendations for the automatic enrichment of Digital Library content using open source software. It is intended to be useful for third-parties who would like to offer enrichment services. Note that this is not a step-by-step guide for reimplementation, but an overall view of the software required and the programming effort involved.
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperpathsproject
Generating Paths through Cultural Heritage Collections Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall and Eneko Agirre.
The PATHS project brings the idea of guided tours to digital library collections, where a tool for creating virtual paths is used to assist with navigation and provide guides on particular subjects and topics. In this paper we characterise and analyse paths of items created by users of our online system.
Generating PATHS through Cultural Heritage Collections, Samuel Fernando, Paula Goodale, Paul Clough, Mark Stevenson, Mark Hall, Eneko Agirre. Presentation given at LaTeCH 2013, ACL Workshop, Sofia, Bulgaria.
The PATHS project is a 3-year EU-funded project involving 6 partners across 5 countries. The project aims to introduce personalized paths into digital cultural heritage collections to provide more engaging access to large volumes of online material. The PATHS system enriches metadata through natural language processing and links items within collections and to external resources. It provides various tools for browsing, searching and creating paths. Two rounds of user evaluations found the path creation tools and search mechanisms were well received. Outcomes include the PATHS API and potential commercialization of components and consultancy services.
This document summarizes the PATHS project, which developed tools for exploring digital cultural heritage collections. The project involved 6 partners across 5 countries. It researched methods for navigating collections, including user-created paths and natural language processing of metadata. Users can browse collections through a thesaurus, tag cloud, or topic map. The system allows users to create and publish nonlinear paths through the collection with descriptions. The tools have potential for classroom activities, curated collections, and research.
Comparing taxonomies for organising collections of documents presentationpathsproject
This document compares different taxonomies for organizing large collections of documents. It evaluates taxonomies that were either manually created (LCSH, WordNet domains, Wikipedia taxonomy, DBpedia ontology) or automatically derived from document data using LDA topic modeling or Wikipedia link frequencies. The document describes applying these taxonomies to a collection of over 550,000 items from Europeana, a digital library. It then evaluates the taxonomies based on how cohesive the groupings are and how accurately the relationships between parent and child nodes are classified.
SemEval-2012 Task 6: A Pilot on Semantic Textual Similaritypathsproject
This document describes the SemEval-2012 Task 6 on semantic textual similarity. The task involved measuring the semantic equivalence of sentence pairs on a scale from 0 to 5. The training data consisted of 2000 sentence pairs from existing paraphrase and machine translation datasets. The test data also had 2000 sentence pairs from these datasets as well as surprise datasets. Systems were evaluated based on their Pearson correlation with human annotations. 35 teams participated and the best systems achieved a Pearson correlation over 80%. This pilot task established semantic textual similarity as an area for further exploration.
A pilot on Semantic Textual Similaritypathsproject
This document summarizes the SemEval 2012 task on semantic textual similarity. It describes the motivation for the task as measuring similarity between text fragments on a graded scale. It then outlines the datasets used, including the MSR paraphrase corpus, MSR video corpus, WMT evaluation data, and OntoNotes word sense data. It also discusses the annotation process, which involved a pilot with authors and crowdsourcing through Mechanical Turk. The results showed most systems performed better than baselines and the best systems achieved correlations over 0.8 with human judgments.
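The evaluation method used in the task is straightforward to reproduce. Below is a minimal sketch of scoring a system by Pearson correlation against human judgments; the scores are made up for illustration, on the 0-5 STS scale.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between system scores and
    human similarity annotations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five sentence pairs on the 0-5 STS scale
human = [5.0, 4.2, 3.1, 1.0, 0.2]
system = [4.8, 4.0, 2.5, 1.5, 0.0]
print(round(pearson(human, system), 3))  # -> 0.982
```

A system this close to the human ranking would sit near the top of the task, where the best systems exceeded a correlation of 0.8.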
PATHS Final prototype interface design v1.0pathsproject
This document summarizes the design methodology and current status of the interface design for the second prototype of the PATHS project. It begins with a three-stage design methodology that includes: evaluating the first prototype design process, creating low-fidelity storyboards, and developing high-fidelity interaction designs. It then reviews lessons learned from developing the first prototype interface. The document introduces new user interface components and presents preliminary high-fidelity designs for key pages like the landing page, path editing, and item pages. Expert evaluation of the designs is planned along with user evaluation of a working prototype. The goal is to address issues identified in prior evaluations and create an intuitive interface for the PATHS cultural heritage system.
PATHS Evaluation of the 1st paths prototypepathsproject
This document summarizes the evaluation of the first prototype of the PATHS project. It describes the evaluation methodology, which included field-based demonstrations and laboratory evaluations. Results are presented from both types of evaluations, including participant demographics, user feedback on ease of use and usefulness of PATHS, suggested improvements, and results from structured tasks conducted in the laboratory evaluations. The document also reviews how well the first PATHS prototype met its functional specifications.
PATHS Second prototype-functional-specpathsproject
This document provides the functional specifications for the second prototype of the PATHS project. It outlines functions to be implemented based on priorities - functions that resolve critical user interface issues are highest priority. The second prototype will address issues identified in testing the first prototype and provide enhanced functionality such as improved search, user accounts/permissions, and workspace customization. The specifications draw on recommendations from evaluating the first prototype to improve the PATHS system.
PATHS Final state of art monitoring report v0_4pathsproject
This document provides an update on the state of the art in several areas relevant to the PATHS project, including educational informatics, information retrieval, semantic similarity, and wikification. In educational informatics, the document discusses different approaches to evaluating cognitive styles and selects Riding's Cognitive Style Analysis as most appropriate for PATHS. In information retrieval, a paper outlines long-term research objectives, including moving beyond ranked lists and helping users. The document also discusses recent advances in measuring semantic textual similarity using new datasets, as well as improved accuracy in wikification and detecting related articles. New sections cover crowdsourcing, including its use for cultural heritage tasks, and the mobile web, focusing on responsive design principles.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer’s life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
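The core idea of discarding uninteresting seed bytes can be illustrated with a toy reduction loop. This is a hypothetical simplification, not the actual DIAR algorithm: here a byte is dropped whenever its removal leaves a stand-in coverage measure unchanged.

```python
def trim_uninteresting_bytes(seed, coverage):
    """Toy sketch of seed reduction: drop each byte whose removal leaves
    the program's observed coverage unchanged, so later fuzzing mutations
    are not wasted on bytes that do not matter."""
    baseline = coverage(seed)
    lean = bytearray(seed)
    i = 0
    while i < len(lean):
        candidate = lean[:i] + lean[i + 1:]
        if coverage(bytes(candidate)) == baseline:
            lean = candidate  # byte was uninteresting; remove it
        else:
            i += 1            # byte matters; keep it and move on
    return bytes(lean)
```

In a real campaign the coverage callback would execute the instrumented target on the candidate seed; here any deterministic function of the input will do for demonstration.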
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only the fact that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to immerse yourself in a story of interoperability, standards and open formats, and then discuss the important role that contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several events, migrations and training activities related to LibreOffice. She previously worked on LibreOffice migrations and training courses for several public administrations and private companies. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not pursuing her passion for computers and for Geeko she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
HCL Notes and Domino License Cost Reduction in the World of DLAU (German-language webinar)panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licenses under the CCB and CCX model have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new type of licensing works and what benefits it brings you. Above all, you certainly want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can lead to more users being counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some practices that can lead to unnecessary expense, e.g. using a person document instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licensing model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep track of what is going on. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future.
These topics will be covered
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
Evaluating the Use of Clustering for Automatically Organising Digital Library Collections
Mark Hall (1,2), Paul Clough (2), and Mark Stevenson (1)

(1) Department of Computer Science, Sheffield University, Sheffield, UK
(m.mhall|r.m.stevenson)@sheffield.ac.uk

(2) Information School, Sheffield University, Sheffield, UK
p.d.clough@sheffield.ac.uk
Abstract. Large digital libraries have become available over the past
years through digitisation and aggregation projects. These large collec-
tions present a challenge to the new user who wishes to discover what is
available in the collections. Subject classification can help in this task,
however in large collections it is frequently incomplete or inconsistent.
Automatic clustering algorithms provide a solution to this, however the
question remains whether they produce clusters that are sufficiently co-
hesive and distinct for them to be used in supporting discovery and ex-
ploration in digital libraries. In this paper we present a novel approach
to investigating cluster cohesion that is based on identifying intruders
in a cluster. The results from a human-subject experiment show that
clustering algorithms produce clusters that are sufficiently cohesive to
be used where no (consistent) manual classification exists.
1 Introduction
Large digital libraries have become available over the past years through digiti-
sation and aggregation projects. These large collections present two challenges to
the new user [22]. The first is resource discovery: finding the collection in the first
place. The second is then discovering what items are present in the collection.
In current systems, support for item discovery is mainly through the standard
search paradigm [27], which is well suited for professional (or expert) users who
are highly familiar with the collections, subject areas, and have specific search
goals.
However, for the novice (or non-expert) user exploring, investigating, and
learning [16, 21] tend to be more useful search modalities. To support these
modalities the items in the collection must be classified according to a relatively
consistent schema, which is frequently not the case.
In many domains no standard classification system exists and, even if it
does, collections are often classified inconsistently. Additionally, where collections
have been formed through aggregation (e.g. in large-scale digital libraries) the
items will frequently be classified using different and incompatible classification
systems. Manual (re-)classification would be the ideal solution, however the time
and expense requirement when dealing with hundreds of thousands or millions
of items means that it is not a viable approach.
Automatic clustering techniques provide a potential solution that can be ap-
plied to large-scale collections where manual classification is not feasible. The
advantage of these techniques is that they automatically derive the cluster struc-
ture from the digital library’s items. On the other hand the quality of the results
can be variable and thus the choice of which clustering technique to employ is
central to providing an improved exploration experience.
The research question posed at the start of this work was: Do automatic
clustering techniques produce clusters that are cohesive enough to be used to
support the exploration of digital libraries? We define a cohesive cluster as one
in which the items in the cluster are similar, while at the same time clearly dis-
tinguishable from items in other clusters. Our paper provides two major contri-
butions in this area. Firstly, we propose a novel variant of the intruder detection
task [6] that enables the measurement of the cohesion of automatically generated
clusters. Secondly, we apply this task to evaluate the cluster model quality of a
number of automatic clustering and topic modelling algorithms.
Our results show that the clusters are sufficiently good to be used in digital
libraries where manually assigned classifications are not available or not con-
sistent. The remainder of the paper is structured as follows: The next section
provides background information on the use of clustering in digital libraries,
their evaluation, and the clustering techniques evaluated in this paper. Section 3
describes the methodology used in the evaluation experiment and section 4 the
experiment results. Section 5 concludes the paper.
2 Background
The issues large, aggregated digital libraries present to the user were first high-
lighted in [22] who suggested manual classification by the user and automated
clustering as approaches for dealing with the large amounts of information pro-
vided by these digital libraries. Since then a number of digital library exploration
interfaces based on clustering documents [26, 9, 8] and search results [11, 28] have
been proposed. Most of these approaches were evaluated in task-based scenarios
and shown to improve task performance, however the cluster quality itself was
not evaluated.
2.1 Cluster Evaluation Metrics
Cluster evaluation has traditionally focused on automatic evaluation metrics.
They are frequently tested on synthetic or manually pre-classified data [17, 1]
or using statistical methods [29, 13]. However, these do not necessarily capture
whether the resulting clusters are cohesive from the user’s perspective.
There have been attempts at using human judgments to quantify the cohesion
of automatic clustering techniques. Mei et al. [18] evaluate the cohesion of Latent
Dirichlet Allocation topics in the context of automatically labelling these topics.
The number of changes evaluators make to a clustering has also been used to
judge cluster cohesion [24].
Chang et al. [6] devised the “intruder detection” task, where evaluators are
shown the top five keywords for an LDA topic to which a keyword from a different
topic is added. They then have to identify the added “intruder” keyword and
the success at identifying the intruder is used as a proxy to evaluate the topic’s
cohesion. The more cohesive a topic, the more obvious it is which keyword is the
intruder. The results of their work have been compared to a number of automatic
similarity algorithms and Pointwise-Mutual-Information (PMI) was identified as
a good predictor for the agreement between the evaluators [20].
2.2 Classification Models
This paper investigates three unsupervised clustering algorithms: Latent Dirich-
let Allocation (LDA) [5], K-Means clustering [14], and OPTICS clustering [2].
Hierarchical and spectral clustering algorithms were also investigated, but not
tested due to them either being too computationally complex for the data-set
size or producing only degenerate clusterings.
Latent Dirichlet Allocation (LDA) is a state-of-the-art topic modelling
algorithm, that creates a mapping between a set of topics T and a set of items I,
where each item i ∈ I is linked to one or more topics t ∈ T . Each item is input
into LDA as a bag-of-words and then represented as a probabilistic mixture of
topics. The LDA model consists of a multinomial distribution of items over topics
where each topic is itself a multinomial distribution over words. The item-topic
and topic-word distributions are learned simultaneously using collapsed Gibbs
sampling based on the item - word distributions observed in the source collection
[10]. LDA has been used to successfully improve result quality in Information
Retrieval [3, 30] tasks and is thus well suited to support exploration in digital
libraries. Although LDA provides multiple topics per item, in this paper items
will only be assigned to their highest-ranking topic and the topics will be referred
to as clusters for consistency with the other algorithms.
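The hard assignment described above is simply an arg-max over each item's topic distribution. A minimal sketch (the probability vectors below are made-up illustrations, not the output of any particular LDA implementation):

```python
def top_topic(topic_probs):
    """Return the index of the highest-probability topic for one item."""
    return max(range(len(topic_probs)), key=lambda t: topic_probs[t])

# Hypothetical item-topic mixtures for three items over four topics.
items = [
    [0.10, 0.70, 0.15, 0.05],
    [0.40, 0.10, 0.30, 0.20],
    [0.05, 0.05, 0.10, 0.80],
]
clusters = [top_topic(p) for p in items]
print(clusters)  # [1, 0, 3]
```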
K-Means Clustering is a frequently used clustering method [31] that takes
one input parameter k and assigns the n input items to k clusters and has been
used in IR [12]. Items are assigned to the clusters in order to maximise the intra-
cluster similarity while minimising the inter-cluster similarity. Cluster similarity
is calculated relative to the cluster’s mean value.
K-Means uses random initial cluster centres and then iteratively improves
these by assigning items to the most similar cluster and moving the cluster
centres to the mean of the items in the cluster.
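The assign-then-update loop can be sketched in a few lines. This is a toy illustration of Lloyd's algorithm on 2-D points with invented data, not the implementation used in the paper:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: random initial centres, then alternate
    assignment (nearest centre) and update (mean of assigned points) steps."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centres[c][0]) ** 2
                                + (p[1] - centres[c][1]) ** 2)
            groups[i].append(p)
        # Update step: move each centre to the mean of its points.
        for i, g in enumerate(groups):
            if g:
                centres[i] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centres, groups

# Two well-separated blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, groups = kmeans(points, k=2)
```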
OPTICS Clustering is a density-based clustering algorithm that does not
directly produce a cluster assignment, but instead provides an ordering for the
items that can then be used to create clusters with arbitrary density thresholds.
The algorithm defines a reachability value for each item which specifies the
distance to the next item in the ordering. Large reachability values represent
the boundaries between clusters and depending on what reachability threshold
is chosen, a larger or smaller number of clusters is generated.
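Turning a reachability ordering into clusters can be illustrated as follows; the reachability values here are invented, and this is only a sketch of the thresholding idea, not the full OPTICS cluster-extraction procedure:

```python
def clusters_from_reachability(reachability, threshold):
    """Cut an OPTICS-style reachability plot at a threshold: positions whose
    reachability exceeds the threshold mark boundaries between clusters."""
    clusters, current = [], []
    for i, r in enumerate(reachability):
        if r > threshold and current:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return clusters

# Hypothetical reachability values for 8 ordered items: two dense runs
# separated by one large jump.
reach = [2.0, 0.1, 0.2, 0.1, 3.0, 0.2, 0.1, 0.3]
print(clusters_from_reachability(reach, threshold=1.0))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Raising the threshold merges clusters (fewer boundaries exceed it); lowering it splits them.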
3 Methodology
In this paper we propose a novel version of the “intruder detection” task that
evaluates the cohesion of the items in a cluster instead of just the cluster’s
keywords. To generate this “intruder detection” task, a cluster (the main cluster)
is chosen at random from the clustering model and four items are chosen at
random from the items allocated to that cluster. A second cluster, the intruder
cluster, is also chosen at random and a random item chosen from that cluster
as the intruder item. The five items, termed a unit, are then shown to the
participants and they are asked to identify the intruder item. For each of the
tested models the evaluation set consisted of 30 such units.
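The unit-generation procedure described above can be sketched as below; the clustering dictionary and item identifiers are hypothetical:

```python
import random

def make_unit(clustering, rng):
    """Build one intruder-detection unit: four items from a randomly chosen
    main cluster plus one intruder item from a different cluster."""
    main, intruder_cluster = rng.sample(list(clustering), 2)
    items = rng.sample(clustering[main], 4)
    intruder = rng.choice(clustering[intruder_cluster])
    unit = items + [intruder]
    rng.shuffle(unit)  # participants see the five items in random order
    return unit, intruder

# Hypothetical clustering: cluster label -> item identifiers.
clustering = {
    "bridges": ["b1", "b2", "b3", "b4", "b5"],
    "castles": ["c1", "c2", "c3", "c4"],
}
unit, intruder = make_unit(clustering, random.Random(1))
```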
3.1 Source Data
The source data used in the experiments is a collection of 28,133 historical im-
ages with meta-data provided by the University of St Andrews Library [7]. The
majority (around 85%) of images were taken before 1940 and span a range of
160 years (1839-1992). The images mainly cover the United Kingdom, however
there are also images taken around the world. Most (89%) of photographs are
black and white, although there are some colour photographs. Of the available
meta-data fields only the title, description, and manually annotated subject
classification are used in the experiments. On average, each image is assigned
to four categories (median=4; mean=4.17; σ = 1.631) and the items’ title and
description tend to be relatively short (word-count: median=23; mean=21.66;
σ = 9.5857). Examples are shown in Figures 1 and 2.
The collection was chosen for two main reasons: first, the collection has
a manually annotated subject classification that provides an evaluation baseline;
second, the data provides a realistic test case, as it was taken from an existing
library archive (enabling the generalisation of results to other digital libraries),
is large enough to make manual classification time-consuming, and at the same
time small enough that it can be processed in a reasonable time-frame.
3.2 Data Preparation
Each item’s title and description were processed using the NLTK [15] to carry out
sentence splitting and tokenization. The resulting bags-of-words are the input
into the three clustering algorithms. All processing was performed on an Intel i7
@1.73 GHz with 8GB RAM. Processing times are shown in Tab. 1.
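The preparation step can be approximated as below. The paper used NLTK for sentence splitting and tokenisation; this dependency-free sketch substitutes a simple regular-expression tokeniser and invented item text:

```python
import re
from collections import Counter

def bag_of_words(title, description):
    """Lower-case and tokenise an item's title and description into a
    bag-of-words (token -> count)."""
    text = f"{title} {description}".lower()
    tokens = re.findall(r"[a-z]+", text)
    return Counter(tokens)

bow = bag_of_words("Melrose Abbey.", "View of the abbey from the south.")
print(bow["abbey"])  # 2
```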
Fig. 1. Example of a cohesive unit taken from the “K-Means TFIDF” model. [Image
captions: “Ottery St Mary Church.”; “Church of Ireland, Killough, Co Down.”;
“Melrose Abbey.”; “Elgin Cathedral.”; “Isle of Arran. Corrie from the Water.”] The
intruder is the last image.
Fig. 2. Example of a non-cohesive unit taken from the “K-Means TFIDF” model.
[Image captions: “Luzern, Lucerne. Untitled.”; “Moorland beside Sand Sike (Syke)
and River Tees; plant habitat.”; “Panorama of the Cairngorm Hills, viewed from
Docharn Craig, Strathspey.”; “Kobresia caricina, plants in flower, on moorland
between Sand Sike (Syke) and River Tees, Teesdale.”; “Luzern, Lucerne. Untitled.”]
The intruder is the middle image.
Model Wall-clock time
LDA 300 clusters 00:21:48
LDA 900 clusters 00:42:42
LDA + PMI 300 clusters 05:05:13
LDA + PMI 900 clusters 17:26:08
K-Means - TFIDF 09:37:40
K-Means - LDA 03:49:04
OPTICS - TFIDF 12:42:13
OPTICS - LDA 05:12:49
Table 1. Processing time (wall-clock) for the tested clustering algorithms and initial-
isation parameters
LDA Two LDA-based clusterings were created using Gensim [23], one with 300
clusters (“LDA 300”), one with 900 clusters (“LDA 900”). The reason for testing
two cluster numbers is that 300 clusters is in line with the number of clusters in
other work using LDA [30]. At the same time our work on visualising clusters has
hinted that clusters with around 30 items work best, which with 28000 items
leads to 900 clusters. Although LDA provides a list of topics with probabilities
for each item, the items are assigned only to their highest-ranking topic in order
to maintain comparability with the clustering results.
Previous work [20] indicates that pointwise mutual information (PMI) acts
as a good predictor of cluster cohesion. A modified cluster assignment model
designed to increase the cohesiveness of the assigned clusters was developed.
This approach is based on repeatedly creating LDA models and only selecting
those clusters that have sufficiently high PMI scores.
The algorithm starts by creating an LDA model for all items using n clusters.
The clusters are then filtered based on the median PMI score of their keywords
t1 . . . t5 (eq. 2), creating the filtered set Tg of “good” clusters (eq. 3). Items
for which the highest ranked cluster t ∈ Tg are assigned to that cluster. A new
LDA model is then calculated, using a reduced number of clusters n − |Tg|, from
the items whose highest ranked cluster t ∉ Tg.
pmi(x, y) = log( p(x, y) / (p(x) p(y)) ) = log( p(x | y) / p(x) )    (1)

coh(t) = median{ pmi(ti, tj) : i, j ∈ 1 . . . 5 ∧ i ≠ j }    (2)

Tg = {t ∈ T : coh(t) > 0}    (3)
This process is repeated until either all items have been assigned to a cluster
or the LDA model contains no clusters with a coh (t) > 0, in which case all items
are assigned to their highest ranked cluster and the algorithm terminates. Two
models using this algorithm with 300 and 900 clusters were created (“LDA +
PMI 300” and “LDA + PMI 900”).
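The coherence filter of eqs. 1 and 2 can be sketched as follows, estimating the probabilities from document frequencies over a toy corpus of invented keyword sets:

```python
import math
from statistics import median
from itertools import combinations

def pmi(word_x, word_y, docs):
    """Pointwise mutual information of two words (eq. 1), with document
    frequency over `docs` as the probability estimate."""
    n = len(docs)
    px = sum(word_x in d for d in docs) / n
    py = sum(word_y in d for d in docs) / n
    pxy = sum(word_x in d and word_y in d for d in docs) / n
    if pxy == 0:
        return float("-inf")  # words never co-occur
    return math.log(pxy / (px * py))

def coh(keywords, docs):
    """Median pairwise PMI of a cluster's top keywords (eq. 2)."""
    return median(pmi(x, y, docs) for x, y in combinations(keywords, 2))

# Toy corpus: "abbey" and "ruin" always co-occur; "harbour" appears alone.
docs = [{"abbey", "ruin"}, {"abbey", "ruin"}, {"harbour"}, {"boat"}]
good = coh(["abbey", "ruin"], docs)    # positive: keywords co-occur
bad = coh(["abbey", "harbour"], docs)  # -inf: keywords never co-occur
```

Clusters with `coh > 0` would be kept in the filtered set Tg (eq. 3); the rest are re-clustered in the next iteration.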
K-Means Two k-means classifications were produced, both with 900 clus-
ters. The first used term-frequency / inverse-document-frequency (TFIDF) vec-
tors, calculated from the items’ bags-of-words, to define each item (“K-Means
TFIDF”). As Tab. 1 shows the time required to create this model was very high,
thus a faster k-means clustering was created using the item-topic probabilities
from a 900-topic LDA model to define each item (“K-Means LDA”).
OPTICS The OPTICS clustering used the same input data as the k-means
clustering, creating two models (“OPTICS TFIDF” and “OPTICS LDA”). The
reachability threshold required to create a fixed set of clusters was automatically
determined for both models using an unsupervised binary search algorithm.
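The paper does not detail this search, so the following is a plausible sketch under the assumption that the cluster count decreases monotonically as the reachability threshold rises; the reachability values are invented:

```python
def cluster_count(reachability, threshold):
    """Number of clusters produced when the reachability plot is cut at a
    given threshold (boundaries are points whose reachability exceeds it)."""
    boundaries = sum(1 for r in reachability[1:] if r > threshold)
    return boundaries + 1

def find_threshold(reachability, target, lo=0.0, hi=None, iters=50):
    """Binary search for a threshold yielding the target number of clusters.
    Raising the threshold merges clusters, so the count is monotone."""
    hi = max(reachability) if hi is None else hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cluster_count(reachability, mid) > target:
            lo = mid   # too many clusters: raise the threshold
        else:
            hi = mid   # few enough clusters: lower it
    return hi

reach = [2.0, 0.1, 0.2, 3.0, 0.2, 0.1, 4.0, 0.3]
t = find_threshold(reach, target=2)
```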
Upper- and Lower-bound Data An upper bound data-set was created based
on the manually annotated subject classifications provided in the original meta-
data. This classification has a total of 936 distinct clusters from which the 30
tested units were selected using the random algorithm described above.
To aid in the interpretation of the results a lower bound was determined
statistically as the number of cohesive clusters where the binomial likelihood
of seeing that number of correctly identified units out of 30 is less than 5%,
resulting in a lower bound of 3 correctly identified units.
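The paper does not spell out the exact chance probability behind this bound, so the following is only a sketch of this style of calculation, using an illustrative per-unit false-positive rate of 5% rather than the authors' parameters:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def lower_bound(n_units, p_chance, alpha=0.05):
    """Smallest count k of units such that seeing k or more out of n_units
    by chance alone is less likely than alpha."""
    for k in range(n_units + 1):
        if binom_sf(k, n_units, p_chance) < alpha:
            return k
    return n_units + 1

# Illustrative: 30 units, each spuriously judged cohesive 5% of the time.
k = lower_bound(30, 0.05)
```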
Control Data To ensure that the participants took the task seriously and did
not simply select an answer at random, a set of 10 control units were created.
These were randomly selected from the manual subject classifications and then
manually filtered to ensure that the intruder was as obvious as possible.
3.3 Experimental Set-up
The experiment was constructed using an in-house crowdsourcing interface. In
the experiment each unit was displayed as a list of five images with their cap-
tions, and five radio buttons that the participants used to choose the intruder.
Participants were shown five units on one page, one of which was always a con-
trol unit. The four model units were randomly sampled from the full list of units
(the model units and upper bound units). The sampling took into account the
number of judgements already gathered for the units to ensure a relatively even
distribution of judgments. The experiment was run using a population recruited
from staff and students at our university.
A total of 821 people participated in the experiment, producing a total of
10,706 ratings. 121 participants answered less than half of the control questions
they saw correctly and have thus been excluded from the analysis, reducing the
number of ratings analysed to 8,840, with each unit rated between 21 and 30
times, with the median number of ratings at 27. The large variation is due to
how the filtered participants’ ratings were distributed, but has no impact on the
results as the evaluation metric takes the number of samples into account.
3.4 Evaluating Cohesion
The human judgements were analysed to determine which units were judged to
be cohesive. The metric used to determine cohesiveness is strict. A unit is judged
to be cohesive if the correct intruder is chosen significantly more frequently than
by chance and if the answer distribution is significantly different from a uniform
distribution (Fig. 1). The first aspect is tested using a binomial distribution and
testing whether the likelihood of seeing the observed number of correct intruder
judgements relative to the total number of judgements for the unit by chance is
less than 5%. This does not necessarily guarantee a cohesive cluster, as it does
not take into account the distribution of the remaining answers. If these were
evenly distributed, then even though the intruder was detected by a significant
number of participants, the remaining participants were evenly split and thus the
unit cannot be classified as cohesive. A χ2 -test was used to determine whether
the answer distribution was significantly different (p < 0.05) from the uniform
distribution. If both conditions hold then the unit and with it the main cluster
the unit was derived from, are said to be cohesive.
In addition, a second metric was used to judge whether units were borderline
cohesive. A unit (and its main cluster) are defined as borderline cohesive if the
total number of judgements allocated to two of the five possible answers makes
up more than 95% of all judgements for that unit and one of the two answers
is the intruder. This covers the case where in addition to the intruder item
there is a second item that could also be the intruder. The main cluster such a
unit is derived from might not be ideal, but is probably acceptable to the user,
especially as the evaluation will show that even manually created clusters can
be non-cohesive. All remaining units are classified as “non-cohesive” (Fig. 2).
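The two-test metric above can be sketched as follows. The chi-squared critical value for 4 degrees of freedom at p = 0.05 is hard-coded, and the answer counts are invented for illustration:

```python
from math import comb

# Critical value of the chi-squared distribution at p = 0.05 with
# 4 degrees of freedom (5 possible answers - 1).
CHI2_CRIT_DF4 = 9.488

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def classify_unit(counts, intruder_idx):
    """Classify a unit from its five answer counts as 'cohesive',
    'borderline' or 'non-cohesive' (sketch of the metric in section 3.4)."""
    n = sum(counts)
    expected = n / 5
    # Binomial test: intruder chosen significantly more often than
    # the 1-in-5 chance rate.
    binom_ok = binom_sf(counts[intruder_idx], n, 1 / 5) < 0.05
    # Chi-squared test: answer distribution differs from uniform.
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    if binom_ok and chi2 > CHI2_CRIT_DF4:
        return "cohesive"
    # Borderline: two answers, one being the intruder, hold >= 95%
    # of all judgements.
    top2 = sorted(range(5), key=lambda i: counts[i], reverse=True)[:2]
    if intruder_idx in top2 and counts[top2[0]] + counts[top2[1]] >= 0.95 * n:
        return "borderline"
    return "non-cohesive"

print(classify_unit([25, 1, 1, 1, 2], intruder_idx=0))  # cohesive
```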
4 Results
Table 2 shows the number of “cohesive”, “borderline” and “non-cohesive” clus-
ters per model. The results clearly show that k-means clustering based on TFIDF
produces the most cohesive clusters. LDA with a large number of clusters also
works well. The impact of filtering by PMI seems to be negligible. OPTICS
clustering does not work well on the tested data-set.
Model Cohesive Borderline Non-Cohesive
Upper-bound 27 0 3
Lower-bound 3 0 27
LDA 300 clusters 15 6 9
LDA 900 clusters 20 4 6
LDA + PMI 300 clusters 16 4 10
LDA + PMI 900 clusters 21 2 7
K-Means - TFIDF 24 3 3
K-Means - LDA 20 0 10
OPTICS - TFIDF 14 2 14
OPTICS - LDA 16 0 14
Table 2. Experiment results for the various clustering algorithms and initialisation
parameters. “Cohesive” lists the number of clusters where the intruder was consistently
identified, “borderline” the number of clusters with two potential intruders, and “non-
cohesive” the number of clusters that are neither “cohesive” nor “borderline”.
Table 1 shows the time required to generate each of the clusterings. The
pure LDA models are fastest, while LDA + PMI with 900 clusters is the slowest
algorithm. OPTICS and k-means lie between these extremes. Using LDA item-
topic distributions for item similarity is faster than using TFIDF vectors.
4.1 Discussion
All models show a clear improvement on the lower bound and can thus be said
to provide at least some benefit if no manual classification is available. However,
the OPTICS and “LDA 300” models achieve cohesion for only about 50% of the
clusters. OPTICS is clearly not a good choice for the type of data tested, as it
is either slower or less cohesive than the other techniques.
The upper bound achieves a very high score (90% of clusters cohesive), how-
ever even here there are three units that were not cohesive and these were further
investigated (Tab. 3). The non-cohesive unit #1 is mis-classified in the original
data and the intruder item was from the same geographic area as the main clus-
ter items, making it impossible to determine which is the intruder. That this was
picked up by the experiment participants is a good indication that the “intruder
detection” task can distinguish cohesive from non-cohesive clusters. Analysis of
the other two non-cohesive units shows that for both neither the image nor the
caption provide sufficient information to identify the intruder. However, when
the classification label is known, then the intruder can be determined. What this
implies is that as long as there is some kind of logical link between the items, a
certain amount of variation between the items is acceptable to human classifiers.
# Main cluster Intruder cluster
1 Renfrews all views Isle of May all views
2 Garages - commercial Colleges - technical
3 Soldiers Belfries
Table 3. Subject labels attached to the non-cohesive upper-bound units
“K-Means TFIDF” is clearly the best of the models, achieving cohesion for
80% of the units and including the 3 borderline cohesive units pushes the rate up
to 90%, matching the manual classification. Post-hoc analysis of the three
borderline units shows that in all three cases the main cluster item also identified as
an intruder is linked to the other main cluster items via the description text,
which the participants did not see. This means that those three clusters should
also be acceptable when the users have access to all of the items’ meta-data.
The drawback with k-means is the long processing time required to create the
model. Using the LDA document-cluster distribution instead of TFIDF leads to
a significant reduction in processing time, however the quality of the resulting
model suffers and is lower than the pure LDA results. K-Means is thus only viable
for smaller collections, although the exact limit depends on what optimisations
[25] can be achieved through improved initialisation [4] or parallelisation [32].
Processing speed is the strength of the pure LDA models, with the best-
performing “LDA 900” model faster than “K-Means TFIDF” by a factor of
almost 14 (Tab. 1). It does not achieve quite as good classification results (66%
of units cohesive), but for data-sets that are too large to be clustered using k-
means it is a viable alternative. Of the four borderline units two were of similar
type as the k-means borderline units, however in the other two the item from
the main cluster identified as the intruder had no connection to the other main
cluster items, leading to a total of 22 (73%) acceptable clusters.
The results of the LDA models also show the necessity for models where
the individual clusters do not have too many items. The “LDA 300” and “LDA
+ PMI 300” models are significantly worse than their 900-cluster counterparts.
This also validates the use of 900 clusters in the k-means and OPTICS models.
While [19, 20] show that PMI is a good predictor for the inter-annotator
agreement, using it to filter clusters shows only minimal improvement. For both
the 300 and 900 cluster models a single additional cohesive unit is created.
Additionally in the 900 cluster case, PMI reduces the number of borderline
units. Considering the increase in processing time this is not a viable method.
5 Conclusion
Large digital libraries have become available over the past years through digitisa-
tion and aggregation projects. These large collections present a challenge to the
new user who wishes to discover what is available in the collections. Supporting
this discovery task would benefit from a consistent classification system, which is
frequently not available. Manual re-classification is prohibitively expensive and
time consuming. Automatic cluster models provide an alternative method for
quickly generating classifications.
This paper investigated whether clustering algorithms can generate cohesive
clusters, where a cohesive cluster is one in which the items in the cluster are
similar, while at the same time being distinguishably different from items in other
clusters. Latent Dirichlet Allocation (LDA), K-Means clustering, and OPTICS
clustering were investigated. To enable the comparison we proposed a novel
version of the “intruder detection” task, where the experiment participants have
to identify an item taken from a cluster and inserted into a set of four items
taken from a different cluster. The results show that this task provides a good
measurement for the cohesion of cluster models and can successfully identify
non-cohesive clusters and mis-classifications.
Using this evaluation metric we showed that k-means clustering on TFIDF
vectors produces the highest number of cohesive clusters, but is computation-
ally intensive and thus only viable for smaller collections. LDA-based models
with large cluster numbers provide the best cohesion – processing time trade-
off, allowing them to be applied to large digital libraries. We believe that both
algorithms create models where a sufficiently large number of clusters are cohe-
sive to allow them to be used where no (consistent) classification is available. We
intend to investigate if post-processing the clusters can further improve cohesion.
Acknowledgements
The research leading to these results has received funding from the European
Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement
n◦ 270082. We acknowledge the contribution of all project partners involved
in PATHS (see: http://www.paths-project.eu).
References
1. Amigó, E., Gonzalo, J., Artiles, J., and Verdejo, F. A comparison of extrinsic
clustering evaluation metrics based on formal constraints. Information Retrieval
12 (2009), 461–486. 10.1007/s10791-008-9066-8.
2. Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. Optics: or-
dering points to identify the clustering structure. SIGMOD Rec. 28, 2 (June 1999),
49–60.
3. Azzopardi, L., Girolami, M., and van Rijsbergen, C. Topic based language
models for ad hoc information retrieval. In Neural Networks, 2004. Proceedings.
2004 IEEE International Joint Conference on (July 2004), vol. 4, pp. 3281–3286.
4. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., and Vassilvitskii, S.
Scalable k-means++. In VLDB 2012 (2012).
5. Blei, D. M., Griffiths, T., Jordan, M., and Tenenbaum, J. Hierarchical
topic models and the nested chinese restaurant process. In NIPS (2003).
6. Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. M. Read-
ing tea leaves: How humans interpret topic models. In NIPS (2009).
7. Clough, P., Sanderson, M., and Reid, N. The Eurovision St Andrews collection
of photographs. ACM SIGIR Forum 40, 1 (2006), 21–30.
8. Eklund, P., Goodall, P., and Wray, T. Cluster-based navigation for a virtual
museum. In Adaptivity, Personalization and Fusion of Heterogeneous Information
(Paris, France, France, 2010), RIAO ’10, Le Centre de Hautes Etudes Interna-
tionales d’Informatique Documentaire, pp. 211–212.
9. Granitzer, M., Kienreich, W., Sabol, V., Andrews, K., and Klieber, W.
Evaluating a system for interactive exploration of large, hierarchically structured
document repositories. In Information Visualization, 2004. INFOVIS 2004. IEEE
Symposium on (2004), pp. 127–134.
10. Griffiths, T., and Steyvers, M. Finding scientific topics. In Proceedings
of the National Academy of Sciences (2004), vol. 101, pp. 5228–5235.
11. Handl, J., and Meyer, B. Improved ant-based clustering and sorting in a
document retrieval interface. In Parallel Problem Solving from Nature — PPSN VII,
J. Guervós, P. Adamidis, H.-G. Beyer, H.-P. Schwefel, and J.-L. Fernández-Villacañas,
Eds., vol. 2439 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg,
2002, pp. 913–923. 10.1007/3-540-45712-7_88.
12. Hassan-Montero, Y., and Herrero-Solana, V. Improving tag-clouds as visual
information retrieval interfaces. In Proceedings InfoSciT (2006).
13. He, J., Tan, A.-H., Tan, C.-L., and Sun, S.-Y. On quantitative evaluation
of clustering systems. In Information Retrieval and Clustering. Kluwer Academic
Publishers, 2003, pp. 105–133.
14. Lloyd, S. P. Least square quantization in pcm. IEEE Transactions on Informa-
tion Theory 28, 2 (1982), 129–137.
15. Loper, E., and Bird, S. Nltk: the natural language toolkit. In Proceedings of
the ACL-02 Workshop on Effective tools and methodologies for teaching natural
language processing and computational linguistics - Volume 1 (Stroudsburg, PA,
USA, 2002), ETMTNLP ’02, Association for Computational Linguistics, pp. 63–70.
16. Marchionini, G. Exploratory search: From finding to understanding. Communi-
cations of the ACM 49, 4 (2006), 41–46.
17. Maulik, U., and Bandyopadhyay, S. Performance evaluation of some clustering
algorithms and validity indices. Pattern Analysis and Machine Intelligence, IEEE
Transactions on 24, 12 (dec 2002), 1650 – 1654.
18. Mei, X. S., and Zhai, C. Automatic labeling of multinomial topic models. In
Proceedings of KDD 2007 (2007), pp. 490–499.
19. Newman, D., Karimi, S., and Cavedon, L. External evaluation of topic models.
In Proceedings of the 14th Australasian Document Computing Symposium (2009),
pp. 11–18.
20. Newman, D., Noh, Y., Talley, E., Karimi, S., and Baldwin, T. Evaluating
topic models for digital libraries. In JCDL 2010 (2010).
21. Pirolli, P. Powers of 10: Modeling complex information-seeking systems at mul-
tiple scales. Computer 42, 3 (2009), 33–40.
22. Rao, R., Pedersen, J. O., Hearst, M. A., Mackinlay, J. D., Card, S. K.,
Masinter, L., Halvorsen, P.-K., and Robertson, G. C. Rich interaction in
the digital library. Commun. ACM 38, 4 (Apr. 1995), 29–39.
23. Řehůřek, R., and Sojka, P. Software Framework for Topic Modelling with
Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks (Valletta, Malta, May 2010), ELRA, pp. 45–50.
http://is.muni.cz/publication/884893/en.
24. Roussinov, D. G., and Chen, H. Document clustering for electronic meetings:
an experimental comparison of two techniques. Decision Support Systems 27, 1–2
(1999), 67–79.
25. Sculley, D. Web-scale k-means clustering. In WWW 2010 (2010).
26. Song, M. Bibliomapper: a cluster-based information visualization technique. In
Information Visualization, 1998. Proceedings (1998), pp. 130–136.
27. Sutcliffe, A., and Ennis, M. Towards a cognitive theory of information retrieval.
Interacting with Computers 10 (1998), 321–351.
28. van Ossenbruggen, J., Amin, A., Hardman, L., Hildebrand, M., van Assem,
M., Omelayenko, B., Schreiber, G., Tordai, A., de Boer, V., Wielinga,
B., Wielemaker, J., de Niet, M., Taekema, J., van Orsouw, M.-F., and
Teesing, A. Searching and annotating virtual heritage collections with semantic-
web technologies. In Museums and the Web 2007 (2007).
29. Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation
methods for topic models. In Proceedings of the 26th International Conference
on Machine Learning (2009).
30. Wei, X., and Croft, W. B. Lda-based document models for ad-hoc retrieval. In
Proceedings of the 29th annual international ACM SIGIR conference (New York,
NY, USA, 2006), SIGIR ’06, ACM, pp. 178–185.
31. Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H.,
McLachlan, G., Ng, A., Liu, B., Yu, P., Zhou, Z.-H., Steinbach, M., Hand,
D., and Steinberg, D. Top 10 algorithms in data mining. Knowledge and Infor-
mation Systems 14 (2008), 1–37. 10.1007/s10115-007-0114-2.
32. Zhao, W., Ma, H., and He, Q. Parallel k-means clustering based on mapreduce.
In Cloud Computing, M. Jaatun, G. Zhao, and C. Rong, Eds., vol. 5931 of Lec-
ture Notes in Computer Science. Springer Berlin / Heidelberg, 2009, pp. 674–679.
10.1007/978-3-642-10665-1_71.