This document provides a survey of optimization approaches that have been applied to text document clustering. It discusses several clustering algorithms and categorizes them as partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, frequent pattern-based clustering, and constraint-based clustering. It then describes several soft computing techniques that have been used as optimization approaches for text document clustering, including genetic algorithms, bees algorithms, particle swarm optimization, and ant colony optimization. These optimization techniques perform a global search to improve the quality and efficiency of document clustering algorithms.
IJERA (International Journal of Engineering Research and Applications) is an international, online, ... peer-reviewed journal. For more details or to submit an article, please visit www.ijera.com
Particle Swarm Optimization based K-Prototype Clustering Algorithm (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publication of high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
Data mining is used to manage the huge amounts of information stored in data warehouses and databases and to discover the required knowledge from them. Numerous data mining techniques have been proposed, such as association rules, decision trees, neural networks, and clustering, and the field has been a focus of attention for many years. Among the available data mining strategies, clustering is one of the most widely used. It groups a dataset into a number of clusters based on certain predefined criteria, and it can be relied on to discover the relationships between the different characteristics of the data.
In the k-means clustering algorithm, features are selected on the basis of their relevance for predicting the data, and the Euclidean distance between a cluster's centroid and the data objects outside the cluster is computed when clustering the data points. In this work, the authors enhanced the Euclidean distance formula to increase cluster quality.
The problem of accuracy, and of redundant dissimilar points within clusters, remains in the improved k-means, so a new enhanced approach is proposed that uses a similarity function to check the similarity level of each point before including it in a cluster.
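The abstract does not give the enhanced distance formula itself; as a baseline for comparison, a plain k-means with the standard Euclidean distance can be sketched as follows (the deterministic seeding and the toy data are illustrative assumptions, not the authors' method):

```python
import math

def euclidean(a, b):
    # Standard Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    # Plain k-means: assign each point to its nearest centroid, then
    # recompute each centroid as the mean of its members.  The first k
    # points seed the centroids (k-means++ is the usual choice in practice).
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                           for d in range(dim)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

A similarity-checking variant, as described above, would add a threshold test on `euclidean(p, centroids[idx])` before admitting a point to a cluster.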
Ensemble based Distributed K-Modes Clustering (IJERD Editor)
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is a growing need for effective approaches to distributed clustering. A distributed clustering algorithm clusters distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets, but it fails to directly handle datasets with categorical attributes, which commonly occur in real-life datasets. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces the means of clusters with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble-based Distributed K-Modes clustering algorithm is proposed, which is well suited both to handling categorical datasets and to performing the distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI Machine Learning Repository.
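Huang's matching dissimilarity and the frequency-based mode update described above can be sketched as follows (a minimal, non-distributed illustration; the records, initial modes, and function names are assumptions, not the paper's code):

```python
from collections import Counter

def mismatch(a, b):
    # Huang's simple matching dissimilarity: the number of attributes
    # on which two categorical records differ.
    return sum(1 for x, y in zip(a, b) if x != y)

def update_mode(cluster):
    # The mode takes, per attribute, the most frequent category in the cluster.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(records, modes, iters=10):
    # Minimal K-Modes: assign each record to the nearest mode under the
    # matching dissimilarity, then recompute modes from attribute frequencies.
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in modes]
        for r in records:
            idx = min(range(len(modes)), key=lambda i: mismatch(r, modes[i]))
            clusters[idx].append(r)
        new_modes = [update_mode(c) if c else modes[i]
                     for i, c in enumerate(clusters)]
        if new_modes == list(modes):   # converged
            break
        modes = new_modes
    return modes, clusters
```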
Multilevel techniques for the clustering problem (csandit)
Data mining is concerned with the discovery of interesting patterns and knowledge in data repositories. Cluster analysis, which belongs to the core methods of data mining, is the process of discovering homogeneous groups called clusters. Given a dataset and some measure of similarity between data objects, the goal of most clustering algorithms is to maximize both the homogeneity within each cluster and the heterogeneity between different clusters. In this work, two multilevel algorithms for the clustering problem are introduced. The multilevel paradigm views the clustering problem as a hierarchical optimization process that moves through different levels, evolving from a coarse-grain to a fine-grain strategy. The clustering problem is solved by first reducing it, level by level, to a coarser problem on which an initial clustering is computed. The clustering of the coarser problem is then mapped back, level by level, to obtain a better clustering of the original problem by refining the intermediate clusterings obtained at the various levels. A benchmark using a number of datasets collected from a variety of domains is used to compare the effectiveness of the hierarchical approach against its single-level counterpart.
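The coarsen/cluster/refine cycle described above can be illustrated with a small sketch (the nearest-neighbour pairing used for coarsening and the trivial round-robin seeding at the coarsest level are assumptions chosen for brevity, not the authors' algorithms):

```python
import math

def coarsen(points):
    # Pair each point with its nearest unmatched neighbour and replace the
    # pair by its midpoint; parent[i] records which coarse point the fine
    # point i was contracted into.
    parent, coarse, used = {}, [], set()
    for i, p in enumerate(points):
        if i in used:
            continue
        used.add(i)
        best = min((j for j in range(len(points)) if j not in used),
                   key=lambda j: math.dist(p, points[j]), default=None)
        if best is None:                      # odd point out: copy it up
            parent[i] = len(coarse)
            coarse.append(p)
        else:
            used.add(best)
            parent[i] = parent[best] = len(coarse)
            coarse.append(tuple((a + b) / 2 for a, b in zip(p, points[best])))
    return coarse, parent

def refine(points, labels, k, rounds=5):
    # Local refinement at one level: recompute centroids, reassign points.
    for _ in range(rounds):
        cents = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            cents.append(tuple(sum(v) / len(members) for v in zip(*members))
                         if members else points[c % len(points)])
        labels = [min(range(k), key=lambda c: math.dist(p, cents[c]))
                  for p in points]
    return labels

def multilevel_cluster(points, k, levels=2):
    # Coarsen level by level, cluster the coarsest level with a trivial
    # round-robin seeding, then project the labels back down and refine
    # at every level (assumes the coarsest level still has >= k points).
    stack, cur = [], points
    for _ in range(levels):
        coarse, parent = coarsen(cur)
        stack.append((cur, parent))
        cur = coarse
    labels = refine(cur, [i % k for i in range(len(cur))], k)
    while stack:
        fine, parent = stack.pop()
        labels = refine(fine, [labels[parent[i]] for i in range(len(fine))], k)
    return labels
```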
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Textual Data Partitioning with Relationship and Discriminative Analysis (Editor IJMTER)
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. Hierarchical clustering models produce tree-structured results, whereas partitional clustering produces flat, grid-style results. Text documents are unstructured data with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically when an unsuitable cluster count is chosen.
Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to poor clustering solutions. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model, and DPMFP clustering is performed without requiring the number of clusters as input.
Document labels are used to guide the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
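DPMFP itself relies on variational inference, which is beyond a short sketch, but the clustering property of the Dirichlet Process that lets the number of clusters emerge from the data can be illustrated with a Chinese-restaurant-process draw (the value of alpha and the sampling scheme are illustrative assumptions):

```python
import random

def crp_partition(n, alpha, seed=0):
    # Chinese-restaurant-process draw: the clustering prior behind a
    # Dirichlet Process Mixture.  Item i joins an existing cluster with
    # probability proportional to its size, or opens a new cluster with
    # probability proportional to alpha, so the number of clusters is
    # not fixed in advance.
    rng = random.Random(seed)
    sizes = []    # sizes of clusters formed so far
    labels = []
    for i in range(n):
        # Total unnormalised weight is (sum of sizes) + alpha = i + alpha.
        r = rng.random() * (i + alpha)
        acc, choice = 0.0, len(sizes)
        for j, w in enumerate(sizes + [alpha]):
            acc += w
            if r < acc:
                choice = j
                break
        if choice == len(sizes):
            sizes.append(1)          # open a new cluster
        else:
            sizes[choice] += 1       # join an existing one
        labels.append(choice)
    return labels, sizes
```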
Text documents clustering using modified multi-verse optimizer (IJECEIAES)
In this study, a multi-verse optimizer (MVO) is utilised for the text document clustering (TDC) problem. TDC is treated as a discrete optimization problem, and an objective function based on the Euclidean distance is applied as the similarity measure. TDC is tackled by dividing the documents into clusters; documents belonging to the same cluster are similar, whereas those belonging to different clusters are dissimilar. MVO, a recent metaheuristic optimization algorithm established for continuous optimization problems, can intelligently navigate different areas of the search space and search deeply within each area using a particular learning mechanism. The proposed algorithm, called MVOTDC, adapts the convergence behaviour of the MVO operators to deal with discrete, rather than continuous, optimization problems. For evaluating MVOTDC, a comprehensive comparative study is conducted on six text document datasets with various numbers of documents and clusters. The quality of the final results is assessed using precision, recall, F-measure, entropy, accuracy, and purity measures. Experimental results reveal that the proposed method performs competitively in comparison with state-of-the-art algorithms. Statistical analysis is also conducted and shows that MVOTDC can produce significantly better results in comparison with three well-established methods.
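The Euclidean objective that a metaheuristic such as MVO would minimize over candidate assignments can be sketched as follows (the function name and the toy vectors are assumptions; real TDC would use TF-IDF document vectors):

```python
import math

def tdc_objective(docs, assignment, k):
    # Fitness of one candidate solution: each document is a numeric vector
    # and assignment[i] is the cluster index of document i.  The score is
    # the total Euclidean distance of documents to their cluster centroids;
    # a metaheuristic searches over assignments to minimise it.
    total = 0.0
    for c in range(k):
        members = [d for d, a in zip(docs, assignment) if a == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(d, centroid) for d in members)
    return total
```

A good assignment (similar documents together) scores lower than a poor one, which is exactly the signal the optimizer exploits.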
A Novel Clustering Method for Similarity Measuring in Text Documents (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed, online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis... (ijsrd.com)
A cluster is a group of objects that are similar to each other within the cluster and dissimilar to the objects of other clusters. Similarity is typically calculated on the basis of the distance between two objects or clusters: objects belong to the same cluster only if they are close to each other under that distance. The major objective of clustering is to discover collections of comparable objects based on a similarity metric. Fuzzy Possibilistic C-Means (FPCM) is an effective clustering algorithm for unlabeled data that produces both membership and typicality values during the clustering process. In this approach, the efficiency of the Fuzzy Possibilistic C-Means clustering approach is enhanced by using penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differs from conventional clustering techniques by imposing a possibilistic reasoning strategy on fuzzy clustering, with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approach is evaluated on University of California, Irvine (UCI) machine learning repository datasets such as Iris, Wine, Lung Cancer, and Lymphography. The parameters used for the evaluation are clustering accuracy, Mean Squared Error (MSE), execution time, and convergence behavior.
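PCFPCM's penalized and compensated update rules are not given in the abstract; the base fuzzy c-means membership update that FPCM builds on can be sketched as follows (the function name and the fuzzifier value m=2 are assumptions):

```python
import math

def fcm_memberships(points, centers, m=2.0):
    # Base fuzzy c-means membership update: the membership of point x in
    # cluster j falls off with its relative distance to center j, and the
    # memberships of each point sum to 1.  (FPCM additionally produces
    # typicality values, and PCFPCM modifies these updates with penalty
    # and compensation terms.)
    U = []
    for x in points:
        d = [max(math.dist(x, c), 1e-12) for c in centers]  # guard zero distance
        row = [1.0 / sum((d[j] / d[l]) ** (2.0 / (m - 1.0))
                         for l in range(len(centers)))
               for j in range(len(centers))]
        U.append(row)
    return U
```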
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields of Engineering and Technology.
Survey on traditional and evolutionary clustering approaches (eSAT Journals)
Abstract: Clustering deals with grouping similar objects. Unlike classification, clustering tries to group a set of objects and find whether there is some relationship between them, whereas in classification a set of predefined classes is known and it is enough to find which class an object belongs to. Simply put, classification is a supervised learning technique and clustering is an unsupervised learning technique. Clustering techniques apply when there is no class to be predicted but the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which the instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances. Clustering naturally requires techniques different from classification and association learning methods. Clustering has many applications in various fields; in software engineering it helps in reverse engineering, software maintenance, and system re-building, aiming to break a larger problem into small, understandable pieces. Since clustering has no single methodology, many traditional as well as evolutionary methods are available for carrying it out. In this paper, various types of the above-mentioned methods are described and some of them are compared. Each method has its own advantages, and methods can be chosen according to the needs of the user. Keywords: Clustering, Classification, Software Engineering, Traditional, Evolutionary.
Clustering of high-dimensional data, which can be seen in almost all fields these days, is becoming a very tedious process. The key disadvantage of high-dimensional data is the curse of dimensionality: as the magnitude of datasets grows, the data points become sparse and the density of any region decreases, making it difficult to cluster the data and reducing the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data. In particular, some methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information. In this paper, we propose a kernel approach for semi-supervised clustering and present in detail two special cases of this kernel approach.
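The HMRF-style objective mentioned above, within-cluster scatter under squared Euclidean distance plus penalties for violated pairwise constraints, can be sketched as follows (the uniform penalty weight and the function name are assumptions):

```python
import math

def penalized_objective(points, labels, must_link, cannot_link, w=10.0):
    # HMRF-style semi-supervised objective (squared Euclidean case):
    # within-cluster scatter plus a penalty w for every violated
    # must-link pair (split across clusters) and every violated
    # cannot-link pair (placed in the same cluster).
    k = max(labels) + 1
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:
            continue
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(math.dist(p, centroid) ** 2 for p in members)
    total += w * sum(1 for i, j in must_link if labels[i] != labels[j])
    total += w * sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return total
```

A constrained clustering algorithm would search for the labeling that minimises this quantity; the kernelised form replaces the squared distances with kernel evaluations.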
A Survey on Constellation Based Attribute Selection Method for High Dimension... (IJERA Editor)
Attribute selection is an important topic in data mining because it is an effective way of reducing dimensionality, removing irrelevant data, removing redundant data, and increasing the accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces results compatible with those of the original, entire set of attributes. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar in some sense to each other than to those in other groups (clusters). There are various approaches and techniques for attribute subset selection, namely the wrapper approach, the filter approach, the Relief algorithm, distributional clustering, etc. However, each has some disadvantages, such as an inability to handle large volumes of data, computational complexity, no guarantee of accuracy, difficulty of evaluation, and poor redundancy detection. To address some of these issues in attribute selection, this paper proposes a technique that aims to design an effective clustering-based attribute selection method for high-dimensional data. Initially, attributes are divided into clusters using a graph-based clustering method, the minimum spanning tree (MST). In the second step, the most representative attribute, the one most strongly related to the target classes, is selected from each cluster to form a subset of attributes. The purpose is to increase the level of accuracy, reduce dimensionality, shorten training time, and improve generalization by reducing overfitting.
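The MST-based first step can be sketched as follows: build a minimum spanning tree over the attribute graph and cut its heaviest edges to obtain attribute clusters (the use of Kruskal's algorithm and a precomputed distance matrix are assumptions; the paper does not specify the MST construction):

```python
def mst_attribute_clusters(dist, k):
    # Kruskal's MST over a complete attribute graph, given a symmetric
    # distance matrix `dist` (e.g. 1 - |correlation| between attributes).
    # Cutting the k-1 heaviest MST edges leaves k attribute clusters.
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    mst = []
    for w, i, j in edges:
        if find(i) != find(j):
            parent[find(i)] = find(j)
            mst.append((w, i, j))
    # Rebuild components using only the n-k lightest MST edges,
    # i.e. cut the k-1 heaviest ones.
    parent = list(range(n))
    for w, i, j in sorted(mst)[: n - k]:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

The second step would then pick, from each attribute cluster, the attribute with the highest relevance score to the target class.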
Automatic classification in information retrieval (Basma Gamal)
Automatic classification in information retrieval: automatic classification of documents.
Chapter 3 from IR_VAN_Book
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN B.Sc., Ph.D., M.B.C.S.
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu... (ijcsa)
Collecting data at the front end and loading it into a database manually may introduce errors into datasets and is a very time-consuming process. Scanning a data document as an image and recognizing the corresponding information in that image can be considered a possible solution to this challenge. This paper presents an automated solution to the problem of data cleansing and of recognizing user-written data so as to transform it into a standard printed format with the help of artificial neural networks. Three different neural models, namely direct, correlation-based, and hierarchical, have been developed to handle this issue. The solution is developed to justify the proposed logic even in a very hostile input environment.
Expert system of single magnetic lens using JESS in Focused Ion Beam (ijcsa)
This work presents an expert system for the symmetrical single magnetic lens used in a focused ion beam optical system. Java Expert System Shell (JESS) programming is proposed to build the intelligent agent "MOPTION" for obtaining an optimum magnetic flux density and calculating the ion optical trajectory. The combination of such a rule-based engine and SIMION 8.1 configures the reconstruction process and compiles the data retrieved by the proposed expert system agent to implement the pole-piece reconstruction for lens design. The pole-piece reconstruction is presented as a 3D graph, and under the infinite-magnification conditions of the optical path, the aberration disk diameters (spherical, chromatic, and total) have been obtained, with values of 0.03, 0.13, and 0.133 micron (μm), respectively.
A Novel Clustering Method for Similarity Measuring in Text DocumentsIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
A Novel Penalized and Compensated Constraints Based Modified Fuzzy Possibilis...ijsrd.com
A cluster is a group of objects which are similar to each other within a cluster and are dissimilar to the objects of other clusters. The similarity is typically calculated on the basis of distance between two objects or clusters. Two or more objects present inside a cluster and only if those objects are close to each other based on the distance between them.The major objective of clustering is to discover collection of comparable objects based on similarity metric. Fuzzy Possibilistic C-Means (FPCM) is the effective clustering algorithm available to cluster unlabeled data that produces both membership and typicality values during clustering process. In this approach, the efficiency of the Fuzzy Possibilistic C-means clustering approach is enhanced by using the penalized and compensated constraints based FPCM (PCFPCM). The proposed PCFPCM approach differ from the conventional clustering techniques by imposing the possibilistic reasoning strategy on fuzzy clustering with penalized and compensated constraints for updating the grades of membership and typicality. The performance of the proposed approaches is evaluated on the University of California, Irvine (UCI) machine repository datasets such as Iris, Wine, Lung Cancer and Lymphograma. The parameters used for the evaluation is Clustering accuracy, Mean Squared Error (MSE), Execution Time and Convergence behavior.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology
Survey on traditional and evolutionary clustering approaches (eSAT Journals)
Abstract: Clustering deals with grouping similar objects. Unlike classification, clustering tries to group a set of objects and find whether there is some relationship between the objects, whereas in classification a set of predefined classes is known and it is enough to find which class an object belongs to. Simply put, classification is a supervised learning technique and clustering is an unsupervised learning technique. Clustering techniques apply when there is no class to be predicted but rather when the instances are to be divided into natural groups. These clusters presumably reflect some mechanism at work in the domain from which the instances are drawn, a mechanism that causes some instances to bear a stronger resemblance to each other than they do to the remaining instances. Clustering naturally requires different techniques from the classification and association learning methods. Clustering has many applications in various fields. In software engineering it helps in reverse engineering, software maintenance and re-building systems, aiming to break a larger problem into small, understandable pieces. Since clustering has no single methodology, there are many traditional as well as evolutionary methods available for carrying it out. In this paper various types of the above-mentioned methods are described and some of them are compared. Each method has its own advantages, and they can be used according to the needs of the user. Keywords: Clustering, Classification, Software Engineering, Traditional, Evolutionary.
Clustering of high-dimensionality data, which can be seen in almost all fields these days, is becoming a very tedious process. The key disadvantage of high-dimensional data is the curse of dimensionality: as the magnitude of a dataset grows, the data points become sparse and the density of any region becomes low, making it difficult to cluster the data and reducing the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data. In particular, some methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information. In this paper, we propose a kernel approach for semi-supervised clustering and present in detail two special cases of this kernel approach.
A Survey on Constellation Based Attribute Selection Method for High Dimension... (IJERA Editor)
Attribute selection is an important topic in data mining, because it is an effective way of reducing dimensionality, removing irrelevant data, removing redundant data, and increasing the accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces results compatible with those of the original entire set of attributes. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar in some sense to each other than to those in other groups (clusters). There are various approaches and techniques for attribute subset selection, namely the wrapper approach, the filter approach, the Relief algorithm, distributional clustering, etc. But each of them has some disadvantages, such as being unable to handle large volumes of data, computational complexity, accuracy not being guaranteed, and difficulty in evaluation and redundancy detection. To overcome some of these issues in attribute selection, this paper proposes a technique that aims to design an effective clustering-based attribute selection method for high-dimensional data. Initially, attributes are divided into clusters by using a graph-based clustering method such as the minimum spanning tree (MST). In the second step, the most representative attribute that is strongly related to the target classes is selected from each cluster to form a subset of attributes. The purpose is to increase the level of accuracy, reduce dimensionality, shorten training time and improve generalization by reducing overfitting.
Automatic classification in information retrieval (Basma Gamal): automatic classification of documents.
Chapter 3 from IR_VAN_Book
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN B.Sc., Ph.D., M.B.C.S.
Front End Data Cleaning And Transformation In Standard Printed Form Using Neu... (ijcsa)
Manual front-end data collection and loading into a database can introduce errors into data sets and is a very time-consuming process. Scanning a data document as an image and recognizing the corresponding information in that image can be considered a possible solution to this challenge. This paper presents an automated solution to the problem of data cleansing and recognition of user-written data, transforming it into a standard printed format with the help of artificial neural networks. Three different neural models, namely direct, correlation-based and hierarchical, have been developed to handle this issue. The solution is designed to justify the proposed logic even in a very hostile input environment.
Expert system of single magnetic lens using JESS in Focused Ion Beam (ijcsa)
This work presents an expert system for the symmetrical single magnetic lens used in a focused ion beam optical system. Java Expert System Shell (JESS) programming is proposed to build the intelligent agent "MOPTION" for obtaining an optimum magnetic flux density and calculating the ion optical trajectory. The combination of this rule-based engine and SIMION 8.1 configured the reconstruction process and compiled the data retrieved by the proposed expert system agent to implement the pole-piece reconstruction for lens design. The pole-piece reconstruction is presented as a 3D graph, and under the infinite-magnification conditions of the optical path, the spherical, chromatic and total aberration disk diameters were obtained, with values of 0.03, 0.13 and 0.133 micron (μm) respectively.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem... (IJECEIAES)
Data analysis plays a prominent role in interpreting various phenomena. Data mining is the process of hypothesizing useful knowledge from extensive data. Based upon classical statistical prototypes, the data can be exploited beyond mere storage and management. Cluster analysis, a primary investigation with little or no prior knowledge, spans research and development across a wide variety of communities. Cluster ensembles are a melange of individual solutions obtained from different clusterings, combined to produce the final high-quality clustering required in wider applications. The method arises in the context of increasing robustness, scalability and accuracy. This paper gives a brief overview of the generation methods and consensus functions included in cluster ensembles, and surveys the various techniques and cluster ensemble methods.
An Heterogeneous Population-Based Genetic Algorithm for Data Clustering (ijeei-iaes)
As a primary data mining method for knowledge discovery, clustering is a technique for partitioning a dataset into groups of similar objects. The most popular method for data clustering, k-means, suffers from the drawback of requiring the number of clusters and their initial centers to be provided by the user. In the literature, several methods have been proposed, in the form of k-means variants, genetic algorithms, or combinations of the two, for calculating the number of clusters and finding proper cluster centers. However, none of these solutions has provided satisfactory results, and determining the number of clusters and the initial centers remains the main challenge in clustering processes. In this paper we present an approach that automatically generates these parameters to achieve optimal clusters, using a modified genetic algorithm operating on varied individual structures and using a new crossover operator. Experimental results show that our modified genetic algorithm is a more efficient alternative to the existing approaches.
Comparison Between Clustering Algorithms for Microarray Data Analysis (IOSR Journals)
Currently, there are two techniques used for large-scale gene-expression profiling: microarray and RNA-Sequencing (RNA-Seq). This paper is intended to study and compare different clustering algorithms used in microarray data analysis. A microarray is an array of DNA molecules which allows multiple hybridization experiments to be carried out simultaneously and traces the expression levels of thousands of genes. It is a high-throughput technology for gene expression analysis and has become an effective tool for biomedical research. Microarray analysis aims to interpret the data produced from experiments on DNA, RNA, and protein microarrays, which enable researchers to investigate the expression state of a large number of genes. Data clustering is the first and main process in microarray data analysis. The k-means, fuzzy c-means, self-organizing map, and hierarchical clustering algorithms are under investigation in this paper. These algorithms are compared based on their clustering models.
Data mining, or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how does one decide what constitutes a good clustering? It can be shown that there is no absolute "best" criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, one could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection). Of late, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the results returned by search engines in reply to a user's query. In this paper, we provide a comprehensive survey of document clustering.
Recent Trends in Incremental Clustering: A Review (IOSRjournaljce)
This paper presents a review of recent trends in incremental clustering algorithms. It focuses both on clustering based on a similarity measure and on clustering not based on a similarity measure. In this context, the paper covers various typical incremental clustering algorithms, mainly their optimization, genetic and fuzzy approaches. The paper is original in one respect: it provides a complete overview fully devoted to evolutionary algorithms for incremental clustering. A number of references are provided that describe applications of evolutionary algorithms for incremental clustering in different domains, such as human activity detection, online fault detection, information security, and tracking an object consistently throughout a network by solving the boundary problem.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
International Journal on Computational Sciences & Applications (IJCSA) Vol.3, No.6, December 2013
A SURVEY ON OPTIMIZATION APPROACHES TO
TEXT DOCUMENT CLUSTERING
R. Jensi¹ and Dr. G. Wiselin Jiji²
¹Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, India
²Dr. Sivanthi Aditanar College of Engineering, Tiruchendur, India
ABSTRACT
Text document clustering is one of the fastest growing research areas because of the availability of a huge amount of information in electronic form. A number of techniques have been introduced for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide a localized search for effectively navigating, summarizing, and organizing information. A globally optimal solution can be obtained by applying high-speed, high-quality optimization algorithms, which perform a globalized search in the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is presented.
KEYWORDS
Text Clustering, Vector Space Model, Latent Semantic Indexing, Genetic Algorithm, Particle Swarm
Optimization, Ant-Colony Optimization, Bees Algorithm.
1. INTRODUCTION
With the advancement of World Wide Web technologies, huge amounts of rich and dynamic information are available. With web search engines, a user can quickly browse and locate documents. Usually a search engine returns many documents, a lot of which are relevant to the topic but some of which are irrelevant or of poor quality. Cluster analysis, or clustering, plays an important role in organizing such a massive number of documents returned by search engines into meaningful clusters. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters. Document clustering is closely related to data clustering, which is under vigorous development. Cluster analysis has its roots in many research areas, including data mining, information retrieval, pattern recognition, web search, statistics, biology and machine learning. Even though a lot of significant research effort has been devoted to it [3,15,20,22,26,33], the main challenge in clustering remains improving the quality of the clustering process.
In machine learning, clustering [3] is an example of unsupervised learning. In the context of machine learning, clustering contrasts with supervised learning (or classification). Classification assigns the data objects in a collection to target categories or classes; its main task is to precisely predict the target class for each instance in the data, so a classification algorithm requires training data. Unlike classification, clustering does not require training data and does not assign any predefined label to each group. Clustering groups a set of objects and finds whether there is some relationship between the objects.
DOI:10.5121/ijcsa.2013.3604
As stated by [3], [29], Data clustering algorithms can be broadly classified into following
categories:
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Frequent pattern-based clustering
Constraint-based clustering
With partitional clustering, the algorithm creates a set of non-overlapping subsets (clusters) of the data such that each data object is in exactly one subset. These approaches require selecting a value for the desired number of clusters to be generated. A few popular heuristic clustering methods are k-means and its variant bisecting k-means, k-medoids, PAM (Kaufman and Rousseeuw, 1987), CLARA (Kaufmann and Rousseeuw, 1990), CLARANS (Ng and Han, 1994), etc.
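The partitional procedure just described (assign each object to its nearest center, then recompute the centers) can be sketched as a minimal k-means in plain Python; the 2-D points and k=2 below are purely illustrative, not taken from any cited system.

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # Component-wise mean of a non-empty list of points.
    n = len(cluster)
    return tuple(sum(xs) / n for xs in zip(*cluster))

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct initial centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean(cl) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        if new_centroids == centroids:         # converged
            break
        centroids = new_centroids
    assignment = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
    return centroids, assignment

# Two obvious groups of 2-D points (illustrative data).
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (9.0, 9.0), (9.1, 9.2), (8.9, 9.1)]
centroids, labels = kmeans(data, k=2)
```

Note that, as the survey observes for partitioning methods generally, k (the desired number of clusters) must be supplied up front, and the result depends on the random initial centers.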
With hierarchical clustering, the algorithm creates a nested set of clusters that are organized as a tree. Such hierarchical algorithms can be agglomerative or divisive. Agglomerative algorithms, also called bottom-up algorithms, initially treat each object as a separate cluster and successively merge the pair of clusters that are closest to one another to create new clusters, until all of the clusters are merged into one. Divisive algorithms, also called top-down algorithms, start with all of the objects in the same cluster, and in each successive iteration a cluster is split using a flat clustering algorithm, recursively, until each object is in its own singleton cluster. Popular hierarchical methods are BIRCH [38], ROCK [39], Chameleon [37] and UPGMA.
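The agglomerative (bottom-up) loop can be sketched directly from the description above: start with singletons and repeatedly merge the closest pair. This toy version uses the single-link distance on 1-D data and stops at a target number of clusters; it is an illustration of the idea, not any of the cited systems (BIRCH, ROCK, Chameleon, UPGMA).

```python
def single_link_agglomerative(points, target_k):
    """Agglomerative clustering: start with singleton clusters and
    repeatedly merge the closest pair until target_k clusters remain."""
    clusters = [[p] for p in points]               # each object starts alone
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link
        # (minimum pairwise) distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]    # merge cluster j into i
        del clusters[j]
    return clusters

groups = single_link_agglomerative([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], target_k=3)
```

Run to completion (target_k=1) the merge sequence forms the cluster tree (dendrogram) the text describes; stopping early simply cuts that tree at a chosen level.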
An experimental study of hierarchical and partitional clustering algorithms in [37] showed that the bisecting k-means technique works better than both the standard k-means approach and the hierarchical approaches.
Density-based clustering methods can group data objects into clusters of arbitrary shape. Clustering is done according to density (the number of objects in a region), i.e. density-based connectivity. Popular density-based methods are DBSCAN and its extension OPTICS, and DENCLUE [3].
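The density-based idea behind DBSCAN can be sketched compactly: a point with at least min_pts neighbours within radius eps is a core point and seeds a cluster, which then grows through density-connected core points; points in no dense region are noise. The 1-D data and parameter values below are illustrative only.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN on 1-D data: returns one label per point
    (cluster ids starting at 0, or -1 for noise)."""
    labels = [None] * len(points)
    cluster_id = -1

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:            # not dense enough: tentatively noise
            labels[i] = -1
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:                       # grow by density-connectivity
            j = queue.pop()
            if labels[j] == -1:            # former "noise" becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:     # j is also core: expand further
                queue.extend(j_nbrs)
    return labels

data = [0.0, 0.5, 1.0, 1.5, 10.0, 10.5, 11.0, 50.0]  # two dense runs + an outlier
labels = dbscan(data, eps=1.0, min_pts=2)
```

Unlike the partitioning methods above, no cluster count is supplied: the two dense runs are found automatically and the isolated point is labeled noise (-1).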
Grid-based clustering methods use a multiresolution grid structure to cluster the data objects. The benefit of this approach is its fast processing time. Some examples include STING and WaveCluster.
Model-based methods use a model for each cluster and determine the fit of the data to the given model. They can also be used to automatically determine the number of clusters. Expectation-Maximization, COBWEB and SOM (Self-Organizing Map) are typical examples of model-based methods.
Frequent pattern-based clustering uses patterns extracted from subsets of dimensions to group the data objects. An example of this method is pCluster.
Constraint-based clustering methods perform clustering based on user-specified or application-specific constraints. They impose the user's constraints on the clustering, such as the user's requirements or expected properties of the desired clustering results.
A soft clustering algorithm such as fuzzy c-means [26], [42] has been applied in [30], [31], [43] for high-dimensional text document clustering. It allows a data object to belong to two or more clusters with different degrees of membership. Fuzzy c-means is based on minimizing a dissimilarity function: in each iteration it finds the cluster centroids that minimize the objective (dissimilarity) function [20].
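For concreteness, the objective that fuzzy c-means minimizes can be written in its standard textbook form (the notation here is the common one, not taken from the cited references): with N data objects x_i, c cluster centroids v_j, memberships u_ij, and fuzzifier m > 1,

```latex
J_m(U, V) \;=\; \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{\,m}\, \lVert x_i - v_j \rVert^2,
\qquad \text{subject to } \sum_{j=1}^{c} u_{ij} = 1 \;\; \forall i,
```

and each iteration alternates the standard update equations

```latex
u_{ij} \;=\; \left[ \sum_{k=1}^{c}
      \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert} \right)^{\!\frac{2}{m-1}} \right]^{-1},
\qquad
v_j \;=\; \frac{\sum_{i=1}^{N} u_{ij}^{\,m}\, x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}} .
```

The memberships u_ij being real values in [0, 1], rather than hard 0/1 assignments, is exactly what lets an object belong to several clusters at once.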
2. SOFT COMPUTING TECHNIQUES
Soft computing [40] is the foundation of conceptual intelligence in machines. It is an approach to constructing systems that are computationally intelligent, possess human-like expertise in a particular domain, adapt to a changing environment, can learn to do better, and can explain their decisions. Unlike hard computing, soft computing is tolerant of imprecision, uncertainty, partial truth, and approximation. In effect, the role model for soft computing is the human mind.
The realm of soft computing encompasses fuzzy logic and other meta-heuristics such as genetic algorithms, neural networks, simulated annealing, particle swarm optimization, ant colony systems and parts of learning theory, as shown in Figure 2.1 (Chen 2002). Soft computing techniques have received increasing attention in recent years for their interesting characteristics and their success in solving problems in a number of fields.
The components of soft computing are shown in Figure 1.
Soft computing techniques have been applied to text document clustering as an optimization problem. This active field has developed many hybrid optimization techniques, such as PSO and ACO combined with FCM and k-means [6, 17, 18, 41], to further improve the efficiency of document clustering.
In traditional clustering techniques (k-means and its variants, FCM, etc.), some parameters, such as the number of clusters, must be known in advance. By using optimization techniques, clustering quality can be improved.
The most popularly used optimization techniques for document clustering are listed below:
2.1 Genetic Algorithm (GA)
The Genetic Algorithm (GA) [12] is a very popular evolutionary algorithm, pioneered by John
Holland in the 1970s. GAs are designed to build artificial systems that retain the robustness of
natural biological evolution. Genetic algorithms belong to the search techniques that mimic the
principle of natural selection. The performance of GA is influenced by the following three
operators:
1. crossover
2. mutation
3. selection
The selection operator is based on a fitness function: parents are selected for mating (i.e.
recombination or crossover) according to their fitness. Through the selection operator, the better
chromosomes are carried into the next population and the others are eliminated. Popular methods
for selecting the best chromosomes are roulette wheel selection, tournament selection, Boltzmann
selection, steady-state selection and rank selection. Crossover is then applied: it combines genes
from parent chromosomes to create the new population, with the crossover point selected with
probability Pc. After crossover is performed,
mutation takes place. Mutation randomly modifies bits, with a probability Pm. The working
principle of GA is given below:
1. Generate initial population;
2. Evaluate population;
3. Repeat until stopping criterion is met
{
a. Select parents for reproduction using selection operator;
b. Perform crossover and mutation using genetic operator;
c. Evaluate population fitness function;
}
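The working principle above can be sketched as a toy Python GA that maximizes a simple bit-string fitness (onemax, the number of 1 bits). Tournament selection stands in for the roulette wheel variant, elitism keeps the two best chromosomes, and Pc/Pm appear as pc and pm; all parameter values are illustrative:

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, pc=0.8, pm=0.02,
                      generations=60, seed=1):
    """Toy GA over bit strings illustrating selection, crossover, mutation."""
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]                      # elitism: keep the two best
        while len(nxt) < pop_size:
            # Tournament selection of two parents.
            p1 = max(random.sample(pop, 3), key=fitness)
            p2 = max(random.sample(pop, 3), key=fitness)
            c1, c2 = p1[:], p2[:]
            if random.random() < pc:          # single-point crossover, prob Pc
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):            # bit-flip mutation, prob Pm per bit
                for i in range(length):
                    if random.random() < pm:
                        child[i] ^= 1
                nxt.append(child)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

best = genetic_algorithm(sum)  # onemax: fitness = number of 1 bits
```

For document clustering the bit string would be replaced by an encoding of cluster assignments or centroids, and the fitness by a clustering criterion.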
2.2 Bees Algorithm (BA)
Another population-based search algorithm is the bees algorithm [35]. The bees algorithm [44] is
an optimization algorithm first developed in 2005, based on the natural food-foraging behaviour
of honey bees. First, an initial population of solutions is created and evaluated with a fitness
function. Based on this evaluation, the solutions with the highest fitness values are selected for
neighbourhood search, and more bees are assigned to search near the best solutions. Then, for
each patch, the fittest solution is selected to build the new population. The remaining bees are
allocated randomly around the search space, scouting for new possible solutions. These steps are
repeated until a stopping criterion is met.
The Algorithm steps are given below:
1. Generate initial population solutions randomly (n).
2. Evaluate fitness of the population.
3. Continue below steps until stopping criterion is met
{
a. Choose the solutions with the highest fitness (m) and select them for neighborhood search.
b. Recruit more bees to search near the highest-fitness solutions and evaluate their fitness.
c. Select the fittest bee from each patch.
d. Remaining bees are assigned for random search and do the evaluation.
}
Proper values are set for the parameters n and m.
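As an illustration, the steps above can be sketched in Python for minimizing a simple continuous function. The neighbourhood radius ngh and its shrinking schedule, and all parameter values, are illustrative choices rather than part of the original formulation:

```python
import random

def bees_algorithm(f, bounds, n=20, m=5, e=2, nep=7, nsp=3,
                   ngh=0.5, iters=60, seed=2):
    """Sketch of the bees algorithm minimizing f over a 2-D box.
    n scout bees, m selected sites (e of them elite), nep/nsp bees
    recruited per elite/non-elite site, ngh neighbourhood radius."""
    random.seed(seed)
    lo, hi = bounds
    rand_sol = lambda: [random.uniform(lo, hi) for _ in range(2)]
    pop = [rand_sol() for _ in range(n)]
    for _ in range(iters):
        pop.sort(key=f)                        # best (lowest f) first
        new_pop = []
        for i, site in enumerate(pop[:m]):     # selected sites
            recruits = nep if i < e else nsp   # more bees near elite sites
            patch = [site] + [
                [min(hi, max(lo, x + random.uniform(-ngh, ngh))) for x in site]
                for _ in range(recruits)]
            new_pop.append(min(patch, key=f))  # fittest bee of each patch
        new_pop += [rand_sol() for _ in range(n - m)]  # random scouts
        pop = new_pop
        ngh *= 0.95                            # shrink the neighbourhood
    return min(pop, key=f)

sphere = lambda p: sum(x * x for x in p)       # minimum 0 at the origin
best = bees_algorithm(sphere, (-5.0, 5.0))
```

In a document clustering setting f would instead score a candidate set of cluster centroids.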
2.3 Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) [14], [15] is also an evolutionary computation technique,
inspired by the flocking behaviour of birds. A PSO algorithm starts with a population of
candidate solutions. The population is called a swarm, and each solution is called a particle.
Initially, particles are assigned a random position and a random velocity. The position of each
particle has a value given by the objective function. Particles are moved around in the search
space. While moving, each particle remembers the best solution it has achieved so far, called the
individual best fitness (pbest), and the position of that solution, called the individual best fitness
position. It also tracks the best value obtained so far by any particle in the population, referred to
as the global best fitness (gbest), whose position is referred to as the global best fitness position.
As improved positions are found, they come to guide the movements of the swarm. In each
iteration, particles are updated according to the two "best" values, pbest and gbest. After
calculating the two best values, each particle updates its velocity and position using the
following equations:
vel[ ] = vel[ ] + l1 * rand() * (pbest[ ] - current[ ]) + l2 * rand() * (gbest[ ] - current[ ])
current[ ] = current[ ] + vel[ ]
where
vel[ ]     : the particle velocity
current[ ] : the current particle (solution)
pbest[ ]   : personal best value
gbest[ ]   : global best value
rand()     : random number between (0,1)
l1, l2     : learning factors.
The following steps are performed in the PSO algorithm:
1. Initialize each particle in the population with a random position and velocity.
2. Repeat the following steps until the stopping criterion is met.
i. for each particle
{
Calculate the fitness function value;
If it is superior to the particle's best fitness value, assign it as the new pbest;
}
ii. Select the particle with the best fitness value among all particles and assign it as gbest;
iii. for each particle
{
Calculate the particle velocity;
Update the position of the particle;
}
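The update equations and steps above can be sketched in Python. An inertia weight of 0.7 on the old velocity is added, a common PSO refinement not present in the bare equations, so the sketch converges; the sphere function stands in for a clustering objective:

```python
import random

def pso(f, dim=2, n_particles=15, iters=100, l1=1.5, l2=1.5, seed=3):
    """Sketch of global-best PSO minimizing f; l1 and l2 are the
    learning factors from the velocity update equation."""
    random.seed(seed)
    pos = [[random.uniform(-5, 5) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                # individual best positions
    gbest = min(pbest, key=f)[:]               # global best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (0.7 * vel[i][d]
                             + l1 * random.random() * (pbest[i][d] - pos[i][d])
                             + l2 * random.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):        # update individual best
                pbest[i] = pos[i][:]
        gbest = min(pbest, key=f)[:]           # update global best
    return gbest

sphere = lambda p: sum(x * x for x in p)
best = pso(sphere)
```

In PSO-based document clustering a particle typically encodes a full set of cluster centroids and f measures intra-cluster distance.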
2.4 Ant Colony Optimization (ACO)
Ant-based clustering algorithms model natural ant behaviour. Ant Colony Optimization was
initially inspired by Deneubourg in 1990, aiming to search for an optimal path in a graph based
on the behaviour of ants.
In nature, ants find the shortest path between their nest and a source of food; to mark their paths,
ants deposit a substance called pheromone. The ant colony optimization algorithm [21] presents a
random search method. Initially, a two-dimensional m*m space is considered, whose size is
selected large enough to hold the clustered elements. The number of ants and the value of m are
determined by the number of data objects to be clustered; for example, if n is the number of data
elements to be clustered, then m = 4n and the number of ants is n/3. The iterative process builds
stacks of elements, based on the idea that an ant collects similar adjacent elements; larger stacks
are then created by combining smaller stacks. In the end, the final stacks are the clusters.
The following sections discuss the document representation model for clustering, the weighting
scheme, evaluation criteria and dimension reduction. Optimization approaches related to
document clustering are then surveyed in a separate section.
3. DOCUMENT CLUSTERING
Clustering of documents is used to group documents into relevant topics. The major difficulty in
document clustering is its high dimensionality, which requires efficient algorithms that can
handle high-dimensional clustering. Document clustering is a major topic in the information
retrieval area; search engines are one example. The basic steps of the document clustering
process are shown in figure 2.
Figure 2.Flow diagram for representing basic Steps in text clustering
3.1 Preprocessing
Text document preprocessing basically consists of stripping all formatting from the article,
including capitalization, punctuation, and extraneous markup (like the dateline and tags).
Then the stop words are removed. Stop words (i.e. pronouns, prepositions, conjunctions, etc.)
are words that do not carry semantic meaning. They can be eliminated using a stop-word list,
which greatly reduces the amount of noise in the text collection and makes the computation
easier. Removing stop words leaves a condensed version of the documents containing content
words only.
The next step is stemming, the process of reducing derived words to their root form. For English
documents, a popular algorithm, the Porter
stemmer [7], is used. The performance of text clustering can be improved by using the Porter
stemmer.
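The preprocessing pipeline can be sketched in Python. The naive_stem function below is a crude, illustrative stand-in for the Porter stemmer (which has many more rules), and the stop-word list is a small sample:

```python
import re

# Small illustrative stop-word list; real lists contain hundreds of words.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "or", "in",
              "to", "for", "it", "this", "that", "with", "on", "by"}

def naive_stem(word):
    """Strip a few common suffixes -- a toy substitute for Porter [7]."""
    for suffix in ("ization", "ational", "ing", "ed", "es", "s", "ly"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Strip punctuation/markup and capitalization, then tokenize.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words and stem the remaining content words.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The ants are clustering documents by similarity."))
# -> ['ant', 'cluster', 'document', 'similarity']
```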
3.2 Text Document Encoding
The next step is to encode the text documents. In general, documents are transformed into a
document-term matrix (DTM), a mathematical matrix whose rows are documents and whose
columns are terms. A simple DTM is the vector space model [1], which is widely used in IR and
text mining [23] to represent text documents. It is used in indexing, information retrieval and
relevancy ranking, and can be successfully used in the evaluation of search results from web
search engines.
Let D = (D1, D2, . . . ,DN) be a collection of documents and T=(T1, T2, . . . , TM) be the collection
of terms of the document collection D, where N is the total number of documents in the collection
and M is the number of distinct terms. In this model each document Di is represented by a point
in an M-dimensional vector space, Di = (wi1, wi2, . . . , wiM), i = 1, . . . , N, where the dimension
is the total number of unique terms in the document collection. Many schemes have been
proposed for measuring the wij values, also known as (term) weights. One of the more advanced
term weighting schemes is tf-idf (term frequency-inverse document frequency) [23]. The tf-idf
scheme balances the local and the global weighting of a term in a document and is calculated by
wij = tfij × log(N / dfj)    (1)
where tfij is the frequency of term j in document i, and dfj denotes the number of documents in
which term j appears. The component log(N/dfj), often called the idf factor, defines the global
weight of term j.
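Equation (1) can be computed directly in Python from tokenized documents; the function name and return structure are illustrative:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w_ij = tf_ij * log(N / df_j) as in Eq. (1).
    docs: list of token lists; returns one {term: weight} dict per doc."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    # df_j: number of documents in which term j appears.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    weights = []
    for d in docs:
        tf = Counter(d)                        # tf_ij: raw term frequency
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

Note that a term appearing in every document receives an idf factor of log(1) = 0, so it contributes nothing to the representation.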
3.3 Dimension reduction techniques
Dimension reduction can be divided into feature selection and feature extraction. Feature
selection is the process of selecting a smaller subset of features from a larger set of inputs, while
feature extraction transforms the high-dimensional data space into a space of lower dimension.
The goal of dimension reduction methods is to allow fewer dimensions for broader comparisons
of the concepts contained in a text collection.
In this paper, one dimension reduction technique is discussed.
3.3.1 Latent Semantic Indexing (LSI)
A popular text retrieval technique is Latent Semantic Indexing (LSI), which uses Singular Value
Decomposition (SVD) to identify patterns in the relationships between the terms and concepts
contained in a collection of text. The singular value decomposition reduces the dimensions by
selecting the dimensions with the highest singular values. LSI can be used effectively for text
processing because it handles the polysemy and synonymy in the text. A key feature of LSI is its
ability to retain the latent structure in word usage, which improves clustering efficiency.
A term-document matrix X (M×N) is first constructed, assuming there are M distinct terms and
N documents. The singular value decomposition computes the term and document vectors by
decomposing the matrix X into three matrices P, S and Q, which is given by
X = P S QT    (2)
where
P : left singular vector matrix
Q : right singular vector matrix
S : diagonal matrix of singular values.
LSI approximates X with a rank-k matrix:
Xk = Pk Sk QkT    (3)
where Pk is the first k columns of the matrix P, QkT is the first k rows of the matrix QT, and
Sk = diag(s1, . . . , sk) contains the k largest singular values.
When LSI is used for text document clustering, a document Di is represented by [11]
Di = DiT Pk    (4)
Then the text corpus can be organized by another representation of the document-term matrix
D (N×M), and the corpus matrix is given by
C = D Pk    (5)
3.4 Similarity Measurement
The similarity (or dissimilarity) between objects is typically computed based on the distance
between document pairs. The most popular distance measures, as stated by [3], are:
Table 1. Formulas for Similarity Measurement
Euclidean distance: d(i,j) = sqrt( Σk (xik − xjk)² ), where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn)
Manhattan distance: d(i,j) = Σk |xik − xjk|
Minkowski distance: d(i,j) = ( Σk |xik − xjk|^p )^(1/p), where p is a positive integer
Jaccard coefficient: J(A,B) = |A ∩ B| / |A ∪ B|, where A and B are documents (as sets of terms)
Cosine similarity: S(x,y) = xT y / (‖x‖ ‖y‖), where xT is the transposition of vector x and ‖x‖, ‖y‖ are the Euclidean norms of vectors x and y
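The measures in Table 1 translate directly into plain Python:

```python
import math

def euclidean(x, y):
    # sqrt of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # generalizes Euclidean (p=2) and Manhattan (p=1)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def jaccard(a, b):
    # documents as sets of terms: |A ∩ B| / |A ∪ B|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(x, y):
    # dot product divided by the product of Euclidean norms
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x))
                  * math.sqrt(sum(b * b for b in y)))
```

Cosine similarity is the most common choice for tf-idf document vectors because it ignores document length.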
3.5 Evaluation of Text Clustering
The quality of text clustering can be measured using the popular external
indexes [22]: F-measure, Purity and Entropy. These measures are called external quality measures
because the results of the clustering techniques are compared with known classes, i.e. the
documents must be given class labels in advance. The widely used quality measures for text
document clustering [22], F-measure, Purity and Entropy, are defined as follows:
3.5.1. F-measure
The F-measure combines the precision and recall values used in information retrieval. The
precision P(i,j) and recall R(i,j) of each cluster j for each class i are calculated as
P(i,j) = βij / βj    (6)
R(i,j) = βij / βi    (7)
where
βi : the number of members of class i
βj : the number of members of cluster j
βij : the number of members of class i in cluster j
The corresponding F-measure F(i,j) is given by the following equation:
F(i,j) = 2 · P(i,j) · R(i,j) / (P(i,j) + R(i,j))    (8)
Then the overall F-measure is defined as
F = Σi (βi / n) · maxj F(i,j)    (9)
where n is the total number of documents in the collection. In general, a larger F-measure
indicates a better clustering result.
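Equations (6)-(9) can be computed in Python from two parallel label lists, a hypothetical encoding where classes[d] and clusters[d] give document d's true class and assigned cluster:

```python
def f_measure(classes, clusters, n_classes, n_clusters):
    """Overall F-measure: for each class take the best F(i, j) over
    clusters, then weight by class size (Eqs. 6-9)."""
    n = len(classes)
    total = 0.0
    for i in range(n_classes):
        beta_i = sum(1 for c in classes if c == i)
        best = 0.0
        for j in range(n_clusters):
            beta_j = sum(1 for c in clusters if c == j)
            beta_ij = sum(1 for c, k in zip(classes, clusters)
                          if c == i and k == j)
            if beta_ij == 0 or beta_j == 0:
                continue
            p = beta_ij / beta_j          # precision P(i, j)
            r = beta_ij / beta_i          # recall R(i, j)
            best = max(best, 2 * p * r / (p + r))
        total += (beta_i / n) * best      # weight classes by size
    return total
```

A perfect clustering yields an F-measure of 1 regardless of how cluster labels are permuted relative to class labels.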
3.5.2. Purity
The purity measure of a cluster represents the percentage of correctly clustered documents; the
purity of a cluster j is defined as
Purity(j) = (1 / βj) · maxi βij    (10)
The overall purity of a clustering is a weighted sum of the individual cluster purities:
Purity = Σj (βj / n) · Purity(j)    (11)
In general, a larger purity value indicates a better clustering result.
3.5.3. Entropy
The entropy of a cluster measures the degree to which the cluster consists of objects of a single
class. The entropy of a cluster j is calculated using the standard formula
Ej = − Σi=1..L pij log(pij)    (12)
where
L : number of classes
pij : probability that a member of cluster j belongs to class i.
The total entropy of the overall clustering result is the weighted sum of the individual entropy
values of the clusters:
e = Σj=1..k (nj / n) Ej    (13)
where
k : number of clusters
nj : number of documents in cluster j
n : total number of documents in the corpus.
In general, a smaller entropy value indicates a better clustering result.
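Purity and entropy (Eqs. 10-13) can likewise be computed from parallel label lists; the combined function is an illustrative convenience:

```python
import math
from collections import Counter

def purity_and_entropy(classes, clusters):
    """Overall purity (Eqs. 10-11) and entropy (Eqs. 12-13).
    classes[d] / clusters[d]: true class and cluster of document d."""
    n = len(classes)
    purity_total, entropy_total = 0.0, 0.0
    for j in set(clusters):
        members = [c for c, k in zip(classes, clusters) if k == j]
        nj = len(members)
        counts = Counter(members)             # class counts within cluster j
        purity_total += (nj / n) * (max(counts.values()) / nj)
        ej = -sum((c / nj) * math.log(c / nj) for c in counts.values())
        entropy_total += (nj / n) * ej        # weight clusters by size
    return purity_total, entropy_total
```

A perfect clustering gives purity 1 and entropy 0; a cluster split evenly between two classes contributes an entropy of log 2.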
3.6 Datasets
Several text collections are available for evaluating the effectiveness of document clustering
algorithms. These collections are useful for research in information retrieval, natural language
processing, computational linguistics and other corpus-based research. The following real text
datasets have been selected for clustering purposes:
Reuters-21578: The Reuters-21578 test collection contains 21578 text documents, originally
taken from the Reuters newswire in 1987. The collection contains 22 files: each of the first 21
files (reut2-000.sgm through reut2-020.sgm) contains 1000 documents, while the last
(reut2-021.sgm) contains 578 documents. The documents are divided into five broad categories
(Exchanges, People, Topics, Organizations and Places), which are further divided into
subcategories. The Reuters-21578 test collection is available at [9].
20NewsGroup: The 20 Newsgroups data set contains 20,000 newsgroup articles from 20
newsgroups on a variety of topics. The dataset is available at
http://www.cs.cmu.edu/afs/cs/project/theo-20/www/data/news20.html and in the UCI machine
learning dataset repository at http://mlg.ucd.ie/datasets/20ng.html. This dataset was assembled
by Ken Lang.
Hamshahri: The Hamshahri collection is used for the evaluation of Persian information retrieval
systems. It consists of about 160,000 text documents and 100 queries in Persian and English.
There are in total 50350 query-document pairs for which relevance judgements are made; the
judgements are binary, either "1" (relevant) or "0" (irrelevant).
4. RELATED WORKS
Document clustering has been widely studied in the computer science literature. Several soft
computing techniques have been used for text document clustering, and some of them are
discussed here.
Wei Song and Soon Cheol Park [11] proposed a variable-string-length genetic algorithm for
automatically evolving the proper number of clusters as well as providing near-optimal data set
clustering. GA was used in conjunction with the reduced latent semantic structure to improve
clustering efficiency and accuracy; the objective function used was the Davies-Bouldin index. It
produced a much lower computational cost due to the reduction of dimensions.
K. Premalatha and A.M. Natarajan [17] presented a new hybrid clustering model based on GA
and PSO to solve the document clustering problem. The Genetic Algorithm is used in this model
to attain an optimal solution, and PSO, which is simple and efficient at navigating large search
spaces for an optimal solution, is combined with it. This hybrid model avoided premature
convergence and improved diversity.
Eisa Hasanzadeh et al. [27], [32] developed a text clustering algorithm based on PSO with latent
semantic indexing (PSO+LSI) for dimension reduction. LSI was used to reduce the very high
dimensionality of the textual data, which is the main problem in text clustering; the PSO family
of bio-inspired algorithms was thus successfully merged with LSI. [27] used an adaptive inertia
weight (AIW) that provides proper exploration and exploitation of the search space. This model
produced better clustering results than PSO+K-means using the vector space model, and
PSO+LSI was faster than PSO+K-means for all numbers of dimensions.
Stuti Karol and Veenu Mangat [43] introduced hybrid PSO-based algorithms: the two
partitioning clustering algorithms Fuzzy C-Means (FCM) and K-Means were each hybridized
with Particle Swarm Optimization (KPSO and FCPSO). The hybrid algorithms produced better
document clusters than the traditional partitioning clustering techniques (K-Means and Fuzzy
C-Means) without hybridization. It was concluded that FCPSO deals better with the overlapping
nature of documents than KPSO.
Nihal M. AbdelHamid, M. B. Abdel Halim and M. Waleed Fakhr [36] introduced the Bees
Algorithm for optimizing the document clustering problem. The Bees Algorithm avoids
convergence to local minima by performing global and local search simultaneously. The
proposed algorithm was tested on a data set containing 818 documents, and the results revealed
its robustness. Compared with the Genetic Algorithm and K-means, the Bees Algorithm
outperformed GA by 15% and K-means by 50%, although it took 20% more time than GA and
55% more than K-means.
Kayvan Azaryuon and Babak Fakhar [45] upgraded the standard ant clustering algorithm by
making the ants' movements completely random. The model increased clustering quality and
minimized run time compared with the standard ant clustering algorithm and the K-means
algorithm. The proposed algorithm was implemented to cluster documents in the Reuters-21578
collection, and the results showed a better average performance than the standard ant clustering
algorithm and the K-means algorithm.
Table 2. Comparison of various Optimization based document clustering methods
Paper referred | Dimension reduction applied | Dataset used | Evaluation parameters | Fitness function used
Wei Song, Soon Cheol Park [11] | Yes | Reuters-21578 | F-measure | Davies-Bouldin Index
K. Premalatha, A.M. Natarajan [17] | No | Silvia et al. 2003 | Fitness value | Cosine Distance
Eisa Hasanzadeh, Morteza Poyan rad and Hamid Alinejad Rokny [27], [32] | Yes | Reuters-21578, Hamshahri | F-measure, ADDC (Average Distance of Documents to the Cluster Centroid) | —
Stuti Karol, Veenu Mangat [43] | No | Reuters-21578 | F-measure, Entropy, ADDC (Average Distance of Documents to the Cluster Centroid) | —
Nihal M. AbdelHamid, M. B. Abdel Halim, M. Waleed Fakhr [36] | No | Web articles: www.cnn.com, www.bbc.com, news.google.com, www.2knowmyself.com | Fitness value | F-measure
Kayvan Azaryuon, Babak Fakhar [45] | No | Reuters-21578 | F-measure | ---
5. CONCLUSION
This paper has presented a survey of the research work done on text document clustering based
on optimization techniques. The survey started with a brief introduction to clustering in data
mining and soft computing, and explored various research papers related to text document
clustering. More research based on semantics needs to be carried out to improve the quality of
text document clustering.
REFERENCES
[1] Gerard M. Salton, Wong. A & Chung-Shu Yang, (1975), "A vector space Model for automatic
indexing", Communications of The ACM, Vol.18, No.11, pp. 613-620.
[2] Henry Anaya-Sánchez, Aurora Pons-Porrata & Rafael Berlanga-Llavori, (2010), "A document
clustering algorithm for discovering and describing topics", Elsevier, Pattern Recognition Letters,
Vol.31, No.15, pp. 502-510.
[3] Jiawei Han & Micheline Kamber, (2010), "Data mining concepts and techniques", Elsevier.
[4] Berry, M. (ed.), (2003), "Survey of Text Mining: Clustering, Classification, and Retrieval", Springer,
New York.
[5] Xu, R. & Wunsch, D., (2005), "Survey of Clustering Algorithms", IEEE Trans. on Neural Networks,
Vol.16, No.3, pp. 645-678.
[6] Xiaohui Cui & Thomas E. Potok, (2005), "Document Clustering Analysis Based on Hybrid
PSO+K-means Algorithm", Journal of Computer Sciences (Special Issue): 27-33, Science Publications.
[7] Porter, M.F., (1980), "An algorithm for suffix stripping", Program, 14: 130-137.
[8] T.E. Potok & Cui X., (2005), "Document clustering using particle swarm optimization", in
proceedings of the IEEE Swarm Intelligence Symposium, pp. 185-191.
[9] http://www.daviddlewis.com/resources/testcollections/reuters21578.
[10] Lotfi A. Zadeh, (1994), "Fuzzy Logic, Neural Networks, and Soft Computing", Communications of
the ACM, Vol.37, No.3, pp. 77-84.
[11] Wei Song & Soon Cheol Park, (2009), "Genetic Algorithm for text clustering based on latent
semantic indexing", Computers and Mathematics with Applications, 57, 1901-1907.
[12] Goldberg D.E., (1989), "Genetic Algorithms in Search, Optimization and Machine Learning",
Addison-Wesley Publishing Company Inc., London.
[13] Wei Song & Soon Cheol Park, (2006), "Genetic algorithm-based text clustering technique", in
Lecture Notes in Computer Science, pp. 779-782.
[14] R. Eberhart & J. Kennedy, (1995), "Particle swarm optimization", in proceedings of the IEEE
International Conference on Neural Networks, Vol.4, pp. 1942-1948.
[15] Van der Merwe.D.W & Engelbrecht A.P., (2003), "Data clustering using particle swarm
optimization", in Proc. IEEE Congress on Evolutionary Computation, Vol.1, pp. 215-220.
[16] Nihal M. AbdelHamid, M.B. AbdelHalim & M.W. Fakhr, (2003), "Document clustering using Bees
Algorithm", International Conference of Information Technology, IEEE, Indonesia.
[17] K. Premalatha & A.M. Natarajan, (2010), "Hybrid PSO and GA Models for Document Clustering",
International Journal of Advanced Soft Computing Applications, Vol.2, No.3, pp. 2074-8523.
[18] Shi, X.H., Liang Y.C., Lee H.P., Lu C. & Wang L.M., (2005), "An Improved GA and a Novel
PSO-GA-Based Hybrid Algorithm", Elsevier, Information Processing Letters, Vol.93, No.5, pp. 255-261.
[19] Kennedy, J. & Eberhart, R.C., (2001), "Swarm Intelligence", Morgan Kaufmann, 1-55860-595-9.
[20] Jain, A.K., M.N. Murty & P.J. Flynn, (1999), "Data clustering: A review", ACM Computing
Surveys, 31: 264-323.
[21] Priya Vaijayanthi, Natarajan A M & Raja Murugados, (2012), "Ants for Document Clustering",
ISSN 1694-0814.
[22] Michael Steinbach, George Karypis & Vipin Kumar, (2000), "A comparison of document clustering
techniques", in KDD Workshop on Text Mining.
[23] Ricardo Baeza-Yates & Berthier Ribeiro-Neto, (1999), "Modern Information Retrieval", ACM Press,
New York, Addison Wesley.
[24] Shafiei M., Singer Wang, Zhang.R., Milios.E, Bin Tang, Tougas.J & Spiteri.R, (2007), "Document
Representation and Dimension Reduction for Text Clustering", in IEEE Int. Conf. on Data Engg.
Workshop, pp. 770-779.
[25] Wei Song, Cheng Hua Li & Soon Cheol Park, (2009), "Genetic algorithm for text clustering using
ontology and evaluating the validity of various semantic similarity measures", Elsevier, Expert
Systems with Applications, Vol.36, No.5, pp. 9095-9104.
[26] J. Bezdek, R. Ehrlich & W. Full, (1984), "FCM: The Fuzzy C-Means Clustering Algorithm",
Computers & Geosciences, Vol.10, No.2-3, pp. 191-203.
[27] Eisa Hasanzadeh & Hamid Hasanpour, (2010), "PSO Algorithm for Text Clustering Based on Latent
Semantic Indexing", The Fourth Iran Data Mining Conference, Sharif University of Technology,
Tehran, Iran.
[28] Magnus Rosell, (2006), "Introduction to Information Retrieval and Text Clustering", KTH CSC.
[29] Tan, Steinbach & Kumar, (2009), "Introduction to data mining", Pearson Education.
[30] Trappey.A.J.C., Trappey.C.V., Fu-Chiang Hsu & Hsiao.D.W., (2009), "A Fuzzy Ontological
Knowledge Document Clustering Methodology", IEEE Transactions on Systems, Man, and
Cybernetics, Part B: Cybernetics, Vol.39, No.3, pp. 806-814.
[31] Jiayin KANG & Wenjun ZHANG, (2011), "Combination of Fuzzy C-means and Harmony Search
Algorithms for Clustering of Text Document", Journal of Computational Information Systems,
Vol.7, No.16, pp. 5980-5986.
[32] Eisa Hasanzadeh, Morteza Poyan rad and Hamid Alinejad Rokny, (2012), "Text clustering on latent
semantic indexing with particle swarm optimization (PSO) algorithm", International Journal of the
Physical Sciences, Vol.7(1), pp. 116-120.
[33] Nock.R & Nielsen.F., (2006), "On Weighting Clustering", IEEE Transactions On Pattern Analysis
And Machine Intelligence, Vol.28, No.8, pp. 1223-1235.
[34] Shady, Fakhri Karray & Kamel.M.S., (2010), "An Efficient Concept-Based Mining Model for
Enhancing Text Clustering", IEEE Trans. on Knowledge & Data Engg., Vol.22, No.10, pp. 1360-1371.
[35] Pham D.T., Ghanbarzadeh A., Koç E., Otri S., Rahim S. & M. Zaidi, "The Bees Algorithm – A Novel
Tool for Complex Optimisation Problems", in proceedings of IPROMS Conference, pp. 454-461.
[36] Nihal M. AbdelHamid, M. B. Abdel Halim, M. Waleed Fakhr, (2013), "Bees Algorithm-Based
Document Clustering", ICIT 2013, The 6th International Conference on Information Technology.
[37] Karypis.G, Eui-Hong Han & Kumar.V., (1999), "Chameleon: A Hierarchical Clustering Algorithm
using Dynamic Modelling", IEEE Computer, Vol.32, No.8, pp. 68-75.
[38] Zhang.T., Raghu Ramakrishnan & Livny.M., (1996), "Birch: An Efficient Data Clustering Method
for very Large Databases", in Proceedings of the ACM SIGMOD International Conference on
Management of Data, pp. 103-114.
[39] S. Guha, R. Rastogi & K. Shim, (1999), "ROCK: A robust clustering algorithm for categorical
attributes", International Conference on Data Engineering (ICDE '99), pp. 512-521.
[40] S.N. Sivanandam & S.N. Deepa, (2008), "Principles of Soft Computing", Wiley-India.
[41] Ivan Brezina Jr. & Zuzana Čičková, (2011), "Solving the Travelling Salesman Problem Using the
Ant Colony Optimization", Management Information Systems, Vol.6, No.4, pp. 010-014.
[42] Ming-Chuan Hung & Don-Lin Yang, (2001), "An Efficient Fuzzy C-Means Clustering Algorithm",
in proceedings of the IEEE Conference on Data Mining, pp. 225-232.
[43] Stuti Karol & Veenu Mangat, (2012), "Evaluation of a Text Document Clustering Approach based
on Particle Swarm Optimization", CSI Journal of Computing, Vol.1, No.3.
[44] Pham D.T., Ghanbarzadeh A., Koc E., Otri S., Rahim S. and Zaidi M., (2005), "The Bees
Algorithm", Notes by Manufacturing Engineering Centre, Cardiff University, UK.
[45] Kayvan Azaryuon & Babak Fakhar, (2013), "A Novel Document Clustering Algorithm Based on
Ant Colony Optimization Algorithm", Journal of Mathematics and Computer Science, Vol.7,
pp. 171-180.
Authors
R. Jensi received the B.E (CSE) and M.E (CSE) degrees in 2003 and 2010, respectively. Her research areas
include data mining, text mining (especially text clustering), natural language processing and semantic
analysis. She is currently pursuing a Ph.D (CSE) at Manonmaniam Sundaranar University, Tirunelveli. She
is a member of ISTE.
Dr. G. Wiselin Jiji received the B.E (CSE) and M.E (CSE) degrees in 1994 and 1998, respectively. She
received her doctorate from Anna University in 2008. Her research interests include classification,
segmentation and medical image processing. She is a member of ISTE.