IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL... (ijnlc)
The tremendous increase in the number of available research documents impels researchers to propose topic models that extract the latent semantic themes of a document collection. How to extract the hidden topics of a document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem as the size of the collection increases. In this paper, the Correlated Topic Model (CTM) with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library, and the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. The evaluation shows that the proposed approach is comparable, in terms of topic coherence, to LDA implemented in the MapReduce framework.
A Novel Approach to Mathematical Concepts in Data Mining (ijdmtaiir)
This paper describes three fundamental mathematical programming approaches that are relevant to data mining: feature selection, clustering and robust representation. It covers two clustering algorithms, the K-means and K-median algorithms. Clustering is illustrated by the unsupervised learning of patterns and clusters that may exist in a given database, and it is a useful tool for Knowledge Discovery in Databases (KDD). The results of the K-median algorithm are used to identify blood cancer patients in a medical database. K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The K-means algorithm is one of the simplest clustering techniques and is commonly used in medical imaging, biometrics and related fields.
Clustering Using Shared Reference Points Algorithm Based on a Sound Data Model (Waqas Tariq)
A novel clustering algorithm, CSHARP, is presented for finding clusters of arbitrary shapes and arbitrary densities in high-dimensional feature spaces. It can be considered a variation of the Shared Nearest Neighbor (SNN) algorithm, in which each sample data point votes for the points in its k-nearest neighborhood. Sets of points sharing a common mutual nearest neighbor are considered dense regions, or blocks. These blocks are the seeds from which clusters may grow. Therefore, CSHARP is not a point-to-point clustering algorithm; rather, it is a block-to-block clustering technique. Many of its advantages follow from two facts: noise points and outliers correspond to blocks of small size, and homogeneous blocks highly overlap. The technique is not prone to merging clusters of different densities or different homogeneity. The algorithm has been applied to a variety of low- and high-dimensional data sets with superior results over existing techniques such as DBScan, K-means, Chameleon, Mitosis and Spectral Clustering. The quality of its results, as well as its time complexity, ranks it at the front of these techniques.
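A minimal sketch of the shared-nearest-neighbor idea that underlies SNN and CSHARP: each point "votes" for its k nearest neighbors, and two points are similar in proportion to how many neighbors they share. The toy data, parameter values and function names below are illustrative, not taken from the paper.

```python
def knn_lists(points, k):
    """Return the index set of the k nearest neighbours of every point."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        neighbours.append({j for _, j in dists[:k]})
    return neighbours

def snn_similarity(neighbours, i, j):
    """Shared-nearest-neighbour similarity: size of the kNN-set overlap."""
    return len(neighbours[i] & neighbours[j])

# Two tight groups: points in the same group share neighbours, points in
# different groups share none.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nbrs = knn_lists(points, k=2)
```

Blocks in CSHARP's sense correspond to groups of points with high mutual overlap under this similarity; noise points end up with little or no overlap with anyone.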
Experimental Study of Data Clustering Using k-Means and Modified Algorithms (IJDKP)
The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its ease and simplicity of implementation. Clustering algorithms have broad appeal and usefulness in exploratory data analysis. This paper presents results of an experimental study of different approaches to k-Means clustering, comparing results on different datasets using the original k-Means and other modified algorithms implemented in MATLAB R2009b. The results are reported on several performance measures: number of iterations, number of points misclassified, accuracy, Silhouette validity index and execution time.
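One of the performance measures listed above, the Silhouette validity index, can be computed without any library support. The sketch below (toy data; the paper's datasets are not reproduced here) scores a clustering between -1 and 1, higher meaning tighter, better-separated clusters.

```python
import math

def silhouette(points, labels):
    """Mean silhouette value: s(i) = (b_i - a_i) / max(a_i, b_i), where
    a_i is the mean distance of point i to its own cluster and b_i the
    smallest mean distance from i to any other cluster."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])  # labels match the two groups
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])   # labels mix the two groups
```

A labeling that respects the natural groups scores near 1, while a labeling that mixes them scores much lower, which is exactly why the index is useful for comparing k-Means variants.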
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
A Survey on Efficient Enhanced K-Means Clustering Algorithm (ijsrd.com)
Data mining is the process of using technology to identify patterns and prospects from large amounts of information. In data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique that divides data into meaningful groups. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. In this paper, we present a comparison of different K-means clustering algorithms.
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place the data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can conflict with each other. Consequently, we define two new problems: the Minimum Cost Data Retrieval (MCDR) problem and the Large Number Data Retrieval (LNDR) problem. We also develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
A HYBRID CLUSTERING ALGORITHM FOR DATA MINING (cscpconf)
Data clustering is a process of arranging similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is higher than among groups. In this paper a hybrid clustering algorithm based on K-means and K-harmonic means (KHM) is described. The proposed algorithm is tested on five different datasets. The research is focused on fast and accurate clustering, and its performance is compared with the traditional K-means and KHM algorithms. The results obtained from the proposed hybrid algorithm are much better than those of the traditional K-means and KHM algorithms.
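The K-harmonic means side of such a hybrid replaces k-means' hard min-distance assignment with a harmonic average over all centres, which makes the objective much less sensitive to initialisation. A minimal sketch of the KHM objective (p = 2 is the common choice; the data and centre positions are illustrative):

```python
def khm_objective(points, centres, p=2, eps=1e-12):
    """KHM score to minimise: sum over points x of k / sum_j 1/d(x,c_j)^p."""
    k = len(centres)
    total = 0.0
    for x in points:
        inv = sum(
            1.0 / max(sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, eps) ** p
            for c in centres
        )
        total += k / inv
    return total

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_centres = [(0.0, 0.5), (10.0, 10.5)]  # one centre per natural group
bad_centres = [(5.0, 5.0), (6.0, 6.0)]     # both centres in the middle
```

Centres placed on the two natural groups score far lower (better) than centres dumped between them, and unlike the k-means objective this score changes smoothly as centres move, which is what a hybrid scheme exploits.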
Vchunk Join: An Efficient Algorithm for Edit Similarity Joins (Vijay Koushik)
Similarity join is an important technique in many applications, such as data integration, record linkage and pattern recognition. Here we introduce a new algorithm for similarity joins with edit distance constraints. Existing approaches extract overlapping grams from strings and consider only strings that share a certain number of grams as candidates. We propose instead extracting non-overlapping substrings, or chunks, from strings, with a chunking scheme based on a tail-restricted chunk boundary dictionary (CBD). This approach integrates existing methods for computing similarity with several new filters unique to chunk-based methods. A greedy algorithm automatically selects a good chunking scheme for a given data set. Results show that our method occupies less space and computes the join faster.
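A hedged sketch of the chunking idea: place a chunk boundary after every character that belongs to the chunk boundary dictionary, so the string splits into non-overlapping chunks. The CBD below is an arbitrary illustrative choice, not one a greedy selection would produce, and the sketch omits the tail-restriction and filtering machinery of the actual algorithm.

```python
def chunks(s, cbd):
    """Split s into non-overlapping chunks, each ending at a CBD character
    (the final chunk may simply end at the end of the string)."""
    out, start = [], 0
    for i, ch in enumerate(s):
        if ch in cbd:
            out.append(s[start:i + 1])
            start = i + 1
    if start < len(s):
        out.append(s[start:])
    return out

parts = chunks("similarity", {"i", "y"})
```

Because the chunks are non-overlapping and concatenate back to the original string, far fewer signatures per string are generated than with overlapping q-grams, which is the source of the space savings claimed above.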
Packet Classification using Support Vector Machines with String Kernels (IJERA Editor)
Since the inception of the internet, many methods have been devised to keep untrusted and malicious packets away from a user's system. Traffic/packet classification can be used as an important tool to detect intrusion in a system. Using machine learning as an efficient, statistically based approach for classifying packets is a novel method in practice today. This paper emphasizes using an advanced string kernel method within a support vector machine to classify packets. A related problem has previously been addressed with machine learning [2], but the studies mentioned in that paper are not up to date and do not account for modern string kernels that are much more efficient. My work extends their research by introducing different approaches to classifying encrypted/unencrypted traffic/packets.
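A minimal sketch of the kind of string kernel an SVM can use on packet data: the classic k-spectrum kernel, which counts substrings of length k shared by two strings. (In practice the inputs would be packet payloads; the strings and parameter k below are toy examples, and the paper's own kernel may differ.)

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k-spectrum kernel: inner product of the k-gram count vectors."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum(cx[sub] * cy[sub] for sub in cx)

# Similar payloads share many k-grams; dissimilar ones share few or none.
same = spectrum_kernel("GET /index.html", "GET /index.htm")
diff = spectrum_kernel("GET /index.html", "zzzzqqqq")
```

Any kernel of this form is a valid SVM kernel because it is an explicit inner product, so it can be plugged into a standard SVM implementation via a precomputed Gram matrix.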
Scalable Rough C-Means clustering using Firefly algorithm
Abhilash Namdev and B.K. Tripathy
Significance of Embedded Systems to IoT
P. R. S. M. Lakshmi, P. Lakshmi Narayanamma and K. Santhi Sri
Cognitive Abilities, Information Literacy Knowledge and Retrieval Skills of Undergraduates: A Comparison of Public and Private Universities in Nigeria
Janet O. Adekannbi and Testimony Morenike Oluwayinka
Risk Assessment in Constructing Horseshoe Vault Tunnels using Fuzzy Technique
Erfan Shafaghat and Mostafa Yousefi Rad
Evaluating the Adoption of Deductive Database Technology in Augmenting Criminal Intelligence in Zimbabwe: Case of Zimbabwe Republic Police
Mahlangu Gilbert, Furusa Samuel Simbarashe, Chikonye Musafare and Mugoniwa Beauty
Analysis of Petrol Pumps Reachability in Anand District of Gujarat
Nidhi Arora
CORRELATION OF EIGENVECTOR CENTRALITY TO OTHER CENTRALITY MEASURES: RANDOM, S... (csandit)
In this paper, we thoroughly investigate correlations of eigenvector centrality to five centrality measures: degree centrality, betweenness centrality, clustering coefficient centrality, closeness centrality, and farness centrality, on various types of network (random, small-world, and real-world). For each network, we compute those six centrality measures, from which the correlation coefficient is determined. Our analysis suggests that degree centrality and eigenvector centrality are highly correlated, regardless of the type of network. Furthermore, eigenvector centrality also correlates highly with betweenness on random and real-world networks; however, the correlation is inconsistent on small-world networks, probably owing to their power-law distribution. Finally, it is also revealed that eigenvector centrality is distinct from clustering coefficient centrality, closeness centrality and farness centrality in all tested cases. The findings in this paper could lead to further correlation analysis of multiple centrality measures in the near future.
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND... (csandit)
Floating point division, even though it is an infrequent operation in the traditional sense, is indispensable in a range of non-traditional applications such as K-Means clustering and QR decomposition, to name a few. In such applications, hardware support for floating point division would boost the performance of the entire system. In this paper, we present a novel architecture for a floating point division unit based on the Taylor-series expansion algorithm. We show that the Iterative Logarithmic Multiplier is very well suited for use as a part of this architecture. We propose an implementation of the powering unit that can calculate an odd power and an even power of a number simultaneously, while incurring little hardware overhead compared to the Iterative Logarithmic Multiplier.
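The mathematical core of Taylor-series division: write the divisor as d = 1 + x with |x| < 1 (after normalisation), approximate 1/d by the truncated series 1 - x + x^2 - x^3 + ..., and multiply by the dividend. A software sketch of that idea follows; the paper's contribution is the hardware powering unit that evaluates the series terms, which this sketch does not model.

```python
def taylor_divide(n, d, terms=30):
    """Approximate n / d for a normalised divisor d in (0, 2),
    using the geometric series 1/(1+x) = sum_i (-x)^i."""
    x = d - 1.0
    assert abs(x) < 1.0, "divisor must be normalised into (0, 2)"
    recip = sum((-x) ** i for i in range(terms))
    return n * recip

approx = taylor_divide(3.0, 1.25)  # series converges quickly for x = 0.25
```

In a floating point unit the mantissa is already in [1, 2), so the normalisation step is essentially free; the series truncation depth then trades hardware cost against precision.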
Improve the Performance of Clustering Using Combination of Multiple Clusterin... (ijdmtaiir)
The ever-increasing availability of textual documents has led to a growing challenge for information systems to effectively manage and retrieve the information comprised in large collections of texts according to the user's information needs. No single clustering method can adequately handle all sorts of cluster structures and properties (e.g. shape, size, overlap, and density). Combining multiple clustering methods is an approach to overcoming the deficiencies of single algorithms and further enhancing their performance. A disadvantage of cluster ensembles is the high computational load of combining the clustering results, especially for large and high-dimensional datasets. In this paper we propose a multi-clustering algorithm: a combination of a Cooperative Hard-Fuzzy Clustering model, based on intermediate cooperation between hard k-means (KM) and fuzzy c-means (FCM) to produce better intermediate clusters, with an ant colony algorithm. The proposed method gives better results than the individual clusterings.
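A sketch of where the computational load mentioned above comes from: the standard way to combine base clusterings is a co-association matrix that records, for every pair of points, how often they land in the same cluster. That costs O(n^2) space and time per ensemble, which is what cooperative schemes like the KM/FCM model try to avoid. (The labelings below are illustrative.)

```python
def co_association(labelings, n):
    """n x n matrix of pairwise same-cluster frequencies across labelings."""
    m = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    m[i][j] += 1.0 / len(labelings)
    return m

# Two base clusterings that agree on points 0-2 but disagree on point 3:
m = co_association([[0, 0, 0, 1], [0, 0, 0, 0]], n=4)
```

The combined matrix can then be fed to any similarity-based clusterer to produce the consensus clustering; the quadratic blow-up is unavoidable in this formulation, which motivates combining the algorithms during clustering instead of after it.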
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A Novel Multi-Viewpoint based Similarity Measure for Document Clustering (IJMER)
International Journal of Modern Engineering Research (IJMER) is a peer-reviewed online journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, Assessment, and many more.
k-Means is a rather simple but well-known algorithm for grouping objects, i.e. clustering. Again, all objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify. Each object can be thought of as a feature vector in an n-dimensional space, n being the number of features used to describe the objects to cluster. The algorithm then randomly chooses k points in that vector space; these points serve as the initial centers of the clusters. Afterwards, all objects are each assigned to the center they are closest to. Usually the distance measure is chosen by the user and determined by the learning task. After that, for each cluster a new center is computed by averaging the feature vectors of all objects assigned to it. The process of assigning objects and recomputing centers is repeated until the process converges, and the algorithm can be proven to converge after a finite number of iterations. Several tweaks concerning the distance measure, the initial center choice and the computation of new average centers have been explored, as well as the estimation of the number of clusters k, yet the main principle always remains the same. In this project we discuss the k-means clustering algorithm, its implementation and its application to the problem of unsupervised learning.
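The loop described above, as a compact, self-contained sketch. The random initial centers are replaced by the first k points here purely so the example is deterministic; everything else follows the assign/update cycle just described.

```python
def kmeans(points, k, iters=50):
    """Plain k-means with squared-Euclidean distance."""
    centres = [list(p) for p in points[:k]]  # deterministic seeding for demo
    for _ in range(iters):
        # Assignment step: each point goes to its closest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            c = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centres[i])))
            groups[c].append(p)
        # Update step: each centre becomes the mean of its assigned points.
        for i, g in enumerate(groups):
            if g:
                centres[i] = [sum(col) / len(g) for col in zip(*g)]
    return centres, groups

pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
       (9.0, 9.0), (9.5, 9.0), (9.0, 9.5)]
centres, groups = kmeans(pts, k=2)
```

On this data the loop converges in two iterations to one centre per natural group; with random seeding, as in the description above, different runs may converge to different local optima.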
K-Means clustering uses an iterative procedure which is very much sensitive and dependent upon the initial centroids. The initial centroids in the k-means clustering are chosen randomly, and hence the clustering also changes with respect to the initial centroids. This paper tries to overcome this problem of random selection of centroids and hence change of clusters with a premeditated selection of initial centroids. We have used the iris, abalone and wine data sets to demonstrate that the proposed method of finding the initial centroids and using the centroids in k-means algorithm improves the clustering performance. The clustering also remains the same in every run as the initial centroids are not randomly selected but through premeditated method.
EXPERIMENTS ON HYPOTHESIS "FUZZY K-MEANS IS BETTER THAN K-MEANS FOR CLUSTERING"IJDKP
Clustering is one of the data mining techniques that have been around to discover business intelligence by grouping objects into clusters using a similarity measure. Clustering is an unsupervised learning process that has many utilities in real time applications in the fields of marketing, biology, libraries, insurance, city-planning, earthquake studies and document clustering. Latent trends and relationships among data objects can be unearthed using clustering algorithms. Many clustering algorithms came into existence. However, the quality of clusters has to be given paramount importance. The quality objective is to achieve
highest similarity between objects of same cluster and lowest similarity between objects of different clusters. In this context, we studied two widely used clustering algorithms such as the K-Means and Fuzzy K-Means. K-Means is an exclusive clustering algorithm while the Fuzzy K-Means is an overlapping clustering algorithm. In this paper we prove the hypothesis “Fuzzy K-Means is better than K-Means for Clustering” through both literature and empirical study. We built a prototype application to demonstrate the differences between the two clustering algorithms. The experiments are made on diabetes dataset
obtained from the UCI repository. The empirical results reveal that the performance of Fuzzy K-Means is better than that of K-means in terms of quality or accuracy of clusters. Thus, our empirical study proved the hypothesis “Fuzzy K-Means is better than K-Means for Clustering”.
Particle Swarm Optimization based K-Prototype Clustering Algorithm iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition the data values with similarity. Similarity
measures are used to estimate transaction relationships. Hierarchical clustering model produces tree
structured results. Partitioned clustering produces results in grid format. Text documents are
unstructured data values with high dimensional attributes. Document clustering group ups unlabeled text
documents into meaningful clusters. Traditional clustering methods require cluster count (K) for the
document grouping process. Clustering accuracy degrades drastically with reference to the unsuitable
cluster count.
Textual data elements are divided into two types’ discriminative words and nondiscriminative
words. Only discriminative words are useful for grouping documents. The involvement of
nondiscriminative words confuses the clustering process and leads to poor clustering solution in return.
A variation inference algorithm is used to infer the document collection structure and partition of
document words at the same time. Dirichlet Process Mixture (DPM) model is used to partition
documents. DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with Ontology support. Semantic weight model is used for the document
similarity analysis. The system improves the scalability with the support of labels and concept relations
for dimensionality reduction process.
Mine Blood Donors Information through Improved K-Means Clusteringijcsity
The number of accidents and health diseases which are increasing at an alarming rate are resulting in a huge increase in the demand for blood. There is a necessity for the organized analysis of the blood donor database or blood banks repositories. Clustering analysis is one of the data mining applications and K-means clustering algorithm is the fundamental algorithm for modern clustering techniques. K-means clustering algorithm is traditional approach and iterative algorithm. At every iteration, it attempts to find the distance from the centroid of each cluster to each and every data point. This paper gives the improvement to the original k-means algorithm by improving the initial centroids with distribution of data. Results and discussions show that improved K-means algorithm produces accurate clusters in less computation time to find the donors information
Max stable set problem to found the initial centroids in clustering problemnooriasukmaningtyas
In this paper, we propose a new approach to solve the document-clustering using the K-Means algorithm. The latter is sensitive to the random selection of the k cluster centroids in the initialization phase. To evaluate the quality of K-Means clustering we propose to model the text document clustering problem as the max stable set problem (MSSP) and use continuous Hopfield network to solve the MSSP problem to have initial centroids. The idea is inspired by the fact that MSSP and clustering share the same principle, MSSP consists to find the largest set of nodes completely disconnected in a graph, and in clustering, all objects are divided into disjoint clusters. Simulation results demonstrate that the proposed K-Means improved by MSSP (KM_MSSP) is efficient of large data sets, is much optimized in terms of time, and provides better quality of clustering than other methods.
EFFECTIVENESS PREDICTION OF MEMORY BASED CLASSIFIERS FOR THE CLASSIFICATION O...cscpconf
Classification is a step by step practice for allocating a given piece of input into any of the given
category. Classification is an essential Machine Learning technique. There are many
classification problem occurs in different application areas and need to be solved. Different
types are classification algorithms like memory-based, tree-based, rule-based, etc are widely
used. This work studies the performance of different memory based classifiers for classification
of Multivariate data set from UCI machine learning repository using the open source machine
learning tool. A comparison of different memory based classifiers used and a practical
guideline for selecting the most suited algorithm for a classification is presented. Apart fromthat some empirical criteria for describing and evaluating the best classifiers are discussed
A Preference Model on Adaptive Affinity PropagationIJECEIAES
In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP). AP is a new data clustering technique that use iterative message passing and consider all data points as potential exemplars. Two important inputs of AP are a similarity matrix (SM) of the data and the parameter ”preference” p. Although the original AP algorithm has shown much success in data clustering, it still suffer from one limitation: it is not easy to determine the value of the parameter ”preference” p which can result an optimal clustering solution. To resolve this limitation, we propose a new model of the parameter ”preference” p, i.e. it is modeled based on the similarity distribution. Having the SM and p, Modified Adaptive AP (MAAP) procedure is running. MAAP procedure means that we omit the adaptive p-scanning algorithm as in original Adaptive-AP (AAP) procedure. Experimental results on random non-partition and partition data sets show that (i) the proposed algorithm, MAAP-DDP, is slower than original AP for random non-partition dataset, (ii) for random 4-partition dataset and real datasets the proposed algorithm has succeeded to identify clusters according to the number of dataset’s true labels with the execution times that are comparable with those original AP. Beside that the MAAP-DDP algorithm demonstrates more feasible and effective than original AAP procedure.
Preeti Kashyap, Babita Ujjainiya / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 3, May-Jun 2013, pp. 274-282
A Survey On Seeds Affinity Propagation
Preeti Kashyap, Babita Ujjainiya
(Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India)
(Assistant Professor, Department of Information Technology, SATI College, RGPV University, Vidisha (M.P.), India)
ABSTRACT
Affinity Propagation (AP) is a clustering method that finds data centers, or clusters, by sending messages between pairs of data points. Seeds Affinity Propagation is a novel semi-supervised text clustering algorithm based on AP. The AP algorithm cannot directly cope with partially labeled data. Focusing on this issue, a semi-supervised scheme called incremental affinity propagation clustering is presented in this paper, in which the pre-known information is represented by adjusting the similarity matrix. The standard affinity propagation clustering algorithm also suffers from a limitation: it is hard to know the value of the parameter "preference" that can yield an optimal clustering solution. This limitation can be overcome by a method named adaptive affinity propagation, which first finds the range of "preference" and then searches that range for a value that optimizes the clustering result.
Keywords- Affinity propagation, Clustering, Incremental, Partition adaptive affinity propagation and Text clustering.
I. INTRODUCTION
Clustering is the process of organizing objects into groups whose members are similar in some way. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than to points in different groups [1]. A cluster is therefore a collection of objects that are similar among themselves and dissimilar to the objects belonging to other clusters. Text clustering divides a set of texts into clusters so that the texts within each cluster are similar in content.

Clustering is an active area of research, finding applications in many fields. One of the most popular clustering methods is the k-means algorithm. k points are generated arbitrarily as initial centroids, where k is a user-specified parameter. Each point is assigned to the cluster with the closest centroid, and the centroid of each cluster is then updated by taking the mean of that cluster's data points. Some data points may move from one cluster to another; the new centroids are then recomputed and the data points reassigned to the suitable clusters. The assignment and centroid-update steps are repeated until a convergence criterion is met. In this algorithm, Euclidean distance is most commonly used to measure the distance between data points and centroids. Standard k-means has two limitations: (1) the number of clusters needs to be specified first, and (2) the clustering result is sensitive to the initial cluster centers.
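The k-means procedure described above can be sketched in a few lines. This is an illustrative sketch under the usual assumptions (random initial centroids, squared Euclidean distance), not code from the paper:

```python
import random

def kmeans(points, k, iters=100):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    centroids = random.sample(points, k)  # arbitrary initial centroids
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: mean of each non-empty cluster.
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters
```

Both limitations noted above are visible here: k must be supplied by the caller, and the result depends on the randomly sampled initial centroids.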
Traditional approaches for clustering data are based on metric similarities, i.e., measures that are symmetric, nonnegative, and satisfy the triangle inequality. More recent approaches, like the Affinity Propagation (AP) algorithm [2], can also take general non-metric similarities as input. In the domain of image clustering, AP can take as input similarities between selected segments of image pairs [3]. AP has been used to solve a wide range of clustering problems [4] and to predict individual preferences [5]. The clustering performance depends on the message-updating frequency and the similarity measure. AP has been used in text clustering for its simplicity, good performance, and general applicability, for example to preprocess text for text clustering, and it has been combined with a parallel strategy for clustering e-learning resources. However, AP was used only as an unsupervised algorithm and did not consider any structural information derived from the specific documents. For text mining tasks, the vector space model (VSM) treats a document as a bag of words and uses plain language words as features [6]. This model can represent text mining problems directly and easily, but as the data set grows the vector space becomes sparse and high dimensional, and the computational complexity grows exponentially. In many practical applications, unsupervised learning lacks relevant information, whereas supervised learning needs a large amount of initial class-label information, which requires time and expensive human labor [7], [8]. In recent years, semi-supervised learning has captured a great deal of attention [9], [10]. Semi-supervised learning is a machine learning approach in which the model is constructed using both labeled and unlabeled data for training, typically a small amount of labeled data and a large amount of unlabeled data.
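As an illustration of the vector space model mentioned above, a minimal bag-of-words sketch (a hypothetical helper, not from the paper) builds a term-frequency vector for each document over the collection's vocabulary:

```python
from collections import Counter

def vsm_vectors(docs):
    """Bag-of-words sketch: each document becomes a term-frequency
    vector over the vocabulary of the whole collection."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w, n in Counter(d.lower().split()).items():
            v[index[w]] = n
        vectors.append(v)
    return vocab, vectors
```

Even this toy example shows the sparsity problem: every document's vector has one slot per word in the entire collection, most of them zero.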
II. RELATED WORK
This section reviews the studies on Affinity Propagation to date.
2.1 Affinity Propagation
AP is a new and powerful technique for exemplar learning. It is a fast clustering algorithm, especially when the number of clusters is large, and it has several advantages: speed, general applicability, and good performance. In brief, AP works on similarities between pairs of data points and simultaneously considers all data points as potential cluster centers, called exemplars. It computes two kinds of messages exchanged between data points: the first is called "responsibility" and the second "availability". Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared Euclidean distance: for points xi and xk,

s(i,k) = -||xi - xk||^2 (1)
The two kinds of messages are exchanged
between data points, and each takes into account a
different kind of competition. Messages can be
combined at any stage to decide which points are
exemplars and, for every other point, which
exemplar it belongs to. The ―availability‖ a(i, k)
message is sent from candidate exemplar point j to
point i and it reflects the accumulated evidence for
how appropriate it would be for point i to choose
point j as its exemplar. The responsibility‖ r(i, k)
message is sent from data point i to candidate
exemplar point j and it reflects the accumulated
evidence for how well-suited point j is to serve as
the exemplar for point i. At the beginning, the
availabilities are initialized to zero: a(i ,j) = 0.Then,
the responsibilities are computed using the rule
r(i,k) ← s(i,k) ─ max{ a(i,k') + s(i,k')} (2)
The availabilities are zero in the first
iteration, r(i,k) is set to the input similarity between
point i and k as its exemplar, minus the largest of
the similarities between point i and other candidate
exemplars. This update is data-driven and does not
take into account how many other points favor each
candidate exemplar. In later iterations, when some
points are effectively assigned to other exemplars,
their availabilities will drop below zero. These
negative availabilities will decrease the effective
values of some of the input similarities s(i,k′) ,
removing the corresponding candidate exemplars
from competition. For k = i, the responsibility r(k,k)
is set to the input preference that point k be chosen
as an exemplar, s(k,k), minus the largest of the
similarities between point i and all other candidate
exemplars. This self-responsibility reflects that point
k is an exemplar.
a(i,k) ←min{0, r(k,k) + ∑ max{0, r(i' ,k)}} (3)
The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from the other points. Only the positive portions of incoming responsibilities are added, because a good exemplar need only explain some data points well, regardless of how poorly it explains other data points. A negative self-responsibility r(k,k) indicates that point k is currently better suited to belong to another exemplar than to be an exemplar itself; even so, the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. The total sum is thresholded so that it cannot go above zero, to limit the influence of strong incoming positive responsibilities. The "self-availability" a(k,k) is updated differently:

a(k,k) ← Σ_{i'≠k} max{0, r(i',k)} (4)

This message reflects the accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from the other points. The above update rules require only simple, local computations, and messages need only be exchanged between pairs of points with known similarities. Availabilities and responsibilities can be combined to identify exemplars at any stage of affinity propagation: for point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i itself as an exemplar (if k = i) or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after the changes in the messages fall below a threshold, or after a fixed number of iterations. To avoid numerical oscillations that arise in some circumstances, it is important that the messages be damped when updated: each message is set to λ times its value from the previous iteration plus (1 - λ) times its prescribed updated value, where the damping factor λ is between 0 and 1. Each iteration of affinity propagation consists of:
1. Updating all responsibilities given the availabilities.
2. Updating all availabilities given the responsibilities.
3. Combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm.
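The update rules (2)-(4) and the damped iteration above can be sketched as follows. This is an illustrative implementation of standard AP message passing, not the authors' code, and the preference value used in the usage note is an assumption:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Sketch of AP message passing. S is an n x n similarity matrix
    whose diagonal holds the "preference" values; returns, for each
    point i, the index k maximising a(i,k) + r(i,k)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i,k)
    A = np.zeros((n, n))  # availabilities a(i,k)
    for _ in range(iters):
        # Eq. (2): r(i,k) <- s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = np.max(AS, axis=1)
        Rnew = S - first_max[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * Rnew  # damped update
        # Eq. (3): a(i,k) <- min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))  # keep r(k,k) itself in the column sums
        Anew = np.minimum(0, Rp.sum(axis=0)[None, :] - Rp)
        # Eq. (4): a(k,k) <- sum_{i'!=k} max(0, r(i',k))
        np.fill_diagonal(Anew, Rp.sum(axis=0) - np.diag(Rp))
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)
```

For example, with four 1-D points forming two tight pairs, similarities s(i,k) = -(xi - xk)^2, and a moderate preference on the diagonal, the sketch recovers the two pairs as clusters.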
2.1.1 Disadvantages of Affinity Propagation
1. It is hard to know the value of the parameter "preference" that can yield an optimal clustering solution.
2. When oscillations occur, AP cannot automatically eliminate them.
2.2 Seeds Affinity Propagation
Seeds Affinity Propagation is based on the AP method. The main new features of the algorithm are Tri-Set computation, similarity computation, seeds construction, and message transmission [11]. The similarity measurement is built from three feature sets: the Co-feature Set (CFS), the Unilateral Feature Set (UFS), and the Significant Co-feature Set (SCS). The structural information of the text documents is included in the new similarity measurement.
2.2.1 Similarity Measurement
Similarity measurement plays a major role in Affinity Propagation clustering. To give a specific and effective similarity measurement for our particular domain, i.e., text documents, the three feature sets are used: the Co-feature Set, the Unilateral Feature Set, and the Significant Co-feature Set. In this approach, each term in a text is still treated as a feature and each document as a vector, but the features and vectors are not computed simultaneously; they are computed one at a time. Let D = {d1, d2, ..., dN} be a set of texts. Suppose di and dj are two objects in D; they can be represented by the following two sets of two-tuples:

di = {<fi^1, ni^1>, <fi^2, ni^2>, ..., <fi^L, ni^L>},
dj = {<fj^1, nj^1>, <fj^2, nj^2>, ..., <fj^M, nj^M>},

where fi^x and fj^y (1 ≤ x ≤ L, 1 ≤ y ≤ M) in the two-tuples <fi^x, ni^x> and <fj^y, nj^y> represent the xth and yth features of di and dj, respectively; ni^x and nj^y are the values of fi^x and fj^y; and L and M are the feature counts of the two objects.

Let Fi and Fj be the feature sets of the two objects, respectively: Fi = {fi^1, fi^2, ..., fi^L} and Fj = {fj^1, fj^2, ..., fj^M}. Let DFj be the set composed of the "most significant" features of dj, i.e., features capable of representing crucial aspects of the document. These most significant features could be the key phrases and/or tags associated with each document when available, or all the words except stop words in the title of each document. The Venn diagram is shown in Fig. 1.
Definition 1. Co-feature Set. Let di and dj be two objects in a data set. Suppose that some features of di also belong to dj. A two-tuples subset consisting of these features and their values in dj can be constructed; this is the Co-feature Set.

Definition 2. Unilateral Feature Set. Suppose that some features of di do not belong to dj. A two-tuples subset consisting of these features and their values in di can be constructed; this is the Unilateral Feature Set.

Definition 3. Significant Co-feature Set. Suppose that some features of di also belong to the most significant features of dj. A two-tuples subset consisting of these features and their values as the most significant features in dj can be constructed; this is the Significant Co-feature Set.

The generic definition of similarity measures based on the cosine coefficient is thus extended by introducing the three new sets CFS, UFS, and SCS, yielding the Tri-Set similarity. This extended similarity measure can reveal both the difference and the asymmetric nature of similarities between documents. It is quite effective in applications of Affinity Propagation clustering to text documents, image processing, and so on, since it can deal with asymmetric problems. The combination of this new similarity with conventional Affinity Propagation is named the Tri-Set Affinity Propagation (AP(Tri-Set)) clustering algorithm.

Fig. 1. Three kinds of relations between the two feature subsets of di and dj. Fi and Fj are their feature subsets, DFj is the set of most significant features of dj, and D is the whole data set.
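The three feature sets of Definitions 1-3 can be illustrated with a small sketch. The function name and the dict-based document representation (feature -> value) are hypothetical, introduced only for illustration:

```python
def tri_set(di, dj, dfj):
    """Build the three Tri-Set subsets for documents di and dj,
    given as dicts mapping feature -> value, where dfj is the set
    of dj's most significant features."""
    # Co-feature Set: features of di that also occur in dj,
    # paired with their values in dj.
    cfs = {f: dj[f] for f in di if f in dj}
    # Unilateral Feature Set: features of di absent from dj,
    # paired with their values in di.
    ufs = {f: di[f] for f in di if f not in dj}
    # Significant Co-feature Set: features of di that are among dj's
    # most significant features, paired with their values in dj.
    scs = {f: dj[f] for f in di if f in dfj}
    return cfs, ufs, scs
```

Note how CFS and SCS take their values from dj while UFS takes them from di, which is the source of the asymmetry the Tri-Set similarity exploits.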
2.2.2 Seed Construction
In semi-supervised clustering, the main goal is to efficiently cluster a large number of unlabeled objects starting from a relatively small number of initially labeled objects. Given a few initial labeled objects, they can be used to construct efficient initial seeds for the Affinity Propagation clustering algorithm. To avoid a blind search, to guarantee precision for the seeds, and to avoid imbalance errors, a specific seed construction method is presented, namely Mean Features Selection. Let NO, ND, NF, and FC represent, respectively, the object number, the most significant feature number, the feature number, and the feature set of cluster C in the labeled set. Suppose F is the feature set and DF is the most significant feature set of seed c. For example, DF of this manuscript could be all the words except stop words in the title, i.e., {Survey, Seeds, Affinity, Propagation}. Let fk, fk' ∈ FC, let their values in cluster c be nk and nk', and let their values as most significant features be nDk (0 ≤ nDk ≤ nk) and nDk' (0 ≤ nDk' ≤ nk'). The seed construction method is prescribed as:

1. fk ∈ F iff nk ≥ (Σ nk) / NO;
2. fk' ∈ DF iff nDk' ≥ (Σ nDk') / NO.

This method can quickly find the representative features in the labeled objects. Seeds are made up of these features and their values in the different clusters; they should be more representative and discriminative than the normal objects. For seeds, the self-similarities are set to +∞ to ensure that the seeds will be chosen as exemplars and to help the algorithm obtain the exact cluster number. The combination of this semi-supervised strategy with the classical similarity measurement and conventional Affinity Propagation is named the Seeds Affinity Propagation with Cosine coefficient (SAP(CC)) clustering algorithm. By introducing both the seed construction method and the new similarity measurement into conventional AP, the complete Seeds Affinity Propagation algorithm can be defined.
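The Mean Features Selection rule above can be sketched as follows. The function name and inputs are hypothetical; the sketch assumes counts maps each feature of a labeled cluster to its total value nk and that NO is the number of labeled objects in that cluster:

```python
def mean_features_selection(counts, n_objects):
    """Sketch of the Mean Features Selection rule: a feature f_k is
    kept as a seed feature iff its value n_k is at least
    (sum of all n_k) / N_O, where N_O is the object number."""
    threshold = sum(counts.values()) / n_objects
    return {f for f, n in counts.items() if n >= threshold}
```

The same rule, applied to the most-significant-feature values nDk, would select the seed's DF set.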
2.2.3 Seeds Affinity Propagation Algorithm
Based on the definitions of SCS, UFS and the
described seeds’ construction method, the SAP
algorithm is developed, with this sequence of steps:
1. Initialization: Let the data set D be an N (N >0)
terms superset where each term consists of a
sequence of two-tuples:
D = {<f1
1
,n1
1
>, <f1
2
, n1
2
>, ….<f1
M1
, n1
M1
>}…
{<fN
1
, nN
1
>, <fN
2
,nN
2
>…..<fN
MN
,nN
MN
>}
where Mx represents the count of the xth object’s
feature.
2. Seeds construction: Constructing seeds from a
few labeled objects according to Mean Features
Selection. Adding these new objects into the
data set D, and getting a new data set D’ which
contains N’ terms (N≤N' );
3. Tri-Set computation: Compute the Co-feature
Set, Unilateral Feature Set, and Significant
Co-feature Set between objects i and j.
4. Similarity computation: Compute the
similarities among objects in D'.
5. Self-similarity computation: Compute the
self-similarities for each object in D'.
6. Message initialization: Initialize the matrices of
messages:
r(i,j) = s(i,j) − max_{j'≠j} {s(i,j')}, a(i,j) = 0
7. Message matrix computation: Compute the
matrices of messages.
8. Exemplar selection: Add the two message
matrices and search for the exemplar of each
object i, which is the j maximizing r(i,j) + a(i,j).
9. Message updating: Update the messages.
10. Iteration: Repeat steps 7, 8, and 9 until the
exemplar selection outcome stays constant for a
number of iterations, or end the algorithm after
a fixed number of iterations.
To summarize, the approach starts with the definition
of three new relations between objects. Then the
three feature sets are assigned different weights to
give a new similarity measurement. Finally, a fast
initial seeds construction method is defined and the
steps of the new Seeds Affinity Propagation
algorithm are detailed for the general case.
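The overall SAP idea, cosine similarities plus seeds whose self-similarity is set very high so that they are forced to become exemplars, can be sketched with scikit-learn's generic AffinityPropagation as a stand-in for the paper's own implementation; the toy term-frequency vectors and the preference values are illustrative:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Toy term-frequency vectors: two topics; rows 0 and 3 act as labeled seeds.
X = np.array([
    [5, 4, 0, 0],   # seed for topic A
    [4, 5, 1, 0],
    [5, 3, 0, 1],
    [0, 0, 5, 4],   # seed for topic B
    [1, 0, 4, 5],
    [0, 1, 5, 4],
], dtype=float)

S = cosine_similarity(X)          # classical cosine-coefficient similarity
pref = np.full(len(X), -1.0)      # ordinary objects: low self-similarity
pref[[0, 3]] = 10.0               # seeds: very high self-similarity (stands in for +inf)

ap = AffinityPropagation(affinity="precomputed", preference=pref,
                         damping=0.9, max_iter=1000, random_state=0)
labels = ap.fit_predict(S)
```

With seed preferences far above every pairwise similarity, the seeds become the exemplars and the cluster count equals the number of seeds, which is exactly the effect of setting seed self-similarities to +∞.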
2.2.4 Advantages of Seeds Affinity Propagation
1. Reduces the time complexity and improves the
accuracy.
2. Avoids random initialization and being trapped
in local minima.
3.1 Incremental Affinity Propagation
A semi-supervised scheme called
incremental affinity propagation clustering is
presented in [12]. Incremental Affinity Propagation
integrates the AP algorithm and semi-supervised
learning. The labeled data information is coded into
the similarity matrix, and incremental learning is
performed to amplify the prior knowledge. To
examine the effectiveness of this method, it is
applied to the text clustering problem. In this
method, the known class information is initially
coded into the similarity matrix. Then, after running
AP for a certain number of iterations, the most
convinced data are put into the labeled data set and
the similarity matrix is reset. This process is
repeated until all the data are labeled. Compared
with the method in [13], the handling of the
constrained condition in this scheme is softer and
more objective. Furthermore, the introduction of
incremental learning amplifies the pre-knowledge
about the target data set and therefore leads to a
more accurate result. Also, the parameter
of "preference" in the method is self-adaptive,
mainly according to the target number of clusters.
Focusing on the text clustering problem, the Cosine
coefficient method [14] is used to compute the
similarity between two different points (texts) in the
specific I-APC algorithm.
3.1.1 Incremental Affinity Propagation
Algorithm
In most real clustering problems, a part of the data
set, generally a minority of the whole, may have
been labeled beforehand. To take advantage of this
valuable and possibly expensive resource, a
semi-supervised scheme based on AP, called the
Incremental Affinity Propagation Clustering
(I-APC) algorithm, is presented.
The similarity between data points i and j, s(i,j),
indicates how well data point i is suited to be the
exemplar for data point j. If both points are labeled
with the same class, they are more likely to support
each other to be an exemplar; therefore, s(i,j)
should be set much larger than usual. On the
contrary, if the two points are labeled differently,
s(i,j) should be set much smaller.
In this method, an incremental learning technique is
applied to amplify the pre-knowledge. After each
run of AP, the most convinced data point is added to
the labeled data set. For a non-labeled data point k,
a score function score(k,i) represents how likely the
point belongs to cluster i:

Score(k,i) = α·∑_{j∈L∩I} (a(k,j) + r(k,j)) + β·∑_{j∈I∖L} (a(k,j) + r(k,j))   (5)

where L is the labeled data set, I is the data set
contained in the i-th cluster according to the last
run of AP, and α and β are the coefficients of the
so-called convinced and non-convinced items,
respectively, with α > β > 0. Then a convinced
function of point k can be defined from the score
function:
conv(k) = max_m score(k,m) / min_n score(k,n)   (6)

The most convinced data point is then selected by
maximizing the convinced function value among the
non-labeled data.
After the labeled data set is updated, the similarity
matrix is reset according to the newly labeled data.
The effect of the labeled data decreases as the
labeling time increases, so if points i and j are both
in the labeled data set and at least one of them is
newly labeled, s(i,j) is set as a function of the
program time t:

s(i,j) = −|xi − xj| · (A + e^(−Bt))^(1 − 2·|sign(Ci − Cj)|)   (7)

where

sign(x) = 1 if x > 0; 0 if x = 0; −1 if x < 0,

A and B are two constants, t is the program time,
which can be taken as the iteration number of the
AP run, and Ci and Cj are the label numbers of xi
and xj, respectively.
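Equation (7) can be sketched as a small function. The reading of the exponent (same-class pairs scaled by the factor, differently labeled pairs divided by it) and the example constants A and B are assumptions made for illustration, since the original typesetting is ambiguous:

```python
import math

def labeled_similarity(xi, xj, ci, cj, t, A=0.5, B=0.1):
    """One reading of Eq. (7):
    s(i,j) = -|xi - xj| * (A + e^(-B*t))^(1 - 2*|sign(ci - cj)|).

    With A + e^(-B*t) < 1, same-class pairs (exponent +1) get a similarity
    closer to zero (i.e. larger), while differently labeled pairs
    (exponent -1) get a more strongly negative similarity.
    """
    d = math.dist(xi, xj)            # Euclidean distance |x_i - x_j|
    sgn = (ci > cj) - (ci < cj)      # sign(C_i - C_j) in {-1, 0, 1}
    expo = 1 - 2 * abs(sgn)          # +1 if same class, -1 otherwise
    return -d * (A + math.exp(-B * t)) ** expo
```

At t = 20 with A = 0.5 the factor is about 0.64, so a same-class pair at distance 5 gets s ≈ −3.2 while a differently labeled pair at the same distance gets s ≈ −7.9.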
During the process, the preference s(i,i) is also
adjustable. After a run of AP, when the resulting
number of clusters is larger than expected, the
values of the preferences should be reduced as a
whole, and vice versa. The rule is as follows:

s(i,i) = s(i,i) · 1 / (1 + e^(−K′/K))   (8)

where K is the expected number of clusters and K′
the resulting number of clusters.
The responsibilities and availabilities keep their
values from the end of the last run as the initial
values of the next run, which speeds up the
convergence of the algorithm.
In summary, the I-APC scheme can be described
as follows:
1. Initialize: the labeled data set L, the
responsibility and availability matrices, the
similarities between different points (according
to Eqs. (1) and (7)), and the self-similarities
(set to a common value).
2. If the size of L reaches a pre-given number P,
go to step 5; else run AP.
3. Select the most convinced data point according
to Eqs. (5)-(6) and update L. Reset the
similarities between different points according
to Eq. (7) and the self-similarities according to
Eq. (8).
4. Go to step 2.
5. Output results; end.
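The control flow of steps 1-5 can be sketched as follows; the three callables are hypothetical placeholders for the AP run itself and for Eqs. (5)-(8):

```python
def iapc(labels_known, run_ap, pick_most_convinced, reset_similarities, P):
    """Control flow of the I-APC scheme (steps 1-5); the three callables
    stand in for the AP run and for Eqs. (5)-(8)."""
    L = dict(labels_known)                        # step 1: labeled data set L
    while len(L) < P:                             # step 2: stop once |L| >= P
        result = run_ap(L)                        # run AP with current similarities
        k, label = pick_most_convinced(result, L) # step 3: Eqs. (5)-(6)
        L[k] = label
        reset_similarities(L, k)                  # step 3: Eqs. (7)-(8)
    return L                                      # step 5: output the labeling
```

The loop structure mirrors the flowchart of Fig. 2: run AP, absorb the most convinced point into L, reset the similarity matrix, and repeat until |L| reaches P.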
The flow chart of the I-APC scheme is shown in Fig
2.
Fig 2. Flowchart of I-APC scheme
3.1.2 Incremental Affinity Propagation For Text
Clustering
Text clustering is the process of dividing a whole
set of texts into several groups according to the
similarities among all the texts. It is an important
research field of text mining and is often used as a
benchmark for newly developed clustering methods.
For text clustering, each text is considered as a
point in the problem space, so the similarity
between two different points can no longer be set as
the negative squared Euclidean distance.
Several similarity measurements are commonly used
in information retrieval. The simplest is the simple
matching coefficient, which counts the number of
shared terms. A more powerful method is the Cosine
coefficient, which is therefore used to measure
point similarity in this method. During the process,
the similarity matrix is reset according to the newly
labeled data, and s(i,j) is a time-dependent function
based on the original distance.
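The Cosine coefficient between two term-weight vectors can be computed directly from its definition (dot product over the product of the vector norms):

```python
import math

def cosine_coefficient(u, v):
    """Cosine coefficient between two term-weight vectors (bag-of-words)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Vectors pointing in the same direction score 1.0; texts with no shared terms score 0.0, which is why the coefficient suits sparse term vectors better than Euclidean distance.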
3.1.3 Disadvantages of Incremental Affinity
Propagation
1. The I-APC method costs more CPU time than AP.
2. The performance degrades when the incremental
process runs too many times, so the selection of
the threshold is important.
4.1 Adaptive Affinity Propagation
The affinity propagation clustering
algorithm suffers from the limitation that it is hard
to know the value of the parameter "preference"
which yields an optimal clustering solution. This
limitation can be overcome by a method named
adaptive affinity propagation [15]. The method first
finds the range of the preference, then searches the
preference space for a good value which optimizes
the clustering result.
4.1.1 Computing the range of preferences
Affinity propagation tries to maximize the
net similarity [16]. Net similarity is a score for
explaining the data, and it represents how
appropriate the exemplars are. The score sums up
all similarities between data points and their
exemplars; the similarity between an exemplar and
itself is the preference of the exemplar. Affinity
propagation aims at maximizing the net similarity
and tests each data point as a potential exemplar.
Therefore, the method for computing the range of
preferences can be developed as shown in Fig. 3.
Fig. 3: The procedure of computing the range of
preferences
The maximum preference ( pmax) in the
range is the value which clusters the N data points
Algorithm 1: Preference Range Computing
Input: s(i,k): the similarity between point i and
point k (i ≠ k)
Output: the maximal and minimal values of
preferences, pmax and pmin
Step 1: Initialize s(k,k) to zero: s(k,k) = 0
Step 2: Compute the maximal value of preferences:
pmax = max {s(i,k)}
Step 3: Compute the minimal value of preferences:
Step 3.1: Compute the net similarity when the
number of clusters is 1:
dpsim1 = max_k {∑_i s(i,k)}
Step 3.2: Compute the net similarity when the
number of clusters is 2:
dpsim2 = max_{i,j} {∑_k max{s(i,k), s(j,k)}}
Step 3.3: pmin = dpsim1 − dpsim2
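Algorithm 1 translates directly into code; the following sketch assumes a dense NumPy similarity matrix and a brute-force search over exemplar pairs:

```python
import numpy as np

def preference_range(S):
    """Compute [pmin, pmax] per Algorithm 1 from a similarity matrix S.

    pmax: largest off-diagonal similarity.
    pmin: dpsim1 - dpsim2, where dpsim1 is the best one-exemplar net
    similarity and dpsim2 the best two-exemplar net similarity.
    """
    S = np.asarray(S, dtype=float).copy()
    n = S.shape[0]
    np.fill_diagonal(S, 0.0)                        # step 1: s(k,k) = 0
    pmax = S[~np.eye(n, dtype=bool)].max()          # step 2
    dpsim1 = S.sum(axis=0).max()                    # step 3.1: best single exemplar
    dpsim2 = -np.inf                                # step 3.2: best exemplar pair
    for i in range(n):
        for j in range(i + 1, n):
            dpsim2 = max(dpsim2, np.maximum(S[i], S[j]).sum())
    return dpsim1 - dpsim2, pmax                    # step 3.3
```

The pair search is O(n^3); for the small matrices used in preference tuning this is usually acceptable.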
into N clusters; this equals the maximum similarity,
since a preference lower than that would make it
better for the data point associated with that
maximum similarity to be assigned as a cluster
member rather than an exemplar.
The derivation of pmin is similar. Suppose there is a
particular preference p' such that the optimal net
similarity for one cluster (K = 1) and the optimal
net similarity for two clusters (K = 2) are the same.
The optimal net similarity for two clusters can be
obtained by searching through all possible pairs of
exemplars, and its value is dpsim2 + 2p'. If there is
one cluster, the value of the optimal net similarity
is dpsim1 + p'. The minimum preference pmin leads to
clustering the N data points into one cluster. Since
affinity propagation aims at maximizing the net
similarity, dpsim1 + p' ≥ dpsim2 + 2p', so
p' ≤ dpsim1 − dpsim2. As pmin is no more than p',
the minimum value for preferences is
pmin = dpsim1 − dpsim2.
4.1.2 Adaptive Affinity Propagation Clustering
After computing the range of preferences,
the preference space can be scanned to find the
optimal clustering result. Different preferences
result in different clusterings, and cluster
validation techniques are used to evaluate which
clustering result is optimal for the dataset. The
preference step used to scan the space adaptively is:

pstep = (pmax − pmin) / (N · (0.1·K + 50))

In order to sample the whole space, the base of the
scanning step is set as (pmax − pmin)/N. A fixed
increasing step cannot meet the different
requirements of different cases, such as more
clusters versus fewer clusters, because the
more-clusters case is more sensitive than the
fewer-clusters case. Therefore, similar to Wang's
adaptive step method [17], an adaptive coefficient
is applied:

q = 1 / (0.1·K + 50)

In this way, the step pstep varies with the number
of clusters K: when K is large, pstep is small, and
vice versa.
In this paper, the global silhouette index is used as
the validity index. The silhouette was introduced by
Rousseeuw [18] as a general graphical aid for the
interpretation and validation of cluster analysis;
it measures how well a data point is classified when
it is assigned to a cluster, according to both the
tightness of the clusters and the separation between
them. The global silhouette index is defined as:

GS = (1/nc) · ∑_{j=1..nc} Sj

where the local silhouette index is:

Sj = (1/rj) · ∑_i (b(i) − a(i)) / max{a(i), b(i)}

where rj is the number of objects in class j, a(i) is
the average distance between object i and the
objects in the same class j, and b(i) is the minimum
average distance between object i and the objects in
the class closest to class j.
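The global silhouette index can be computed directly from these definitions; a plain NumPy sketch using Euclidean distances:

```python
import numpy as np

def global_silhouette(X, labels):
    """Global silhouette index GS = (1/nc) * sum_j S_j, with
    S_j = (1/r_j) * sum_{i in j} (b(i) - a(i)) / max(a(i), b(i))."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    classes = np.unique(labels)
    # pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sils = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        s_vals = []
        for i in idx:
            same = idx[idx != i]
            a = dist[i, same].mean() if len(same) else 0.0   # within-class tightness
            b = min(dist[i, labels == o].mean()              # nearest other class
                    for o in classes if o != c)
            s_vals.append((b - a) / max(a, b))
        sils.append(np.mean(s_vals))                         # local index S_j
    return float(np.mean(sils))                              # global index GS
```

For two tight, well-separated clusters the index approaches 1; heavily overlapping clusters push it toward 0 or below.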
Fig. 4 shows the procedure of the adaptive
affinity propagation clustering method. The largest
global silhouette index indicates the best clustering
quality and the optimal number of clusters [19]. A
series of Sil values corresponding to clustering
results with different numbers of clusters is
calculated, and the optimal clustering result is the
one for which Sil is largest.

Fig. 4: The procedure of adaptive affinity
propagation clustering
4.1.3 Adaptive Affinity Propagation Document
Clustering
This section discusses adaptive affinity
propagation document clustering, which applies the
adaptive affinity propagation algorithm to
clustering documents, combined with the Vector
Space Model (VSM). The Vector Space Model is the
most common of the many models for representing
documents. In VSM, every document is represented as
a vector:
V(d) = (t1, w1(d); t2, w2(d); …; tm, wm(d)), where
ti is the i-th term and wi(d) is the weight of ti in
the document d. The most widely used weighting
scheme is Term Frequency with Inverse Document
Frequency (TF-IDF).

Algorithm 3: Adaptive Affinity Propagation
Clustering
Input: s(i,k): the similarity between point i and
point k (i ≠ k)
Output: the clustering result
Step 1: Apply the Preference Range algorithm to
compute the range of preferences [pmin, pmax].
Step 2: Initialize the preference:
preference = pmin − pstep.
Step 3: Update the preference:
preference = preference + pstep.
Step 4: Apply the Affinity Propagation algorithm to
generate K clusters.
Step 5: Terminate when Sil is largest; otherwise go
to step 3.
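Putting the pieces together, a minimal sketch of VSM/TF-IDF document clustering with a preference scan might look as follows; the toy corpus, the scan grid, and the use of scikit-learn's AffinityPropagation and silhouette_score are illustrative stand-ins for the paper's procedure:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus with two obvious topics.
docs = [
    "apple banana fruit juice", "banana apple smoothie fruit",
    "fruit juice apple banana",
    "football goal match referee", "referee match football score",
    "goal score match football",
]

S = cosine_similarity(TfidfVectorizer().fit_transform(docs))   # VSM + TF-IDF
D = 1.0 - S                                                    # cosine distance
np.fill_diagonal(D, 0.0)                                       # exact zero diagonal

best_score, best_labels = -1.0, None
# Crude preference scan between the minimum and median similarity.
for p in np.linspace(S.min(), np.median(S), 5):
    ap = AffinityPropagation(affinity="precomputed", preference=p,
                             damping=0.9, max_iter=500, random_state=0)
    labels = ap.fit_predict(S)
    if 1 < len(set(labels)) < len(docs):                       # silhouette needs 2..n-1
        score = silhouette_score(D, labels, metric="precomputed")
        if score > best_score:
            best_score, best_labels = score, labels
```

Here the silhouette plays the role of the Sil validity index of Algorithm 3: the clustering with the largest score over the scanned preferences is kept.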
5.1 Partition Adaptive Affinity Propagation
Affinity propagation exhibits fast execution
speed and finds clusters with a low error rate when
clustering sparsely related data, but the values of
its parameters are fixed. Partition adaptive affinity
propagation can automatically eliminate oscillations
and adjust the values of parameters when rerunning
the affinity propagation procedure, yielding optimal
clustering results with high execution speed and
precision [20].
The premise is that both AP and AAP are
message-passing processes between data points in a
dense matrix, so the time spent is in direct
proportion to the number of iterations. During each
iteration of AP, each element r(i,k) of the
responsibility matrix must be calculated once, and
each calculation involves N − 1 elements, where N is
the size of the input similarity matrix, according to
Eq. (2); each element of the availability matrix is
calculated in the same way. During an iteration of
AAP, the convergence of K is detected, but the
execution speed is still much lower than AP.
This leads to partition adaptive affinity
propagation (PAAP), a modified algorithm that
eliminates oscillations using the method of AAP. The
adaptive technique consists of two parts: fine
adjustment, which decreases the value of the
parameter "preference" slowly, and coarse
adjustment, which decreases it rapidly. When
executing the method, the original similarity matrix
is decomposed into sub-matrices to gain higher
execution speed [21], [22]. PAAP can yield optimal
clustering solutions on both dense and sparse
datasets.
Assume that Cmax is the expected maximal number of
clusters, Cmin the expected minimal number of
clusters, K(i) the number of clusters in the i-th
iteration, and maxits the maximal number of
iterations; λstep and pstep are the adaptive factors
as in AAP. The PAAP algorithm goes as follows:
Algorithm PAAP:
1. Execute the AP procedure and get the number of
clusters K(i).
2. If K(i) ≤ K(i + 1), go to step 4; else set
count = 0 and go to step 3.
3. λ ← λ + λstep, then go to step 1. If λ > 0.85,
then p ← p + pstep and s(i,i) ← p; else go to
step 1.
4. If |Cmax − K(i)| > CK, then
Astep ← −20 · |K(i) − Cmin| and go to step 6.
Else, delay 10 iterations and then go to step 5.
5. If K(i) ≤ K(i + 1), then count ← count + 1 and
Astep ← count · pstep; go to step 6. Else, go to
step 1.
6. p ← p + Astep, then s(i,i) ← p.
7. If i = maxits or K(i) ≤ Cmin, the algorithm
terminates; else, go to step 1.
PAAP can find the true or a better number of
clusters with high execution speed on dense or
sparse datasets; meanwhile, it can automatically
detect and eliminate oscillations in the cluster
number. This verifies that both the acceleration
technique and the partition technique are effective.
If Kparts and Astep (the acceleration factor) are
well chosen, the average number of iterations can be
reduced effectively.
5.1.2 Advantages of Partition Adaptive Affinity
Propagation
1. PAAP is an improved approach based on affinity
propagation; it can automatically escape from
oscillations and adjust the values of the
parameters λ and p.
III. CONCLUSION
In this survey, various clustering
approaches and algorithms in document clustering
are described. A new clustering algorithm which
combines Affinity Propagation with semi-supervised
learning, i.e., the Seeds Affinity Propagation
algorithm, is presented. In comparison with the
classical clustering algorithm k-means, SAP not
only improves the accuracy and reduces the
computational complexity of text clustering but also
effectively avoids random initialization and being
trapped in local minima. Incremental Affinity
Propagation integrates the AP algorithm and
semi-supervised learning; the labeled data
information is coded into the similarity matrix. The
Adaptive Affinity Propagation algorithm first
computes the range of preferences and then searches
the space to find the value of the preference which
generates the optimal clustering results, in
contrast to the AP approach, which cannot yield
optimal clustering results because it sets the
preferences to the median of the similarities.
The area of document clustering has many issues
which still need to be solved. We hope the paper
gives interested readers a broad overview of the
existing techniques. As future work, improvements
over the existing systems, with better results
offering new information representation capabilities
with different techniques, can be attempted.
REFERENCES
[1] S. Deelers and S. Auwatanamongkol, “Enhancing
K-Means Algorithm with Initial Cluster Centers
Derived from Data Partitioning along the Data
Axis with the Highest Variance,” International
Journal of Electrical and Computer Engineering,
2:4, 2007.
[2] B.J. Frey and D. Dueck, “Clustering by
Passing Messages between Data Points,”
Science, vol. 315, no. 5814, pp. 972-976,
Feb. 2007.
[3] B.J. Frey and D. Dueck, “Non-Metric
Affinity Propagation for Un-Supervised
Image Categorization,” Proc. 11th IEEE
Int’l Conf. Computer Vision (ICCV ’07),
pp. 1-8, Oct. 2007.
[4] L. Michele, Sumedha, and W. Martin,
“Clustering by Soft- Constraint Affinity
Propagation Applications to Gene-
Expression Data,” Bioinformatics, vol. 23,
no. 20, pp. 2708-2715, Sept. 2007.
[5] T.Y. Jiang and A. Tuzhilin, “Dynamic
Micro Targeting: Fitness- Based Approach
to Predicting Individual Preferences,”
Proc. Seventh IEEE Int’l Conf. Data
Mining (ICDM ’07), pp. 173-182, Oct.
2007.
[6] F. Sebastiani, “Machine Learning in
Automated Text Categorization,” ACM
Computing Surveys, vol. 34, pp. 1-47,
2002.
[7] F. Wang and C.S. Zhang, “Label
Propagation through Linear
Neighbourhoods,” IEEE Trans. Knowledge
and Data Eng., vol. 20, no. 1, pp. 55-67,
Jan. 2008.
[8] Z.H. Zhou and M. Li, “Semi-Supervised
Regression with Co- Training Style
Algorithms,” IEEE Trans. Knowledge and
Data Eng., vol. 19, no. 11, pp. 1479-1493,
Aug. 2007.
[9] A. Blum and T. Mitchell, “Combining
Labeled and Unlabeled Data with
Co-Training,” Proc. 11th Ann. Conf.
Computational Learning Theory, pp. 92-100,
1998.
[10] Z.H. Zhou, D.C. Zhan, and Q. Yang,
“Semi-Supervised Learning with Very Few
Labeled Training Examples,” Proc. 22nd
AAAI Conf. Artificial Intelligence, pp.
675-680, 2007
[11] Renchu Guan, Xiaohu Shi, Maurizio
Marchese, Chen Yang, and Yanchun Liang,
“Text Clustering with Seeds Affinity
Propagation,” IEEE Trans. Knowledge and
Data Eng., vol. 23, no. 4, Apr. 2011.
[12] H.F. Ma, X.H. Fan, and J. Chen, “An
Incremental Chinese Text Classification
Algorithm Based on Quick Clustering,”
Proc. 2008 Int’l Symp. Information
Processing (ISIP ’08), pp. 308- 312, May
2008.
[13] Y. Xiao, and J. Yu, "Semi-Supervised
Clustering Based on Affinity Propagation,
" Journal of Software, Vol. 19, No. 11,
November 2008, pp. 2803-2813.
[14] C. J. van Rijsbergen, Information Retrieval,
2nd edition, Butterworth, London, pp. 23-
28, 1979.
[15] K.J. Wang, J.Y. Zhang, D. Li, X.N. Zhang,
and T. Guo, “Adaptive Affinity
Propagation Clustering,” Acta Automatica
Sinica, vol. 33, no. 12, pp. 1242-1246, Dec.
2007
[16] FAQ of Affinity Propagation Clustering:
http://www.psi.toronto.edu/affinitypropagat
ion/faq.html
[17] K.J. Wang, J.Y. Zhang, and D. Li,
“Adaptive Affinity Propagation
Clustering,” Acta Automatica Sinica,
33(12): 1242-1246, 2007.
[18] P.J. Rousseeuw, Silhouettes: "a graphical
aid to the interpretation and validation of
cluster analysis", Computational and
Applied Mathematics. (20),53-65, 1987
[19] S. Dudoit, J. Fridlyand. "A prediction-
based resampling method for estimating
the number of clusters in a dataset".
Genome Biology,3(7): 0036.1-0036.21,
2002
[20] Changyin Sun, Chenghong Wang, Su
Song,Yifan Wang “A Local Approach of
Adaptive Affinity Propagation Clustering
for Large Scale Data” Proceedings of
International Joint Conference on Neural
Networks, Atlanta, Georgia, USA, June 14-
19, 2009
[21] Guha, S., Rastogi, R., Shim, K., "CURE:
an efficient clustering algorithm for large
databases," Inf.Syst., 26(1): 35-58, 2001.
[22] Ding-yin Xia, Fei Wu, Xu-qing Zhang, and
Yue-ting Zhuang, “Local and Global
Approaches of Affinity Propagation
Clustering for Large Scale Data,”
J. Zhejiang Univ. Sci. A, pp. 1373-1381,
2008.