This document summarizes a research paper that evaluates cluster quality using a modified density subspace clustering approach. It discusses how density subspace clustering can be used to identify clusters in high-dimensional datasets by detecting density-connected clusters in all subspaces. The proposed approach uses a density subspace clustering algorithm to select attribute subsets and identify the best clusters. It then calculates intra-cluster and inter-cluster distances to evaluate cluster quality and compares the results to other clustering algorithms in terms of accuracy and runtime. Experimental results showed that the proposed method improves clustering quality and performs faster than existing techniques.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in the journal can be accessed online.
Classification of common clustering algorithms and techniques, e.g., hierarchical clustering, distance measures, K-means, squared error, SOFM, clustering large databases.
IJERA (International Journal of Engineering Research and Applications) is an international, online, ... peer reviewed journal. For more details, or to submit your article, please visit www.ijera.com
Clustering is a data mining technique used to place data elements into their related groups. It is the process of partitioning the data (or objects) into classes such that the data in one class are more similar to each other than to those in other clusters.
Ensemble based Distributed K-Modes Clustering
Clustering has been recognized as the unsupervised classification of data items into groups. Due to the explosion in the number of autonomous data sources, there is an emergent need for effective approaches to distributed clustering. A distributed clustering algorithm clusters distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and speed in clustering large datasets, but it cannot directly handle datasets with categorical attributes, which generally occur in real-life data. Huang proposed the K-Modes clustering algorithm by introducing a new dissimilarity measure to cluster categorical data. This algorithm replaces the means of clusters with a frequency-based method that updates modes during the clustering process to minimize the cost function. Most of the distributed clustering algorithms found in the literature seek to cluster numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited to handle categorical data sets as well as to perform the distributed clustering process in an asynchronous manner. The performance of the proposed algorithm is compared with existing distributed K-Means clustering algorithms and a K-Modes based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI machine learning repository.
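Huang's K-Modes building blocks described above, the simple matching dissimilarity and the frequency-based mode update, can be sketched as follows (a minimal illustration with made-up categorical tuples, not the paper's implementation):

```python
from collections import Counter

def mismatch_distance(x, mode):
    """Huang's simple matching dissimilarity: the number of attributes
    on which the object disagrees with the cluster mode."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def update_mode(cluster):
    """Frequency-based mode update: per attribute, keep the most
    frequent category among the cluster's members."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
mode = update_mode(cluster)                        # ("red", "small")
print(mismatch_distance(("blue", "large"), mode))  # 2
```

Iterating assignment by `mismatch_distance` and `update_mode` until the modes stop changing is the K-Modes analogue of the K-Means loop.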
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
Subspace clustering discovers the clusters embedded in multiple, overlapping subspaces of high-dimensional data. Many significant subspace clustering algorithms exist, each having different characteristics caused by the different techniques, assumptions and heuristics used. A comprehensive classification scheme is essential which considers all such characteristics to divide subspace clustering approaches into families. The algorithms belonging to the same family satisfy common characteristics. Such a categorization will help future developers better understand which quality criteria to use and which similar algorithms to compare against their proposed clustering algorithms. In this paper, we first propose the concept of SCAF (Subspace Clustering Algorithms' Family). The characteristics of a SCAF are based on classes such as cluster orientation and overlap of dimensions. As an illustration, we further provide a comprehensive, systematic description and comparison of a few significant algorithms belonging to the "Axis parallel, overlapping, density based" SCAF.
Multilevel techniques for the clustering problem
Data mining is concerned with the discovery of interesting patterns and knowledge in data repositories. Cluster analysis, which belongs to the core methods of data mining, is the process of discovering homogeneous groups called clusters. Given a data set and some measure of similarity between data objects, the goal in most clustering algorithms is to maximize both the homogeneity within each cluster and the heterogeneity between different clusters. In this work, two multilevel algorithms for the clustering problem are introduced. The multilevel paradigm suggests looking at the clustering problem as a hierarchical optimization process going through different levels, evolving from a coarse-grain to a fine-grain strategy. The clustering problem is solved by first reducing the problem, level by level, to a coarser problem where an initial clustering is computed. The clustering of the coarser problem is then mapped back level by level to obtain a better clustering of the original problem, by refining the intermediate clusterings obtained at the various levels. A benchmark using a number of data sets collected from a variety of domains is used to compare the effectiveness of the hierarchical approach against its single-level counterpart.
Clustering Using Shared Reference Points Algorithm Based On a Sound Data Model
A novel clustering algorithm, CSHARP, is presented for finding clusters of arbitrary shapes and arbitrary densities in high-dimensional feature spaces. It can be considered a variation of the Shared Nearest Neighbor (SNN) algorithm, in which each sample data point votes for the points in its k-nearest neighborhood. Sets of points sharing a common mutual nearest neighbor are considered dense regions, or blocks. These blocks are the seeds from which clusters may grow; CSHARP is therefore not a point-to-point but a block-to-block clustering technique. Many of its advantages come from two facts: noise points and outliers correspond to blocks of small sizes, and homogeneous blocks highly overlap. The technique is not prone to merging clusters of different densities or different homogeneity. The algorithm has been applied to a variety of low- and high-dimensional data sets with superior results over existing techniques such as DBScan, K-means, Chameleon, Mitosis and Spectral Clustering. The quality of its results, as well as its time complexity, rank it at the front of these techniques.
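The shared-neighborhood voting that CSHARP builds on can be illustrated with a plain shared-nearest-neighbor similarity (a brute-force sketch of the SNN idea only; CSHARP's block-to-block machinery is not reproduced here):

```python
import numpy as np

def knn_indices(X, k):
    """Index set of each point's k nearest neighbours (excluding itself)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point never neighbours itself
    return [set(np.argsort(row)[:k]) for row in d]

def snn_similarity(X, k):
    """Shared-nearest-neighbour similarity: the size of the overlap of
    two points' k-neighbourhoods (the 'votes' the two points share)."""
    nbrs = knn_indices(X, k)
    n = len(X)
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = len(nbrs[i] & nbrs[j])
    return S
```

Points in the same dense region share many neighbours and get a high SNN score, while points from different regions share none, which is what makes the measure robust across clusters of different densities.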
A survey on Efficient Enhanced K-Means Clustering Algorithm
Data mining is the process of using technology to identify patterns and prospects in large amounts of information. Within data mining, clustering is an important research topic with a wide range of unsupervised classification applications. Clustering is a technique which divides data into meaningful groups. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. In this paper, we present a comparison of different K-means clustering algorithms.
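The baseline procedure the survey's variants build on can be sketched with Lloyd's algorithm (a minimal NumPy version, not any of the surveyed enhancements):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: assign each observation to the cluster with
    the nearest mean, recompute the means, and repeat until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]  # random initial means
    for _ in range(iters):
        # squared Euclidean distance from every point to every centre
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

The surveyed "enhanced" variants mostly differ in how the initial centres are chosen and how distances are recomputed across iterations; the assign/update loop itself stays as above.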
automatic classification in information retrieval
automatic classification in information retrieval-automatic classification of documents
Chapter 3 from IR_VAN_Book
INFORMATION RETRIEVAL
C. J. van RIJSBERGEN B.Sc., Ph.D., M.B.C.S.
Textual Data Partitioning with Relationship and Discriminative Analysis
Data partitioning methods are used to partition data values by similarity, and similarity measures are used to estimate transaction relationships. A hierarchical clustering model produces tree-structured results, while partitioned clustering produces results in a grid format. Text documents are unstructured data values with high-dimensional attributes. Document clustering groups unlabeled text documents into meaningful clusters. Traditional clustering methods require the cluster count (K) for the document grouping process, and clustering accuracy degrades drastically with an unsuitable cluster count.

Textual data elements are divided into two types: discriminative words and non-discriminative words. Only discriminative words are useful for grouping documents; the involvement of non-discriminative words confuses the clustering process and leads to a poor clustering solution. A variational inference algorithm is used to infer the document collection structure and the partition of document words at the same time. A Dirichlet Process Mixture (DPM) model is used to partition documents; the DPM clustering model uses both the data likelihood and the clustering property of the Dirichlet Process (DP). The Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without requiring the number of clusters as input.

Document labels are used to estimate the discriminative word identification process. Concept relationships are analyzed with ontology support, and a semantic weight model is used for the document similarity analysis. The system improves scalability with the support of labels and concept relations for the dimensionality reduction process.
A SURVEY ON OPTIMIZATION APPROACHES TO TEXT DOCUMENT CLUSTERING
Text document clustering is one of the fastest growing research areas because of the availability of a huge amount of information in electronic form. Several techniques have been proposed for clustering documents in such a way that documents within a cluster have high intra-similarity and low inter-similarity to other clusters. Many document clustering algorithms provide localized search for effectively navigating, summarizing, and organizing information. A globally optimal solution can be obtained by applying high-speed, high-quality optimization algorithms, which perform a globalized search in the entire solution space. In this paper, a brief survey of optimization approaches to text document clustering is carried out.
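The intra-similarity/inter-similarity criterion mentioned above can be made concrete with average pairwise cosine similarity over toy term vectors (an illustrative sketch; the vectors are made up, not from any surveyed algorithm):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two term-frequency vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_intra(cluster):
    """Mean pairwise cosine similarity inside one cluster (should be high)."""
    sims = [cos(cluster[i], cluster[j])
            for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return sum(sims) / len(sims)

def avg_inter(c1, c2):
    """Mean cosine similarity across two clusters (should be low)."""
    sims = [cos(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)
```

Optimization-based document clustering methods typically search for a partition that maximizes some combination of these two quantities over the whole solution space.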
Data mining is used to manage the huge amount of information stored in data warehouses and databases in order to discover the required information. Many data mining techniques have been proposed, for example association rules, decision trees, neural networks, and clustering, and the field has been a point of attention for many years. Among the available data mining strategies, clustering is one of the most effective: it groups the dataset into a number of clusters based on certain predefined guidelines, and can discover the connections between the different characteristics of the data.

In the k-means clustering algorithm, features are selected on the basis of their relevance for predicting the data, and the Euclidean distance between the centroid of a cluster and the data objects outside the cluster is computed when clustering the data points. In this work, the author enhanced the Euclidean distance formula to increase cluster quality.

The problem of accuracy, and of redundant dissimilar points inside the clusters, remains in the improved k-means; a new enhanced approach is therefore proposed which uses a similarity function to check the similarity level of a point before including it in a cluster.
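The similarity check described above might look like the following sketch (the 1/(1+d) similarity function and the threshold `tau` are assumptions chosen for illustration, not the paper's actual formula):

```python
import numpy as np

def assign_with_similarity(X, centers, tau=0.5):
    """Assign each point to its nearest centre by Euclidean distance,
    but keep it only if a similarity check passes; otherwise flag -1.
    Similarity here is 1/(1 + distance), an illustrative choice."""
    labels = []
    for x in X:
        d = np.linalg.norm(centers - x, axis=1)   # distance to every centre
        j = int(np.argmin(d))                     # nearest centre
        sim = 1.0 / (1.0 + d[j])
        labels.append(j if sim >= tau else -1)    # -1: too dissimilar to join
    return labels
```

Filtering on similarity before membership is what keeps dissimilar (potentially redundant or outlying) points from degrading a cluster.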
An Efficient Clustering Method for Aggregation on Data Fragments
Clustering is an important step in the process of data analysis, with applications in numerous fields. Clustering ensembles have emerged as a powerful technique for combining different clustering results to obtain a quality clustering. Existing clustering aggregation algorithms are applied directly to the data points and are inefficient when the number of points is large. This project defines an efficient approach to clustering aggregation based on data fragments, where a data fragment is any subset of the data. To increase efficiency, clustering aggregation is performed directly on the data fragments under a comparison measure and normalized mutual information measures, and enhanced versions of the clustering aggregation algorithms (Agglomerative, Furthest, and Local Search) are described, which keep computational complexity minimal while increasing accuracy.
UNIT - 4: Data Warehousing and Data Mining
UNIT-IV
Cluster Analysis: Types of Data in Cluster Analysis – A Categorization of Major Clustering Methods – Partitioning Methods – Hierarchical Methods – Density-Based Methods – Grid-Based Methods – Model-Based Clustering Methods – Clustering High-Dimensional Data – Constraint-Based Cluster Analysis – Outlier Analysis.
A Survey on Constellation Based Attribute Selection Method for High Dimension...
Attribute selection is an important topic in data mining, because it is an effective way of reducing dimensionality, removing irrelevant and redundant data, and increasing the accuracy of the data. It is the process of identifying a subset of the most useful attributes that produces results compatible with the original entire attribute set. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group, called a cluster, are more similar in some sense to each other than to those in other groups (clusters). There are various approaches and techniques for attribute subset selection, namely the wrapper approach, the filter approach, the Relief algorithm, distributional clustering, etc., but each has disadvantages such as an inability to handle large volumes of data, computational complexity, unguaranteed accuracy, and difficulty with evaluation and redundancy detection. To get the upper hand on some of these issues in attribute selection, this paper proposes a technique that aims to design an effective clustering-based attribute selection method for high-dimensional data. Initially, attributes are divided into clusters using a graph-based clustering method such as a minimum spanning tree (MST). In the second step, the most representative attribute, the one most strongly related to the target classes, is selected from each cluster to form a subset of attributes. The purpose is to increase accuracy, reduce dimensionality, shorten training time and improve generalization by reducing overfitting.
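The first step, grouping attributes via a minimum spanning tree, can be sketched as follows. This is a simplified illustration: attributes are joined by an MST over 1 - |correlation| edge weights and split into k groups by removing the heaviest MST edges (the weight choice and the cut rule are assumptions, not the surveyed method's exact procedure):

```python
import numpy as np

def mst_edges(W):
    """Prim's algorithm on a dense symmetric weight matrix.
    Returns the n-1 edges (i, j, weight) of a minimum spanning tree."""
    n = len(W)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or W[i][j] < best[2]):
                    best = (i, j, W[i][j])
        in_tree.add(best[1])
        edges.append(best)
    return edges

def attribute_clusters(X, k):
    """Cluster the attributes (columns of X): edge weight = 1 - |correlation|,
    build an MST, drop the k-1 heaviest edges, return the components."""
    W = 1.0 - np.abs(np.corrcoef(X, rowvar=False))
    edges = sorted(mst_edges(W), key=lambda e: e[2])
    if k > 1:
        edges = edges[:-(k - 1)]          # cutting heavy edges splits the tree
    parent = list(range(X.shape[1]))      # union-find over remaining edges
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for i, j, _ in edges:
        parent[find(i)] = find(j)
    groups = {}
    for a in range(X.shape[1]):
        groups.setdefault(find(a), []).append(a)
    return list(groups.values())
```

Each resulting group holds mutually correlated attributes, from which one representative would then be picked to form the reduced attribute subset.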
V. Kavitha, Dr. M. Punithavalli / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp. 1627-1633

Evaluating Cluster Quality Using Modified Density Subspace Clustering Approach

V. Kavitha 1, Dr. M. Punithavalli 2
1 Research Scholar, Dept. of Computer Science, Karpagam University, Coimbatore, India.
2 Director, Dept. of MCA, Sri Ramakrishna College of Engineering, Coimbatore, India.
Abstract
Clustering of time series data faces the curse of dimensionality, since real-world data consist of many dimensions. Finding clusters in feature space, subspace clustering, is a growing task. A density-based approach is used to identify clusters in dimensional point sets; density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data for clustering time series data streams. Multidimensional data clustering and its evaluation can be done through a density-based approach. In the proposed approach, a density subspace clustering algorithm is used to find the best clustering result from the dataset. The algorithm selects the P set of attributes from the dataset, then applies density clustering to the selected attributes. From the resulting clusters, the intra-cluster and inter-cluster distances are calculated. Measuring the similarity between data objects in sparse and high-dimensional data plays a very important role in the success or failure of a clustering method; the similarity between data points is evaluated, and new criterion functions for clustering are formulated accordingly, improving accuracy. The proposed algorithm also addresses the density divergence problem (outliers). The clustering results of the proposed system are compared with existing clustering algorithms in terms of execution time and cluster quality. Experimental results show that the proposed system improves the clustering quality and takes less time than the existing clustering methods.

Keywords— Density Subspace Clustering, Intra Cluster, Inter Cluster, Outlier, Hierarchical Clustering.

I. INTRODUCTION
Clustering is one of the most important tasks in the data mining process for discovering similar groups and identifying interesting distributions and patterns in the underlying data. The time series data stream clustering problem is about partitioning a given data set into groups such that the data points in a cluster are more similar to each other than to points in different clusters. Cluster analysis [3] aims at identifying groups of similar objects, and helps to discover the distribution of patterns and interesting correlations in large data sets. It has been a subject of wide research since it arises in many application domains in engineering, business and the social sciences. Especially in recent years, the availability of huge transactional and experimental data sets, and the resulting requirements for data mining, have created a need for clustering algorithms that scale and can be applied in diverse domains.

The clustering of time series data streams has become one of the important problems in the data mining domain. Most traditional algorithms cannot support the fast arrival of large amounts of stream data. Among traditional algorithms, a few one-pass clustering algorithms have been proposed for the data stream problem. These methods address the scalability issues of the clustering problem and the evolution of the data in the result, but do not address the following issues:

(1) The quality of the clusters is poor when the data evolves considerably over time.

(2) A data stream clustering algorithm requires much greater functionality to discover accurate results and explore clusters over different portions of the stream.

Charu C. Aggarwal [5] proposed a micro-clustering phase in the online statistical data collection portion of the algorithm. This method is not dependent on any user input such as the time horizon or the required granularity of the clustering process. The aim of the method is to maintain statistics at a sufficiently high level of granularity to be used by the online components, such as horizon-specific macro-clustering as well as evolution analysis.

The clustering of time series data streams and incremental models require decisions before all the data are available in the dataset. The models are not identical to find the best clustering result.

Finding clusters in the feature space, subspace clustering, is an emergent task. Clustering points with a dissimilarity measure is a robust method to handle large amounts of data and able to estimate the
1627 | P a g e
V.Kavitha, Dr.M.Punithavalli / International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 3, Issue 1, January-February 2013, pp.1627-1633
number of clusters automatically while avoiding overlap. Density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data. In our proposed approach, a density subspace clustering algorithm is used to find the best cluster result from the dataset: the algorithm selects a set P of attributes from the dataset, applies density clustering to the selected attributes, and calculates the intra-cluster and inter-cluster distances from the resulting clusters. The method finds the best cluster distance and repeats these steps until all the attributes in the dataset are covered by the clustering result, finally yielding the best cluster result.

Clustering methods should support estimating a data distribution for newly arriving data and should be able to generate cluster boundaries of arbitrary shape and size efficiently. Density-based clustering with a dynamic dissimilarity measure based on a dynamical system has been associated with density-estimating functions. The theoretical basics of the proposed measure are developed and applied to construct a clustering method that can efficiently partition the whole data space of the dataset. Clustering based on a density-based dissimilarity measure is robust enough to handle large amounts of data and is able to estimate the number of clusters automatically while avoiding overlap. The dissimilarity values are evaluated, and the clustering process is carried out with the density values.

Similarity measures that take the context of the features into consideration have also been employed, but they refer to continuous data, e.g., the Mahalanobis distance. Dino Ienco proposed a context-based distance for categorical attributes. The idea of that work is that the distance between two values of a categorical attribute Ai can be determined by the way the values of the other attributes Aj are distributed over the dataset objects: if they are similarly distributed over the groups of data objects corresponding to the distinct values of Ai, a low distance value is obtained. The author also proposes a solution to the critical choice of the attributes Aj. The result was validated on various real-world and synthetic datasets, by embedding the distance learning method in both a partitional and a hierarchical clustering algorithm.

II. RELATED WORK
This work is based on a hierarchical approach, so the process is an incremental clustering process.

A. Hierarchical Clustering
A hierarchical clustering algorithm creates a hierarchical decomposition of the given set of data objects. Depending on the decomposition approach, hierarchical algorithms are classified as agglomerative (merging) or divisive (splitting). The agglomerative approach starts with each data point in a separate cluster, or with a certain large number of clusters. Each step merges the two clusters that are the most similar, so after each step the total number of clusters decreases; this is repeated until the desired number of clusters is obtained or only one cluster remains. By contrast, the divisive approach starts with all data objects in the same cluster; in each step, one cluster is split into smaller clusters, until a termination condition holds. Agglomerative algorithms are more widely used in practice, and thus the similarities between clusters are more thoroughly researched.

Hierarchical divisive methods generate a classification in a top-down manner, by progressively sub-dividing the single cluster which represents the entire dataset. Monothetic (divisions based on just a single descriptor) hierarchical divisive methods are generally much faster in operation than the corresponding polythetic (divisions based on all descriptors) hierarchical divisive and hierarchical agglomerative methods, but they tend to give poor results. Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering is:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

B. Non Hierarchical Clustering
A non-hierarchical method generates a classification by partitioning a dataset, giving a set of (generally) non-overlapping groups having no hierarchical relationships between them. A systematic evaluation of all possible partitions is quite infeasible, and many different heuristics have thus been described to allow the identification of good, but possibly sub-optimal, partitions. Non-hierarchical methods are generally much less demanding of computational resources than the hierarchic methods, since only a single partition of the dataset has to be formed.
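The agglomerative steps listed above (Section A) can be sketched in a few lines of Python. This is a minimal single-linkage illustration; the function and variable names are ours, not from the paper:

```python
from itertools import combinations

def single_linkage(points, dist):
    """Naive agglomerative clustering: start with one cluster per item
    (step 1), then repeatedly merge the closest pair (steps 2-4).
    Returns the merge history as (cluster_a, cluster_b, level)."""
    clusters = {i: [i] for i in range(len(points))}
    history = []
    while len(clusters) > 1:
        # Step 2: find the closest pair under the single-linkage distance.
        link = lambda a, b: min(dist(points[i], points[j])
                                for i in clusters[a] for j in clusters[b])
        a, b = min(combinations(clusters, 2), key=lambda p: link(*p))
        history.append((a, b, link(a, b)))
        # Step 3: merge b into a; distances to the merged cluster are
        # recomputed by link() on the next iteration.
        clusters[a].extend(clusters.pop(b))
    return history

# Usage: four 1-D points form two tight pairs, which merge first.
merges = single_linkage([0.0, 0.1, 5.0, 5.2], lambda x, y: abs(x - y))
```

A real implementation would update the N*N distance matrix in place (step 3) instead of rescanning all point pairs, but the merge order is the same.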
Three of the main categories of non-hierarchical method are single-pass, relocation and nearest neighbour:
- Single-pass methods (e.g. Leader) produce clusters that depend upon the order in which the compounds are processed, and so will not be considered further;
- Relocation methods, such as k-means, assign compounds to a user-defined number of seed clusters and then iteratively reassign compounds to see if better clusters result. Such methods are prone to reaching local optima rather than a global optimum, and it is generally not possible to determine when or whether the global optimum solution has been reached;
- Nearest neighbour methods, such as the Jarvis-Patrick method, assign compounds to the same cluster as some number of their nearest neighbours. User-defined parameters determine how many nearest neighbours need to be considered, and the necessary level of similarity between nearest-neighbour lists.

III. METHODOLOGY
This section discusses the existing ODAC system and the proposed IHCA and MDSC algorithms.

A. Online Divisive Agglomerative Clustering
Online Divisive-Agglomerative Clustering (ODAC) is an incremental approach for clustering streaming time series using a hierarchical procedure over time. It constructs a tree-like hierarchy of clusters of streams, using a top-down strategy based on the correlation between streams. The system also possesses an agglomerative phase to enhance a dynamic behaviour capable of structural change detection. ODAC is a variable clustering algorithm that constructs a hierarchical tree-shaped structure of clusters using a top-down strategy. The leaves are the resulting clusters, with a set of variables at each leaf; the union of the leaves is the complete set of variables, and the intersection of the leaves is the empty set. The system encloses an incremental distance measure and executes procedures for expansion and aggregation of the tree-based structure, based on the diameters of the clusters. The main setting of the system is the monitoring of the existing clusters' diameters. In a divisive hierarchical structure of clusters, considering stationary data streams, the overall intra-cluster dissimilarity should decrease with each split. For each existing cluster, the system finds the two variables defining the diameter of that cluster. If a given heuristic condition is met on this diameter, the system splits the cluster and assigns each of the chosen variables to one of the new clusters, each becoming the pivot variable for its cluster. Afterwards, all remaining variables of the old cluster are assigned to the new cluster with the closest pivot. New leaves start new statistics, on the assumption that only forthcoming information will be useful to decide whether or not the cluster should be split again. This feature increases the system's ability to cope with changing concepts: later on, a test is performed such that if the diameters of the child leaves approach the parent's diameter, the previously taken decision may no longer reflect the structure of the data, so the system re-aggregates the leaves on the parent node and restarts the statistics. We propose our work to solve the density-based subspace problem by comparing the different densities at each of the subspace attributes, for each time series data stream.

B. Disadvantages of the existing system
- The split decision used in the algorithm focuses only on measuring the distance between the two groups, which implies a high risk when solving density problems at different densities.
- A change of density at sub-attribute values changes both the intra-cluster and inter-cluster distances.

C. IHCA (Improved Hierarchical Clustering Algorithm)
The Improved Hierarchical Clustering Algorithm (IHCA) is an algorithm for incremental clustering of streaming time sequences. It constructs a hierarchical tree-shaped structure of clusters using a top-down strategy. The leaves are the resulting clusters, with each leaf grouping a set of variables. The system includes an incremental distance measure and executes procedures for expansion and aggregation of the tree-based structure. The system monitors the flow of continuous time series data. A time interval is fixed, and within this interval the data points are partitioned; for each partition the diameter, i.e. the maximum distance between two points, is calculated. Each data point of the partition is compared with the diameter value: if the data point exceeds the diameter value the split process is executed, otherwise the aggregate (merge) process is performed. Based on these criteria the hierarchical tree grows. The splitting process has to be observed carefully, because splitting decides the growth of the clusters. In the proposed technique the Hoeffding bound is used to observe the splitting process, and in IHCA the Vapnik-Chervonenkis inequality is used for the splitting process. Using this technique the observation of the splitting process is improved, so the clusters are grouped properly.
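The Hoeffding-bound split test mentioned above can be made concrete with a short Python sketch. The split criterion shown (split when the gap between the two largest observed dissimilarities exceeds the bound) is our simplified reading of the ODAC-style test, not the authors' code, and the parameter values are illustrative:

```python
import math

def hoeffding_bound(R, delta, n):
    """After n independent observations of a random variable with
    range R, the true mean lies within eps of the sample mean with
    probability 1 - delta."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(d1, d2, R=1.0, delta=0.05, n=1000):
    """Split a cluster only when the gap between the two largest
    observed dissimilarities (d1 >= d2) exceeds the Hoeffding bound,
    i.e. the gap is statistically meaningful rather than noise."""
    return (d1 - d2) > hoeffding_bound(R, delta, n)

# With R=1, delta=0.05, n=1000 the bound is roughly 0.039, so a gap
# of 0.10 triggers a split while a gap of 0.01 does not.
```

As n grows the bound shrinks, so the test becomes more willing to split: a persistent small gap eventually becomes significant.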
In the Hoeffding bound,

    ε = √( R² ln(1/δ) / 2n )        (1)

which states that after n independent observations of a real-valued random variable r with range R, the true mean of r lies within ε of the observed mean with confidence 1 − δ. In the proposed algorithm, the range value will be increased from R² [1] to R^N [2], so the observation process is not fixed: depending on the number of nodes, the system generates the observation process.

D. Modified Density Subspace Clustering
Instead of finding clusters in the full feature space, subspace clustering is an emergent task which aims at detecting clusters embedded in subspaces. A cluster corresponds to a high-density region in a subspace, and identifying the dense regions is a major problem. Data points that fall outside the cluster range give rise to what is called "the density divergence problem". We propose a novel modified subspace clustering algorithm to discover clusters based on the relative region densities in the subspace attributes, where the clusters are regarded as regions whose densities are relatively high compared to the region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover the clusters in different subspace attributes. We also devise an innovative algorithm, referred to as MDSC (Modified Density Subspace Clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities.

Advantages of the proposed system
- The proposed system efficiently discovers clusters satisfying different density thresholds in different subspace attributes.
- It reduces the density divergence problem.
- It reduces the outliers (the data points which are out of range).
- It improves the intra-cluster and inter-cluster performance.

E. Algorithm for Modified Density Subspace Clustering
1. Initialize the selected attribute set and the total attribute set An.
2. Select a set of attribute subsets from the dataset, d = 1…n.
While (attribute set != null)
{
    Set eps
    Set minPts
    C = 0
    best distance = 0
    i = 1…n
    For each unvisited point P in dataset D
        mark P as visited
        N = getNeighbours(P, eps)
        if sizeof(N) < MinPts
            mark P as NOISE
        else
            C = next cluster
            expandCluster(P, N, C, eps, MinPts)
    Calculate the average distance of each cluster,

        f_i = (1/|C_i|) Σ_j d(x_ij, c_i)

    where x_ij denotes the jth datapoint which belongs to cluster i, Nc stands for the number of clusters (i = 1…Nc), and d(x_ij, c_i) is the distance between datapoint x_ij and the cluster centroid c_i.

    best distance = 0
    for each value f_i
        if (f_i > best distance)
        {
            best distance = f_i
            selected attribute set = attribute subset giving best distance f_i
            Hierarchical_clustering()
        }
        d = d + 1
}
End while

expandCluster(P, N, C, eps, MinPts)
{
    add P to cluster C
    for each point P' in N
        if P' is not visited
            mark P' as visited
            N' = getNeighbours(P', eps)
            if sizeof(N') >= MinPts
                N = N joined with N'
        if P' is not yet a member of any cluster
            add P' to cluster C
    Return the cluster C from the datapoint P
}
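The density-clustering pseudocode above (getNeighbours / expandCluster with eps and MinPts) follows the well-known DBSCAN pattern [2]. A compact runnable Python version, with names chosen to mirror the pseudocode, might look like this; it is a sketch of the density step only, not the authors' full MDSC implementation:

```python
def get_neighbours(data, p, eps):
    """Indices of all points within Euclidean distance eps of point p."""
    return [i for i, q in enumerate(data)
            if sum((a - b) ** 2 for a, b in zip(data[p], q)) ** 0.5 <= eps]

def density_cluster(data, eps, min_pts):
    """DBSCAN-style clustering as in the pseudocode: a point whose
    eps-neighbourhood holds fewer than min_pts points is noise (-1);
    otherwise a new cluster C is grown by expanding the neighbourhood."""
    labels = [None] * len(data)           # None = unvisited
    c = -1
    for p in range(len(data)):
        if labels[p] is not None:         # already visited
            continue
        n = get_neighbours(data, p, eps)
        if len(n) < min_pts:
            labels[p] = -1                # mark P as NOISE
            continue
        c += 1                            # C = next cluster
        labels[p] = c
        seeds = list(n)                   # expandCluster(P, N, C, ...)
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:           # border point: absorb, don't expand
                labels[q] = c
            if labels[q] is not None:
                continue
            labels[q] = c                 # add P' to cluster C
            nq = get_neighbours(data, q, eps)
            if len(nq) >= min_pts:
                seeds.extend(nq)          # N = N joined with N'
    return labels

# Two dense groups and one far-away point, which comes out as noise.
labels = density_cluster(
    [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)],
    eps=2.0, min_pts=2)
```

The two tight groups receive cluster labels 0 and 1, and the isolated point is labelled -1 (noise), matching the "mark P as NOISE" branch of the pseudocode.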
Hierarchical_clustering()
Let X = {x1, x2, x3, ..., xn} be the set of data points.
I. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
II. Find the least-distance pair of clusters in the current clustering, say pair (Ci, Cj), according to d[(Ci), (Cj)] = min d[(i), (j)] = best distance, where the minimum is over all pairs of clusters in the current clustering.
III. Increment the sequence number: m = m + 1. Merge clusters Ci and Cj into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(Ci), (Cj)].
IV. Update the distance matrix D by deleting the rows and columns corresponding to clusters Ci and Cj and adding a row and column corresponding to the newly formed cluster. The distance between the new cluster, denoted (Ci, Cj), and an old cluster Ck is defined as d[(Ck), (Ci, Cj)] = min[d(k, i), d(k, j)].
V. If all the data points are in one cluster then stop, else repeat from step II.
3. End

IV. EXPERIMENTAL RESULTS
The main objective of this section is to measure the results of the proposed system against the existing system. The performance of the cluster results and the cluster analysis was measured in terms of cluster quality (intra-cluster and inter-cluster distance) and computation time. The proposed system adapts well to the dynamic behaviour of time series data streams; we evaluate it with real data produced by applications that generate time series data streams.

A. Evaluation Criteria for Clustering Quality
Generally, the criteria used to evaluate clustering methods concentrate on the quality of the resulting clusters. Given the hierarchical characteristics of the system, we assess the quality of the hierarchy constructed by our algorithm; another evaluation criterion is the computation time of the system.

B. Cluster Quality
A good clustering algorithm will produce high-quality clusters as judged by intra-cluster and inter-cluster similarity measures. The quality of the clustering result depends on the similarity measure used by the method and on its implementation, and is also measured by the method's ability to discover some or all of the hidden patterns. The criteria for measuring cluster quality are that the intra-cluster similarity should be high and the inter-cluster similarity should be low. Cluster quality is analysed in two forms: first, finding groups of objects that are related to one another; and second, finding groups of objects that differ from the objects in other groups.

C. Computation Time
Another evaluation of this work is the computation time of the process. The execution time is decreased when using the proposed work.

D. Outlier
An outlier is a data point which lies out of the range of its cluster. The outliers are calculated for the existing methods ODAC (Online Divisive-Agglomerative Clustering) and IHCA, and for the proposed method MDSC (Modified Density Subspace Clustering).

Outlier Calculation
Step 1: The intra-cluster value is calculated for all clusters.
Step 2: The mean of the intra-cluster values is found.
Step 3: All the data points of the clusters are compared with the mean value.
Step 4: After the comparison, it is decided for each data point whether it lies within a cluster or out of the cluster.

V. SYSTEM EVALUATION ON TIME SERIES DATA SET
The proposed method is evaluated with different kinds of time series data sets. Three types of data sets are used to evaluate the proposed algorithm: ECG data, EEG data and network sensor data. The ECG data set is used for anomaly identification; it has three attributes, namely time (seconds), left peak and right peak. The EEG data set is used to detect abnormal personality; its attributes are trial number, sensor value, sensor position and sample number. The third data set is network sensor data; its attributes are total bytes, in bytes, out bytes, total packages, in packages, out packages and events.

A. Record Set Specification
TABLE I
DATA SET SPECIFICATION

Data Set         Number of Instances   Number of Attributes
ECG              1800                  3
EEG              1644                  4
Network Sensor   2500                  7

Using the above three kinds of data sets, we calculate the execution time of the system and the intra-cluster, inter-cluster and outlier values.

B. Result of Outlier
TABLE II
OUTLIER SPECIFICATION

Technique                 Outlier Points
Existing System (ODAC)    152
Existing System (IHCA)    123
Proposed System (MDSC)    63
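The four-step outlier calculation above can be sketched in Python. This is a minimal illustration of the stated steps; the concrete choices (Euclidean distance to the centroid as the "intra-cluster value", the mean distance as the threshold) are our assumptions, not code from the paper:

```python
def centroid(cluster):
    """Component-wise mean of a list of numeric points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def intra_distance(point, centre):
    """Euclidean distance from a point to its cluster centroid."""
    return sum((a - b) ** 2 for a, b in zip(point, centre)) ** 0.5

def find_outliers(clusters):
    """Step 1: intra-cluster distance of every point; Step 2: mean of
    those distances; Steps 3-4: points farther from their centroid
    than the mean are flagged as outliers."""
    dists = []
    for cluster in clusters:
        c = centroid(cluster)
        dists.extend((p, intra_distance(p, c)) for p in cluster)
    mean = sum(d for _, d in dists) / len(dists)
    return [p for p, d in dists if d > mean]

# One tight cluster plus a stray point: only the stray point exceeds
# the mean intra-cluster distance and is flagged.
outliers = find_outliers([[(0, 0), (0, 1), (1, 0), (9, 9)]])
```

With this rule, counting the flagged points per technique yields outlier totals of the kind reported in Table II.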
C. Result of Execution Time
The following table shows the difference in execution time between the techniques.

TABLE III
EXECUTION TIME (IN SECONDS) OF THE EXISTING AND PROPOSED SYSTEMS

No. of Clusters   ODAC     IHCA     MDSC
2                 2.0343   1.9066   1.3782
4                 2.0493   1.9216   1.3664
6                 2.1043   1.9766   1.3641
8                 2.1115   1.9838   1.3901
10                2.0536   1.9259   1.3251

FIGURE 1: EXECUTION TIME OF THE EXISTING AND PROPOSED SYSTEMS

TABLE IV
INTRA-CLUSTER DISTANCE OF THE EXISTING AND PROPOSED SYSTEMS

No. of Clusters   ODAC     IHCA     MDSC
2                 898.94   865.15   831.72
4                 448.87   415.63   382.21
6                 346.56   313.41   279.98
8                 271.54   238.54   205.11
10                199.04   166.04   132.61

FIGURE 2: INTRA-CLUSTER DISTANCE OF THE EXISTING AND PROPOSED SYSTEMS

TABLE V
INTER-CLUSTER DISTANCE OF THE EXISTING AND PROPOSED SYSTEMS

No. of Clusters   ODAC     IHCA     MDSC
2                 295.64   375.84   393.26
4                 262.72   279.07   296.42
6                 233.27   204.84   222.26
8                 156.65   148.74   166.16
10                136.54   119.89   137.31

FIGURE 3: INTER-CLUSTER DISTANCE OF THE EXISTING AND PROPOSED SYSTEMS

VI. CONCLUSION AND FUTURE WORK
Clustering of time series data is faced with the curse of dimensionality, since real-world data consist of many dimensions. Finding clusters in feature space through subspace clustering is a growing task.
A density-based approach identifies clusters in high-dimensional point sets. Density subspace clustering is a method to detect the density-connected clusters in all subspaces of high-dimensional data; for clustering time series data streams, multidimensional data clustering evaluation can be done through such a density-based approach. The Modified Density Subspace Clustering algorithm is used to find the best cluster result from the dataset; it improves the cluster quality and evaluates the similarity between the data points in the clustering. The MDSC algorithm also addresses the density divergence problem (outliers). The clustering results of the proposed system were compared with those of existing clustering algorithms in terms of execution time and cluster quality. Experimental results show that the proposed system improves clustering quality and takes less time than the existing clustering methods. Future work is to optimize the centroid point using a multi-viewpoint approach, and to apply this technique to non-time-series data sets as well.

References
[1] Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang, and De-Nian Yang, "Density Conscious Subspace Clustering for High Dimensional Data," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, January 2010.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han, and U. Fayyad, Eds. AAAI Press, 1996, pp. 226-231.
[3] Y. Kim, W. Street, and F. Menczer, "Feature Selection in Unsupervised Learning via Evolutionary Search," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 365-369, 2000.
[4] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, "On clustering validation techniques," Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107-145, 2001.
[5] Pedro Pereira Rodrigues and Joao Pedro Pedroso, "Hierarchical Clustering of Time Series Data Streams."
[6] Ashish Singhal and Dale E. Seborg, "Clustering Multivariate Time Series Data," Journal of Chemometrics, vol. 19, pp. 427-438, Jan 2006.
[7] Sudipto Guha, Adam Meyerson, Nina Mishra, and Rajeev Motwani, "Clustering Data Streams: Theory and Practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 515-528, May/June 2003.