International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072
© 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637
Optimal Number of Cluster Identification using Robust K-means for
Sequences in Categorical Sequences
S.U. Patil¹, U.A. Nuli²
¹,²Computer Science and Engineering Department, M. Tech, Textile and Engineering Institute, Ichalkaranji,
Maharashtra, India
Abstract - This paper presents a modified k-means algorithm for clustering. In
traditional clustering methods, the number of clusters to be formed is given at the
start of the algorithm, which affects the performance and efficiency of the
algorithm. In Robust K-means for Sequences, the optimal number of clusters is
predicted by removing noise clusters. Cluster validation, which is the process of
evaluating the quality of clustering results, plays an important role in practical
machine learning systems. Categorical sequences, such as biological sequences in
computational biology, have become common in real-world applications. Unlike
previous studies, which mainly focused on attribute-value data, this paper addresses
the cluster validation problem for categorical sequences. Clustering is defined as
unsupervised learning in which objects are grouped on the basis of some similarity
inherent among them. The intention of this paper is to describe a clustering method
that gives the optimal number of clusters in categorical sequences.
Key Words: Clustering, K-means, cluster validation index, categorical sequences,
centroid.
1. INTRODUCTION
Data mining is the process of deriving required data from a large dataset and
analyzing the collected data. Its aim is to mine information from a bulky data set
and transform it into an understandable form for further use. In generic terms this
is known as classification. When the classes of objects are given in advance, it is
termed supervised classification, whereas when the class label is not attached to an
object in advance, it is termed unsupervised classification. Unsupervised
classification is commonly known as clustering. Clustering is an important analysis
technique that is applied to large datasets and finds application in fields such as
search engines, recommendation systems, data mining, knowledge discovery,
bioinformatics and documentation. Nowadays, the data being generated is not only
huge in volume, but is also stored across various machines all around the world. The
main purpose behind the study of classification is to develop a tool or an algorithm
that can be used to predict the class of an unknown, unlabeled object.
The clustering problem cannot be solved by one specific algorithm; it requires
various algorithms that differ significantly in their notion of what makes a cluster
and how to find clusters efficiently. Generally, clusters include groups with small
distances among the cluster members, dense areas of the data space, intervals or
particular statistical distributions. Clustering can therefore be formulated as a
multi-objective optimization problem. The appropriate clustering algorithm and
parameter settings depend on the individual data set and the intended use of the
results. Clustering is considered more difficult than supervised classification
because no label is attached to the patterns. In supervised classification, the
given label is a clue for grouping data objects as a whole, whereas in clustering it
is difficult to decide to which group a pattern belongs in the absence of a label.
The problem of clustering becomes more challenging when the data is categorical,
that is, when there is no inherent distance measure between data values. This is
often the case in many domains, where data is described by a set of descriptive
attributes, many of which are neither numerical nor inherently ordered in any way.
Moreover, clustering categorical sequences is challenging for one more reason: the
difficulty of defining an inherently meaningful measure of similarity between
sequences. Cluster validation, which is the process of evaluating the quality of
clustering results, plays an important role in practical machine learning systems.
Categorical sequences, such as biological sequences in computational biology, have
become common in real-world applications. Without a measure of distance between data
values, it is unclear how to define a quality measure for categorical clustering. To
do this, we employ mutual information, a measure from information theory. A good
clustering is one where the clusters are informative about the data objects they
contain. Since data objects are expressed in terms of attribute values, we require
that the clusters convey information about the attribute values of the objects in
the cluster. The evaluation of sequence clustering is currently difficult due to the
lack of an internal validation criterion defined with regard to the structural
features hidden in sequences. To solve this problem, a novel cluster validity index
(CVI) is proposed as a function of clustering, with the intra-cluster structural
compactness and inter-cluster structural separation linearly combined to measure the
quality of sequence clusters. Cluster validation plays an important role in many
issues of cluster analysis. A partition-based algorithm for robust clustering of
categorical sequences and the CVI are then assembled within the common model
selection procedure to determine the number of clusters in categorical sequence
sets. Currently, cluster validation remains an open problem due to the unsupervised
nature of clustering tasks, where no external validation criterion is available to
evaluate the result. Among the different aspects of cluster validation, we are
interested in the problem of determining the optimal number of clusters in a data
set, as most basic clustering algorithms assume that the number of clusters is a
user-defined parameter, which, however, is difficult to set in practice.
2. RELATED WORK
K. A. Abdul Nazeer et al. [12] discuss one major drawback of the k-means algorithm:
for different sets of initial centroids, it produces different clusters, so the
final cluster quality depends on the selection of the initial centroids. The
original k-means algorithm consists of two phases: the first determines the initial
centroids, and the second assigns data points to the nearest clusters and then
recalculates the cluster means. The paper discusses an enhanced clustering method in
which both phases of the original k-means algorithm are modified to improve accuracy
and efficiency. The algorithm combines a systematic method for finding initial
centroids with an efficient way of assigning data points to clusters. A limitation
remains, however: the value of k, the number of desired clusters, must still be
given as an input, regardless of the distribution of the data points.
G. Dong and J. Pei [10] state that an algorithm is expected to produce high-quality
results associated with the underlying cluster structures in the data set for a
given K. This problem becomes difficult for categorical sequences because they are
usually infected with significant quantities of noise, which confuses the
identification of cluster structures. The clusters that contain few objects after
applying an algorithm are considered K-dependent noise. Such cluster-number-dependent
noise would mislead the validation criterion used to evaluate the clustering
quality. The shortcomings of the standard k-means algorithm are also analyzed:
k-means has to calculate the distance between each data object and all cluster
centers in each iteration, and this repetitive process affects the efficiency of the
clustering algorithm.
R. Xu and D. Wunsch [11] consider the existing partition-based methods, in which the
performance of the k-means algorithm is evaluated with various databases, such as
the Iris, Wine, Vowel, Ionosphere and Crude Oil data sets, and various distance
metrics. Owing to the growth in the amount of data across the world, analyzing the
data has become a very difficult task. To understand and learn from the data, it has
to be classified into meaningful collections, so there is a need for data mining
techniques. It is concluded that the performance of k-means clustering depends on
the database used as well as on the distance metric. A comparative analysis of one
partition clustering algorithm (k-means) and one hierarchical clustering algorithm
(agglomerative) is also presented, in which the performance of both algorithms is
calculated on the basis of accuracy and running time using the WEKA tools. The
results show that the accuracy of k-means is higher than that of hierarchical
clustering for the Iris dataset, which has real attributes, and that the accuracy of
hierarchical clustering is higher than that of k-means for the diabetes dataset,
which has integer and real attributes. So for large datasets the k-means algorithm
is good.
R. Amutha et al. [13] proposed that the best results are acquired when two or more
algorithms of the same category of clustering technique are used. They consider two
k-means algorithms: Parallel k/h-Means Clustering for Large Data Sets and A Novel
K-Means Based Clustering Algorithm for High Dimensional Data Sets. The Parallel
k/h-Means algorithm is designed to deal with very large data sets. Novel K-Means
Based Clustering provides the advantages of using both HC and K-Means. Using these
two algorithms, the space and the similarity between the data sets present at each
node are extended. However, there is a general lack of adaptive mechanisms to deduce
a reasonable number of clusters with the possible K-dependent noise identified,
given that the user-defined K likely deviates from the true number. As a solution to
this problem, a novel cluster validation method that provides robust clustering is
used.
Navjot Kaur and Navneet Kaur [4] enhanced the traditional k-means by introducing a
Ranking Method to reduce the long execution time of traditional k-means. The Ranking
Method is a way to find occurrences of similar data and to improve search
effectiveness. The improved algorithm was implemented in C# using Visual Studio
2008. The advantages of k-means are also analyzed in that paper: the authors find
k-means to be a fast, robust and easily understandable algorithm, and they note that
its clusters are non-hierarchical and non-overlapping. The process takes student
marks as the data set and then selects the initial centroids. The Euclidean distance
from the centroid is then calculated for each data object, and a threshold value is
set for each data set. The Ranking Method is applied next, and finally the clusters
are created based on the minimum distance between the data point and the centroid.
As future scope, the paper suggests that Query Redirection can be used to cluster
huge amounts of data from various databases.
3. METHODOLOGY
In the partition-based robust K-means for sequences (RKMS) algorithm for robust
clustering of categorical sequences, the data given as input is in character form.
There are difficulties in defining an inherently meaningful measure of similarity
between sequences. To apply clustering algorithms and existing indices to sequences,
one has to resort to vectorization. To convert the sequences into vector form, they
are encoded in Latin-1, also called ISO-8859-1, an 8-bit character set endorsed by
the International Organization for Standardization (ISO) that represents the
alphabets of Western European languages. Once the character file has been converted
to binary format, we apply the Principal Component Analysis (PCA) algorithm. The
main idea of PCA is to reduce the dimensionality of a data set consisting of many
variables correlated with each other, either heavily or lightly, while retaining the
variation present in the data set to the maximum extent. PCA is important here
because our dataset is in a very diverse format, and analyzing this type of data is
very difficult without it. After this preprocessing, we are ready to apply the RKMS
algorithm to the data.
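The preprocessing described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names `encode_latin1` and `pca_reduce`, the zero-padding of sequences to a common length, and the computation of PCA through an SVD of the centered matrix are all assumptions made for the example.

```python
import numpy as np

def encode_latin1(sequences):
    """Encode each character sequence as Latin-1 byte values.

    Shorter sequences are zero-padded to a common width (an assumption).
    """
    width = max(len(s) for s in sequences)
    data = np.zeros((len(sequences), width), dtype=float)
    for row, seq in enumerate(sequences):
        codes = list(seq.encode("latin-1"))  # one byte (0-255) per character
        data[row, :len(codes)] = codes
    return data

def pca_reduce(data, n_components):
    """Project the rows of `data` onto the top principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

sequences = ["ACGTAC", "ACGTTC", "TTGACA", "TTGA"]  # toy data, not from the paper
vectors = encode_latin1(sequences)
reduced = pca_reduce(vectors, n_components=2)
```

The reduced vectors, one row per sequence, would then be the input to the clustering step.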
The number of clusters, K, produced by RKMS is not necessarily equal to the given
number K̂ (we call K̂ the expected number of clusters). The number of clusters is
adaptively determined by the elimination of K-dependent noise during RKMS
clustering. Two key issues for a robust K-means type algorithm are the selection of
the initial cluster centroids and the identification of noise clusters. In general,
groups that contain a small number of objects can be viewed as noise. To identify
noise clusters, a threshold is given, and if the number of objects in a cluster is
less than the threshold, that cluster is considered noise.
After the clusters are created, their quality is measured using the cluster validity
index. For this, the structural dissimilarity between the objects of the clusters is
considered. The new index is a linear combination of the within-cluster scatter (how
the objects are scattered within a cluster) and the between-cluster separation (how
the clusters are separated); both measures are computed on the structural
dissimilarity of the sequences, which is a probabilistic approach.
Fig. 1. System Architecture of RKMS method
If we are using high-dimensional data, the robustness of the K-means algorithm may
become a problem. There are two main issues in this algorithm: selecting the initial
centroids, and finding and removing the noise clusters. The first issue arises
because the performance of the K-means algorithm, as an instance of the EM method
[13], is known to be sensitive to the initial conditions (here, the initial cluster
centroids). Since the number of clusters is fixed at the K̂ given by the user, the
traditional algorithm likely produces clusters consisting of K̂-dependent noise,
especially when the expected number deviates largely from the true number.
RKMS aims at building a robust condition for the
coming iteration, by choosing a set of well-scattered objects
(sequences) as the initial cluster centroids. The selection
method is based on the greedy approach.
The (k + 1)th initial centroid is selected based on the maximum-minimum principle,
given by
$$I_{k+1} = \arg\max_{s \in S} \min_{1 \le j \le k} d(s, I_j) \qquad (2)$$
where d(·, ·) denotes the distance between two sequence vectors. In the traditional
greedy approach, however, the first centroid is chosen randomly or according to the
length of the object vector. Since each sequence has been represented as a unit
vector, as Definition 1 shows, we choose the first two centroids I_1 and I_2 by
$$(I_1, I_2) = \arg\max_{s, s' \in S} d(s, s')$$
That is, the two sequences having the maximal pairwise distance are chosen as the
initial centroids for C_1 and C_2; the remaining K̂ − 2 centroids are then determined
by repeatedly using (2) with k = 2, ..., K̂ − 1.
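The selection rule described above (the pair with the maximal pairwise distance first, then repeated maximum-minimum choices) can be sketched as follows. The function name `select_initial_centroids` and the use of Euclidean distance are assumptions for illustration; the paper's own distance between sequence vectors may differ.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Stand-in distance between two sequence vectors (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_initial_centroids(vectors, k_hat, dist=euclidean):
    """Return the indices of k_hat well-scattered vectors, chosen greedily."""
    if k_hat < 2 or k_hat > len(vectors):
        raise ValueError("k_hat must be between 2 and the number of vectors")
    # First two centroids: the pair with the maximal pairwise distance.
    first, second = max(
        combinations(range(len(vectors)), 2),
        key=lambda pair: dist(vectors[pair[0]], vectors[pair[1]]),
    )
    chosen = [first, second]
    # Remaining centroids: maximize the minimum distance to those already chosen.
    while len(chosen) < k_hat:
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: min(dist(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 5.0), (0.0, 1.0)]
centroids = select_initial_centroids(points, k_hat=3)
```

The exhaustive pairwise search costs quadratic time in the number of sequences, which is acceptable for a sketch but may need sampling on large sets.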
RKMS also deletes the possible K-dependent noise in each iteration, to further
enhance the robustness of the algorithm; the details are given below. Note that the
number of resulting clusters produced by RKMS could be less than the given K̂, due to
the inclusion of this step in the algorithm. This feature is somewhat similar to
that of [10] and [12], where the number can be reduced by fading out the redundant
clusters; however, RKMS targets the identification of K-dependent noise.
Given a clustering C of the sequence set, we now aim at establishing a judgement
criterion to examine whether each group C_k ∈ C is a noise cluster or not. In
general, groups that contain a small number of objects can be viewed as noise.
According to this view, one can derive a necessary condition for the judgement,
i.e., n_k = |C_k| ≤ τ, where n_k is the number of objects in C_k and τ is a
threshold defining the minimal number in a normal cluster. If the condition holds
with τ = 1, then the cluster is undoubtedly noise, as it consists of one single
object. However, this is not the case when 1 < n_k ≤ τ, because the assertion then
depends on the distribution of the objects in the cluster. To address the problem,
we propose a model-selection based method, in view of the observation that if each
object in a cluster is indeed K-dependent noise, then the preferred clustering model
should be the one with that cluster removed. With regard to C_k ∈ C, we denote the
new clustering after the removal of C_k by C′, created by reassigning each sequence
originally belonging to C_k to its closest cluster other than C_k. Thus, the number
of clusters satisfies |C′| = |C| − 1 and, consequently, the conclusion that C_k is a
noise cluster can be drawn if the model of C′ is preferred to that of C.
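The two mechanical parts of this judgement, the necessary condition n_k ≤ τ and the construction of the reduced clustering C′, can be sketched as follows. The helper names, the dictionary layout of a clustering and the use of the closest centroid under Euclidean distance as "closest cluster" are assumptions; the criterion that decides whether the model of C′ is preferred is not reproduced here.

```python
import math

def euclidean(u, v):
    """Stand-in distance between two vectors (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(members):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def noise_candidates(clusters, tau):
    """Cluster ids that meet the necessary condition n_k <= tau."""
    return [k for k, members in clusters.items() if len(members) <= tau]

def remove_cluster(clusters, k, dist=euclidean):
    """Build the reduced clustering: members of cluster k are reassigned
    to the remaining cluster with the closest centroid (assumption)."""
    remaining = {j: list(m) for j, m in clusters.items() if j != k}
    centers = {j: centroid(m) for j, m in remaining.items()}
    for vec in clusters[k]:
        nearest = min(centers, key=lambda j: dist(vec, centers[j]))
        remaining[nearest].append(vec)
    return remaining

clusters = {
    0: [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    1: [(9.0, 9.0), (10.0, 9.0), (9.0, 10.0)],
    2: [(2.0, 2.0)],
}
candidates = noise_candidates(clusters, tau=1)
reduced = remove_cluster(clusters, candidates[0])
```

In a full implementation, `reduced` would be kept only if the model comparison favours it over the original clustering.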
The structure-based cluster validity index has two main components: the structural
dissimilarity (SD) of sequences and the new cluster validity index (CVI). We measure
the SD of sequences by Markovian modeling, which is based on the hypothesis that the
occurrence of each symbol x_l in the sequence s = x_1 . . . x_l . . . x_L is closely
related to its preceding subsequence x_0 x_1 . . . x_{l−1}. For a Markov model of
order ρ, the preceding subsequence of x_l is truncated to δ_l = x_{l−ρ} . . .
x_{l−1}, excluding x_{l−ρ} for l < ρ. By such modeling, the sequential dependence of
each symbol x_l on its preceding subsequence δ_l can be characterized using the
conditional probability p_s(x_l | δ_l) with regard to the sequence s,
$$p_s(x_l \mid \delta_l) = \frac{p_s(\delta_l x_l)}{p_s(\delta_l)}$$
where p_s(δ) is the probability of δ with regard to s.
The SD between two sequences s and s′ can then be numerically computed by measuring
the dissimilarity of their probability distributions p_s(X) and p_s′(X).
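A minimal sketch of the Markovian modeling described above estimates the conditional probabilities from symbol counts. Estimating p_s(x | δ) as a ratio of counts is an assumption, and `l1_dissimilarity` is only a stand-in for comparing two distributions; it is not the SD measure of the paper, which is not reproduced in this text.

```python
from collections import Counter

def conditional_probabilities(sequence, order):
    """Estimate p_s(x | delta) from counts, where delta is the preceding
    subsequence truncated to at most `order` symbols."""
    context_counts = Counter()
    pair_counts = Counter()
    for pos, symbol in enumerate(sequence):
        context = sequence[max(0, pos - order):pos]  # shorter near the start
        context_counts[context] += 1
        pair_counts[(context, symbol)] += 1
    return {
        (context, symbol): count / context_counts[context]
        for (context, symbol), count in pair_counts.items()
    }

def l1_dissimilarity(probs_a, probs_b):
    """Stand-in dissimilarity (assumption): summed absolute difference
    over all (context, symbol) pairs seen in either sequence."""
    keys = set(probs_a) | set(probs_b)
    return sum(abs(probs_a.get(k, 0.0) - probs_b.get(k, 0.0)) for k in keys)

probs_s = conditional_probabilities("ABABAB", order=1)
probs_t = conditional_probabilities("AAB", order=1)
distance = l1_dissimilarity(probs_s, probs_t)
```

In "ABABAB" the symbol B always follows A, so the estimate of p(B | A) is 1, whereas in "AAB" it is 0.5; the dissimilarity reflects this structural difference rather than any position-wise alignment.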
The next part is the CVI, denoted V_SD, which validates the quality of a series of
clustering results, each generated by the clustering algorithm on the same sequence
set S with various numbers of sequence clusters. To distinguish the results with
different cluster numbers, we now use C_K to denote C consisting of K sequence
clusters, i.e., C_K = {C_1, . . . , C_k, . . . , C_K}. Denote the minimal and the
maximal number of clusters in the series by K_min and K_max, respectively. The new
index is a linear combination of the within-cluster scatter and between-cluster
separation, where both measures are computed on the structural dissimilarity of
sequences using SD(·, ·) of (5). We denote the total within-cluster scatter and
between-cluster separation of C_K by Scat(C_K) and Sep(C_K), respectively.
Since, in practice, the number of clusters ranges over [K_min, K_max], we define the
CVI as a linear combination of the measures normalized by maxScat and maxSep, where
maxScat = max_{K′ ∈ [K_min, K_max]} Scat(C_{K′}) and
maxSep = max_{K′ ∈ [K_min, K_max]} Sep(C_{K′}). The index offers a tradeoff between
the within-cluster scatter and the between-cluster separation. The minimal V_SD is
considered to be associated with the clustering results produced by the algorithm
using the optimal number of clusters.
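The model selection procedure can be sketched as a search for the K with the minimal index value. The exact form of the linear combination is not reproduced in this text, so the sketch below assumes a plain weighted sum of the two normalized terms and assumes both are defined so that smaller values are better; the weights `w_scat` and `w_sep` and the function names are hypothetical.

```python
def normalized_index(scat, sep, w_scat=1.0, w_sep=1.0):
    """Combine per-K scatter and separation values, each divided by its
    maximum over the candidate range. The weights are hypothetical."""
    max_scat = max(scat.values())
    max_sep = max(sep.values())
    return {
        k: w_scat * scat[k] / max_scat + w_sep * sep[k] / max_sep
        for k in scat
    }

def select_k(index_by_k):
    """Return the number of clusters with the minimal index value."""
    return min(index_by_k, key=index_by_k.get)

# Toy values for K in [K_min, K_max] = [2, 5]; not results from the paper.
scat = {2: 8.0, 3: 4.0, 4: 3.0, 5: 2.5}
sep = {2: 1.0, 3: 1.5, 4: 3.0, 5: 4.0}
index = normalized_index(scat, sep)
best_k = select_k(index)
```

With these toy values the scatter term falls as K grows while the separation term rises, and the combined index is smallest at an intermediate K, which is the tradeoff the index is meant to capture.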
4. RESULTS AND DISCUSSION
The main purpose is to detect the optimal number of clusters on the "USDA Plants
Dataset" using the K-means clustering algorithm. The database contains all plants
(species and genera) in the database and the states of the USA and Canada where they
occur.
The dataset used for this research contains 34781 records with 70 features.
DataSet Characteristics: Multivariate
Attribute Characteristics: Categorical
Associated Tasks: Clustering
Number of Instances: 22632
Number of Attributes: 70
Missing Values? Yes
Table -1: Dataset features
To evaluate the clustering we use silhouette analysis, which can be used to study
the separation distance between the resulting clusters. The silhouette plot displays
a measure of how close each point in one cluster is to points in the neighboring
clusters and thus provides a way to assess parameters such as the number of clusters
visually. This measure has a range of [-1, 1]. Silhouette coefficients (as these
values are called) near +1 indicate that the sample is far away from the neighboring
clusters. A value of 0 indicates that the sample is on or very close to the decision
boundary between two neighboring clusters, and negative values indicate that the
sample might have been assigned to the wrong cluster.
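For reference, the silhouette coefficient of a sample is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below is a plain-Python illustration on toy points, assuming Euclidean distance and at least two clusters; it is not the tooling used for the reported experiments.

```python
import math

def euclidean(u, v):
    """Stand-in distance between two points (assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_samples(points, labels, dist=euclidean):
    """Silhouette coefficient (b - a) / max(a, b) for each point."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for idx, label in enumerate(labels):
        own = [j for j in clusters[label] if j != idx]
        if not own:  # singleton cluster: score is 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[idx], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[idx], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_samples(points, labels)
mean_score = sum(scores) / len(scores)
```

Averaging the per-sample values gives a single score per clustering, which is how one score per value of K can be compared.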
Below are the silhouette coefficients for the clusters produced by the K-means
algorithm on 1000 attribute data for the respective values of K.
K value passed    Silhouette coefficient
10                0.161727
20                0.214786
30                0.218644
40                0.255462
Table-2: K values with respective silhouette score
The performance of RKMS is evaluated in terms of clustering accuracy (CA), which is
computed as
$$CA = \frac{1}{N} \sum_{k=1}^{K} a_k$$
where N is the total number of sequences, K is the true number of clusters and a_k
is the number of sequences in the majority group corresponding to C_k. To examine
the strength of the robust method used to choose the initial cluster centroids in
RKMS, we also performed clustering using K-means with random initialization [1],
[4]. The data set was clustered by K-means for 20 executions, and both the average
and best accuracy are reported. After applying the RKMS algorithm to the same data
set, the optimal number of clusters obtained is as below.
Estimated number of clusters    Silhouette coefficient
23                              0.361
Fig-2. Cluster image with 500 data element
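The clustering accuracy (CA) used in this section can be illustrated with a short sketch. It assumes a purity-style reading, in which a_k is the size of the largest true class inside cluster k and the sum is divided by the total number of sequences; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of sequences that fall in the majority true class of
    their assigned cluster (a purity-style reading of CA; assumption)."""
    by_cluster = {}
    for truth, cluster in zip(true_labels, cluster_labels):
        by_cluster.setdefault(cluster, Counter())[truth] += 1
    # a_k: size of the majority group inside cluster k
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)

true_labels = ["x", "x", "x", "y", "y", "z"]   # toy ground truth
cluster_labels = [0, 0, 1, 1, 1, 1]            # toy clustering result
ca = clustering_accuracy(true_labels, cluster_labels)
```

Here cluster 0 contributes 2 sequences and cluster 1 contributes 2, so 4 of the 6 sequences count toward the accuracy.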
5. CONCLUSION
This paper presented a robust algorithm named RKMS for partition-based sequence
clustering. In the RKMS algorithm, unlike traditional partitioning methods, the
number of clusters (K) can be automatically deduced from the expected number (K̂,
given by the user) by the elimination of cluster-number-dependent noise (called
K-dependent noise). RKMS and V_SD are assembled within the common model selection
procedure to estimate the optimal number of clusters in a sequence set. A cluster
validation method for categorical sequence clustering is proposed, which differs
from the existing methods mainly designed for numeric data. A probabilistic approach
is used to measure the structural dissimilarity of categorical sequences, based on
which the within-cluster compactness and between-cluster separation are formulated
in order to evaluate the structural similarity of sequences in each cluster and the
structural dissimilarity between sequence clusters. To evaluate the quality of
sequence clusters, a new CVI named V_SD is defined by linearly combining the
within-cluster structural compactness and the between-cluster structural separation.
REFERENCES
[1] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering validity checking methods: Part II,” ACM SIGMOD Rec. Arch., vol. 31, no. 3, pp. 19–27, 2002.
[2] C. C. Aggarwal, Data Mining: The Textbook. New York, NY, USA: Springer, 2015.
[3] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” J. Cybern., vol. 3, no. 3, pp. 32–57, 1973.
[4] T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,” Commun. Stat., vol. 3, no. 1, pp. 1–27, 1974.
[5] H. Sun, S. Wang, and Q. Jiang, “FCM-based model selection algorithms for determining the number of clusters,” Pattern Recognit., vol. 37, no. 10, pp. 2027–2037, Oct. 2004.
[6] L. Xu, T. W. S. Chow, and E. W. M. Ma, “Topology-based clustering using polar self-organizing map,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 798–808,
[7] D. Pelleg and A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in Proc. 17th ICML, 2000, pp. 727–734.
[8] J. Yang and W. Wang, “CLUSEQ: Efficient and effective sequence clustering,” in Proc. IEEE ICDE, Mar. 2003, pp. 101–112.
[9] M. M.-T. Chiang and B. Mirkin, “Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads,” J. Classif., vol. 27, no. 1, pp. 3–40, 2010.
[10] G. Dong and J. Pei, “Classification, clustering, features and distances of sequence data,” Seq. Data Min., vol. 33, pp. 47–65, 2007.
[11] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
[12] K. A. Abdul Nazeer and M. P. Sebastian, “Improving the accuracy and efficiency of the k-means clustering algorithm,” in Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, July 1-3, 2009, London, U.K.

More Related Content

What's hot

Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...
Extensive Analysis on Generation and Consensus Mechanisms of Clustering Ensem...IJECEIAES
 
IRJET - Survey on Clustering based Categorical Data Protection
IRJET - Survey on Clustering based Categorical Data ProtectionIRJET - Survey on Clustering based Categorical Data Protection
IRJET - Survey on Clustering based Categorical Data ProtectionIRJET Journal
 
A statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentA statistical data fusion technique in virtual data integration environment
A statistical data fusion technique in virtual data integration environmentIJDKP
 
Review of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmReview of Existing Methods in K-means Clustering Algorithm
Review of Existing Methods in K-means Clustering AlgorithmIRJET Journal
 
Effective data mining for proper
Effective data mining for properEffective data mining for proper
Effective data mining for properIJDKP
 
Building a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy LogicBuilding a Classifier Employing Prism Algorithm with Fuzzy Logic
Building a Classifier Employing Prism Algorithm with Fuzzy LogicIJDKP
 
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...
Vinayaka : A Semi-Supervised Projected Clustering Method Using Differential E...ijseajournal
 
Recommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceRecommendation system using bloom filter in mapreduce
Recommendation system using bloom filter in mapreduceIJDKP
 
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEA CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSE
A CONCEPTUAL METADATA FRAMEWORK FOR SPATIAL DATA WAREHOUSEIJDKP
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
Assessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data LinkagesAssessment of Cluster Tree Analysis based on Data Linkages
Assessment of Cluster Tree Analysis based on Data Linkagesjournal ijrtem
 
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary AlgorithmAutomatic Unsupervised Data Classification Using Jaya Evolutionary Algorithm
Automatic Unsupervised Data Classification Using Jaya Evolutionary Algorithmaciijournal
 
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSA HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETS
A HYBRID MODEL FOR MINING MULTI DIMENSIONAL DATA SETSEditor IJCATR
 
Similarity distance measures
Similarity  distance measuresSimilarity  distance measures
Similarity distance measuresthilagasna
 
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...
An Approach to Mixed Dataset Clustering and Validation with ART-2 Artificial ...Happiest Minds Technologies
 
Classification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterClassification By Clustering Based On Adjusted Cluster
Classification By Clustering Based On Adjusted ClusterIOSR Journals
 
Study of the Class and Structural Changes Caused By Incorporating the Target ...
Study of the Class and Structural Changes Caused By Incorporating the Target ...Study of the Class and Structural Changes Caused By Incorporating the Target ...
Study of the Class and Structural Changes Caused By Incorporating the Target ...ijceronline
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 

IRJET- Optimal Number of Cluster Identification using Robust K-Means for Sequences in Categorical Sequences

  • 1. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 06 Issue: 06 | June 2019 www.irjet.net p-ISSN: 2395-0072 © 2019, IRJET | Impact Factor value: 7.211 | ISO 9001:2008 Certified Journal | Page 3637
    Optimal Number of Cluster Identification using Robust K-means for Sequences in Categorical Sequences
    S.U. Patil1, U.A. Nuli2
    1,2Computer Science and Engineering department, M. Tech, Textile and Engineering Institute, Ichalkaranji, Maharashtra, India
    Abstract - This paper presents a modified k-means algorithm for clustering. In the traditional method of clustering, the number of clusters to be formed is given at the start of the algorithm, which affects the performance and efficiency of the algorithm. In Robust K-means for Sequences, the optimal number of clusters is predicted by removing noise clusters. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which mainly focused on attribute-value data, this paper works on the cluster validation problem for categorical sequences. Clustering is defined as unsupervised learning in which objects are grouped on the basis of some similarity inherent among them. The intention of this paper is to describe a clustering method that gives the optimal number of clusters in categorical sequences.
    Key Words: Clustering, K-means, cluster validation index, categorical sequences, centroid.
    1. INTRODUCTION
    Data mining is the process of deriving required data from a large dataset and analyzing the collected data.
    The data mining technique mines information from a bulky data set and transforms it into an understandable form for further use. In generic terms this is known as classification. When the classes of the objects are given in advance, it is termed supervised classification, whereas the case in which no class label is attached to an object in advance is termed unsupervised classification. Unsupervised classification is commonly known as clustering. Clustering is an important analysis technique that is applied to large datasets and finds application in fields like search engines, recommendation systems, data mining, knowledge discovery, bioinformatics and documentation. Nowadays, the data being generated is not only huge in volume, but is also stored across various machines all around the world. The main purpose behind the study of classification is to develop a tool or an algorithm that can be used to predict the class of an unknown, unlabeled object. The clustering problem cannot be solved by one specific algorithm; it requires various algorithms that differ significantly in their notion of what makes a cluster and how to find clusters efficiently. Generally, clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set and the intended use of the results. Clustering is considered to be more difficult than supervised classification because no label is attached to the patterns. In supervised classification, the given label serves as a clue for grouping the data objects as a whole, whereas in clustering it is difficult to decide to which group a pattern belongs in the absence of a label.
    The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. This is often the case in many domains, where data is described by a set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. Moreover, clustering categorical sequences is challenging for one more reason: the difficulty of defining an inherently meaningful measure of similarity between sequences. Cluster validation, which is the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Without a measure of distance between data values, it is unclear how to define a quality measure for categorical clustering. To do this, we employ mutual information, a measure from information theory. A good clustering is one where the clusters are informative about the data objects they contain. Since data objects are expressed in terms of attribute values, we require that the clusters convey information about the attribute values of the objects in the cluster. The evaluation of sequence clustering is currently difficult due to the lack of an internal validation criterion defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intra-cluster structural compactness and inter-cluster structural separation linearly combined to measure the quality of sequence clusters. Cluster validation,
  • 2. which is the process of evaluating the quality of clustering results, plays an important role in many issues of cluster analysis. A partition-based algorithm for robust clustering of categorical sequences and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. Currently, cluster validation remains an open problem due to the unsupervised nature of clustering tasks, where no external validation criterion is available to evaluate the result. Among the different aspects of cluster validation, we are interested in the problem of determining the optimal number of clusters in a data set, as most basic clustering algorithms assume that the number of clusters is a user-defined parameter, which, however, is difficult to set in practice.
    2. RELATED WORK
    K. A. Abdul Nazeer et al. [12] discuss one major drawback of the k-means algorithm: for different sets of initial centroids, it produces different clusters, so the final cluster quality depends on the selection of the initial centroids. The original k-means algorithm has two phases: the first determines the initial centroids, and the second assigns data points to the nearest clusters and then recalculates the cluster means. The paper discusses an enhanced clustering method in which both phases of the original k-means algorithm are modified to improve accuracy and efficiency. This algorithm combines a systematic method for finding initial centroids and an efficient way of assigning data points to clusters.
    But there is still a limitation in this enhanced algorithm: the value of k, the number of desired clusters, must be given as an input, regardless of the distribution of the data points.
    G. Dong and J. Pei [10] state that an algorithm is expected to produce high-quality results associated with the underlying cluster structures in the data set, given K. This problem becomes difficult for categorical sequences because they are usually infected with significant quantities of noise, which confuses the identification of cluster structures. Clusters that contain few objects after an algorithm is applied are considered K-dependent noise. Such cluster-number-dependent noise would mislead the validation criterion used to evaluate the clustering quality. The shortcomings of the standard k-means algorithm are also analyzed: k-means has to calculate the distance between each data object and all cluster centers in each iteration, and this repetitive process affects the efficiency of the clustering algorithm.
    R. Xu and D. Wunsch [11] consider the existing partition-based methods, in which the performance of the k-means algorithm is evaluated with various data sets such as Iris, Wine, Vowel, Ionosphere and Crude Oil and with various distance metrics. Because the amount of data across the world keeps growing, analyzing it turns out to be a very difficult task; to understand and learn from the data, it has to be classified into meaningful collections, hence the need for data mining techniques. It is concluded that the performance of k-means clustering depends on the database used as well as on the distance metric. A comparative analysis of one partition clustering algorithm (k-means) and one hierarchical clustering algorithm (agglomerative) is also presented. The performance of the k-means and hierarchical clustering algorithms is calculated using the WEKA tool on the basis of accuracy and running time.
    This work finds that the accuracy of k-means is higher than that of hierarchical clustering for the Iris dataset, which has real attributes, and that the accuracy of hierarchical clustering is higher than that of k-means for the diabetes dataset, which has integer and real attributes. So for large datasets the k-means algorithm is a good choice.
    R. Amutha et al. [13] proposed that the best results are acquired when two or more algorithms of the same category of clustering technique are used. Two k-means algorithms are considered: Parallel k/h-Means Clustering for Large Data Sets and A Novel K-Means Based Clustering Algorithm for High Dimensional Data Sets. The Parallel k/h-Means algorithm is designed to deal with very large data sets. Novel K-Means Based Clustering provides the advantages of using both HC and K-Means. Using these two algorithms, the space and the similarity between the data sets present at each node are extended. However, there is a general lack of adaptive mechanisms to deduce a reasonable number of clusters with the possible K-dependent noise identified, given that the user-defined K likely deviates from the true number. As a solution to this problem, a novel cluster validation method that provides robust clustering is used.
    Navjot Kaur and Navneet Kaur [4] enhanced the traditional k-means by introducing a Ranking Method, which overcomes the long execution time of traditional k-means. The Ranking Method is a way to find the occurrence of similar data and to improve search effectiveness. The improved algorithm is implemented in C# using Visual Studio 2008. The advantages of k-means are also analyzed in that paper: the authors find k-means to be a fast, robust and easily understandable algorithm, and note that its clusters are non-hierarchical and non-overlapping. The process used in the algorithm takes student marks as the data set, and then the initial centroids are selected. The Euclidean distance from the centroid is then calculated for each data object.
    Then the threshold value is set for each data set. The Ranking Method is applied next, and finally the clusters are created based on the minimum distance between the data point and the centroid. As future scope, that paper suggests that Query Redirection can be used to cluster huge amounts of data from various databases.
    3. METHODOLOGY
    In the partition-based robust K-means for sequences (RKMS) method for robust clustering of categorical sequences, the
  • 3. data which is given as input is in character form. There are difficulties in defining an inherently meaningful measure of similarity between sequences. To apply clustering algorithms and existing indices to sequences, one has to resort to a vectorization method. To convert the sequences into vector form, they are encoded in Latin-1, also called ISO-8859-1, an 8-bit character set endorsed by the International Organization for Standardization (ISO) that represents the alphabets of Western European languages. Once the character file is converted to binary format, the Principal Component Analysis (PCA) algorithm is applied. The main idea of PCA is to reduce the dimensionality of a data set consisting of many variables that are correlated with each other, either heavily or lightly, while retaining the variation present in the data set to the maximum extent. Using PCA is very important because the dataset is in a very diverse format, and analyzing this type of data becomes very difficult without it. After this preprocessing, the RKMS algorithm can be applied to the data. The number of clusters, K, produced by RKMS is not necessarily equal to the given number K̂ (we call K̂ the expected number of clusters). The number of clusters is adaptively determined by the elimination of K-dependent noise during the RKMS clustering. Two key issues for a robust K-means type algorithm are the selection of the initial cluster centroids and the identification of noise clusters. In general, those groups that contain a small number of objects can be viewed as noise.
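The vectorization step described above can be illustrated with a minimal Python sketch. It only covers the Latin-1 encoding; the function name `encode_latin1`, the zero padding and the sample strings are assumptions made for illustration and are not taken from the paper. PCA would then be applied to the resulting fixed-length rows.

```python
def encode_latin1(sequences, pad_value=0):
    """Encode each string as a fixed-length list of Latin-1 byte values.

    Hypothetical helper: the padding scheme is an assumption, the paper
    does not say how sequences of different lengths are aligned.
    """
    encoded = [list(s.encode("latin-1")) for s in sequences]
    width = max((len(row) for row in encoded), default=0)
    # Right-pad shorter sequences so that every row has the same length.
    return [row + [pad_value] * (width - len(row)) for row in encoded]


# Illustrative input only; "\u00e9" is a character that Latin-1 can represent.
vectors = encode_latin1(["ACGT", "AC", "caf\u00e9"])
```

Every character in Latin-1 maps to exactly one byte, so the row length equals the longest sequence length; characters outside Latin-1 would raise a `UnicodeEncodeError` and would need separate handling.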
    To identify a noise cluster, a threshold is given; if the number of objects in a cluster is less than the threshold, that cluster is considered noise. After the clusters are created, their quality is measured using the Cluster Validity Index, for which the structural dissimilarity between the cluster objects is considered. This new index is a linear combination of the within-cluster scatter (how objects are scattered inside a cluster) and the between-cluster separation (how clusters are separated); both measures are computed on the structural dissimilarity of the sequences, which is a probabilistic approach.
    Fig. 1. System Architecture of RKMS method
    For high-dimensional data, the robustness of the K-means algorithm may become a problem. Two main issues arise in this algorithm: one is selecting the initial centroid values, and the other is finding and removing the noise clusters. The first issue arises because the performance of the K-means algorithm, as an instance of the EM method [13], is known to be sensitive to the initial conditions (here, the initial cluster centroids). Since the number of clusters is fixed at K̂ given by the users, the traditional algorithm likely produces clusters consisting of K̂-dependent noise, especially when the expected number deviates largely from the true number. RKMS aims at building a robust condition for the coming iterations by choosing a set of well-scattered objects (sequences) as the initial cluster centroids. The selection method is based on the greedy approach.
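A greedy selection of well-scattered initial centroids can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sequences have already been turned into numeric vectors, uses Euclidean distance as a stand-in for the paper's distance, and the names `maxmin_init` and `euclidean` are hypothetical. The first two centroids are the farthest pair; each further centroid is the point whose nearest chosen centroid is farthest away.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_init(points, k_hat, dist=euclidean):
    """Return the indices of k_hat well-scattered points (assumed scheme)."""
    if k_hat < 2 or k_hat > len(points):
        raise ValueError("k_hat must be between 2 and len(points)")
    # First two centroids: the pair with the maximal pairwise distance.
    first, second = max(
        combinations(range(len(points)), 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    chosen = [first, second]
    while len(chosen) < k_hat:
        rest = [n for n in range(len(points)) if n not in chosen]
        # Next centroid: maximize the distance to the closest chosen centroid.
        nxt = max(rest, key=lambda n: min(dist(points[n], points[c]) for c in chosen))
        chosen.append(nxt)
    return chosen


# Toy data, for illustration only.
points = [(0, 0), (1, 0), (10, 0), (5, 0), (5, 8)]
centroids = maxmin_init(points, 3)
```

The pairwise search is quadratic in the number of objects, which is acceptable for a sketch but would need sampling or indexing on a large sequence set.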
  • 4. The (k + 1)th initial centroid is chosen based on the maximum-minimum principle, given by $I_{k+1} = \arg\max_{s \in S} \min_{1 \le j \le k} d(s, I_j)$ (2), where S is the sequence set and $d(\cdot, \cdot)$ denotes the distance between two sequence vectors. In the traditional greedy approach, however, the first centroid is chosen randomly or according to the length of the object vector. Since each sequence has been represented as a unit vector, as Definition 1 shows, we choose the first two centroids I1 and I2 by $(I_1, I_2) = \arg\max_{s, s' \in S} d(s, s')$. That is, the two sequences having the maximal pairwise distance are chosen as the initial centroids for C1 and C2; the remaining K̂ − 2 centroids are then determined by repeatedly using (2) with k = 2, ..., K̂ − 1. RKMS also deletes the possible K-dependent noise in each iteration, to further enhance the robustness of the algorithm. The details will be given in Section III-C. Note that the number of resulting clusters produced by RKMS could be less than the given K̂, due to the inclusion of this step in the algorithm. This feature is somewhat similar to that of [10] and [12], where the number can be reduced by fading out the redundant clusters; however, RKMS targets the identification of K-dependent noise. Given a clustering of the sequence set, C, we now aim at establishing a judgement criterion to examine whether each group Ck ∈ C is a noise cluster or not. In general, those groups that contain a small number of objects can be viewed as noise. According to this view, one can derive a necessary condition for the judgement, i.e., nk = |Ck| ≤ τ, where nk is the number of objects in Ck and τ is a threshold defining the minimal number of objects in a normal cluster. If the condition holds with τ = 1, then the cluster is undoubtedly noise, as it consists of one single object. However, this is not the case when 1 < nk ≤ τ, because the assertion would depend on the distribution of the objects in the cluster.
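The size condition nk ≤ τ can be sketched in Python as below. This is a simplified illustration under stated assumptions: it treats the size condition as sufficient, whereas the text says it is only necessary when 1 < nk ≤ τ, and it reassigns the members of a removed cluster to the nearest remaining centroid. The names `remove_small_clusters` and `euclidean`, the dictionary layout and the toy data are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def remove_small_clusters(clusters, centroids, tau, dist=euclidean):
    """Return a new clustering without the clusters of size <= tau.

    clusters:  dict mapping a cluster label to a list of objects
    centroids: dict mapping a cluster label to its centroid
    """
    kept = {k: list(members) for k, members in clusters.items() if len(members) > tau}
    if not kept:
        raise ValueError("tau removes every cluster")
    for label, members in clusters.items():
        if label in kept:
            continue
        # Reassign each object of a removed cluster to its closest kept cluster.
        for obj in members:
            nearest = min(kept, key=lambda c: dist(obj, centroids[c]))
            kept[nearest].append(obj)
    return kept


# Toy data: cluster 2 holds a single object and is removed with tau = 1.
clusters = {0: [(0, 0), (1, 0), (0, 1)], 1: [(9, 9), (10, 9), (9, 10)], 2: [(8, 8)]}
centroids = {0: (1 / 3, 1 / 3), 1: (28 / 3, 28 / 3), 2: (8, 8)}
result = remove_small_clusters(clusters, centroids, tau=1)
```

A fuller version would compare the clustering with and without the candidate cluster using a model-selection criterion before deleting it.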
    To address the problem, we propose a model-selection-based method, in view of the observation that if each object in a cluster is indeed K-dependent noise, then the preferred clustering model should be the one with that cluster removed. With regard to Ck ∈ C, we denote the new clustering after the removal of Ck by C′, created by reassigning each sequence originally belonging to Ck to its closest cluster other than Ck. Thus, the number of clusters satisfies |C′| = |C| − 1 and, consequently, the conclusion that Ck is a noise cluster can be drawn if the model of C′ is preferred to that of C. The structure-based cluster validity index has two main components: the structural dissimilarity (SD) of sequences and the new cluster validity index (CVI). SD for sequences is measured by Markovian modeling, which is based on the hypothesis that the occurrence of each symbol xl in the sequence s = x1 . . . xl . . . xL is closely related to its preceding subsequence x0x1 . . . xl−1. For a Markov model of order ρ, the preceding subsequence of xl is truncated to δl = xl−ρ . . . xl−1, excluding xl−ρ for l < ρ. By such modeling, the sequential dependence of each symbol xl on its preceding subsequence δl can be characterized using the conditional probability ps(xl | δl) with regard to the sequence s, $p_s(x_l \mid \delta_l) = p_s(\delta_l x_l) / p_s(\delta_l)$, where ps(δ) is the probability of δ with regard to s. The SD between two sequences s and s′ can then be numerically computed by measuring the dissimilarity of their probability distributions ps(X) and ps′(X). The next part is the CVI, VSD, which validates the quality of a series of clustering results, each generated by the clustering algorithm on the same sequence set S with various numbers of sequence clusters. To distinguish the results with different cluster numbers, we now use CK to denote C consisting of K sequence clusters, i.e., CK = {C1, . . . , Ck, . . . , CK}. Denote the minimal and the maximal number of clusters in the series by Kmin and Kmax, respectively.
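The Markovian modeling of a single sequence can be illustrated by estimating the conditional probabilities from counts. The sketch below is an assumption-laden example rather than the paper's estimator: it uses plain relative frequencies without smoothing, represents the truncated context at the start of the sequence by a shorter (possibly empty) string, and the function name `conditional_probs` is hypothetical.

```python
from collections import Counter


def conditional_probs(sequence, order):
    """Estimate p_s(x | delta) for every (context, symbol) pair seen in s.

    The context delta is the `order` symbols preceding x, truncated
    near the start of the sequence.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for position, symbol in enumerate(sequence):
        context = sequence[max(0, position - order):position]
        pair_counts[(context, symbol)] += 1
        context_counts[context] += 1
    # Relative frequency of each symbol among the occurrences of its context.
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


# Toy sequence: "A" is followed by "B" twice and by "C" once.
probs = conditional_probs("ABABAC", order=1)
```

Two sequences could then be compared by a dissimilarity between their estimated distributions; which dissimilarity the paper uses is not recoverable from this text, so it is left out here.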
    The new index is a linear combination of the within-cluster scatter and the between-cluster separation, where both measures are computed on the structural dissimilarity of sequences using SD(·, ·) of (5). The total within-cluster scatter and between-cluster separation of CK are denoted by Scat(CK) and Sep(CK), respectively. Since, in practice, the number of clusters ranges over [Kmin, Kmax], the CVI is defined as a linear combination of the normalized measures, with the normalizers maxScat = max over K′ ∈ [Kmin, Kmax] of Scat(CK′) and maxSep = max over K′ ∈ [Kmin, Kmax] of Sep(CK′). The index offers a tradeoff between the within-cluster scatter and the between-cluster separation. The minimal VSD is considered to be
  • 5. associated with the clustering results produced by the algorithm using the optimal number of clusters.
    4. RESULT AND DISCUSSION
    The main purpose is to detect the optimal number of clusters on the "USDA Plants Dataset" using the K-means clustering algorithm. The database contains all plants (species and genera) and the states of the USA and Canada where they occur. The dataset used for this research contains 34781 records with 70 features.
    DataSet Characteristics: Multivariate
    Attribute Characteristics: Categorical
    Associated Tasks: Clustering
    Number of Instances: 22632
    Number of Attributes: 70
    Missing Values? Yes
    Table -1: Dataset features
    To implement this we use silhouette analysis, which can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters visually. This measure has a range of [-1, 1]. Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster. Below we have the silhouette coefficients for the clusters of the K-means algorithm on 1000 attribute data for the respective values of K.
    K value passed | Silhouette coefficient
    10 | 0.161727
    20 | 0.214786
    30 | 0.218644
    40 | 0.255462
    Table-2: K values with respective silhouette score
    The performance of RKMS is evaluated in terms of clustering accuracy (CA), which is computed from K, the true number of clusters, and ak, the number of sequences in the majority group corresponding to Ck. To examine the strength of the robust method used to choose the initial cluster centroids in RKMS, we also performed clustering using K-means with random initialization [1], [4]. The data set was clustered by K-means for 20 executions, and both the average and the best accuracy are reported. After applying the RKMS algorithm to the same data set, the optimal number of clusters obtained is as below.
    Estimated number of clusters | Silhouette coefficient
    23 | 0.361
    Fig-2. Cluster image with 500 data elements
    5. CONCLUSION
    A robust algorithm named RKMS is presented for partition-based sequence clustering. In the RKMS algorithm, unlike traditional partitioning methods, the number of clusters (K) can be automatically deduced from the expected number (K̂, given by the users) by the elimination of cluster-number-dependent noise (called K-dependent noise). RKMS and VSD are applied within the common model selection procedure to estimate the optimal number of clusters in a sequence set. A cluster validation method for categorical sequence clustering is proposed, which is different from the existing methods mainly designed for numeric data. A probabilistic approach is used to measure the structural dissimilarity of categorical sequences, based on which the within-cluster compactness and between-cluster separation are formulated, in order to evaluate the structural similarity of sequences in each cluster and the structural dissimilarity between sequence clusters. To evaluate the quality of sequence clusters, a new CVI named VSD is defined by linearly combining the within-cluster structural compactness and the between-cluster structural separation.
  • 6. REFERENCES
    [1] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering validity checking methods: Part II,” ACM SIGMOD Rec. Arch., vol. 31, no. 3, pp. 19–27, 2002.
    [2] C. C. Aggarwal, Data Mining: The Textbook. New York, NY, USA: Springer, 2015.
    [3] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” J. Cybern., vol. 3, no. 3, pp. 32–57, 1973.
    [4] T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,” Commun. Stat., vol. 3, no. 1, pp. 1–27, 1974.
    [5] H. Sun, S. Wang, and Q. Jiang, “FCM-based model selection algorithms for determining the number of clusters,” Pattern Recognit., vol. 37, no. 10, pp. 2027–2037, Oct. 2004.
    [6] L. Xu, T. W. S. Chow, and E. W. M. Ma, “Topology-based clustering using polar self-organizing map,” IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 4, pp. 798–808,
    [7] D. Pelleg and A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in Proc. 17th ICML, 2000, pp. 727–734.
    [8] J. Yang and W. Wang, “CLUSEQ: Efficient and effective sequence clustering,” in Proc. IEEE ICDE, Mar. 2003, pp. 101–112.
    [9] M. M.-T. Chiang and B. Mirkin, “Intelligent choice of the number of clusters in k-means clustering: An experimental study with different cluster spreads,” J. Classif., vol. 27, no. 1, pp. 3–40, 2010.
    [10] G. Dong and J. Pei, “Classification, clustering, features and distances of sequence data,” Seq. Data Min., vol. 33, pp. 47–65, 2007.
    [11] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Trans. Neural Netw., vol. 16, no. 3, pp. 645–678, May 2005.
    [12] K. A. Abdul Nazeer and M. P. Sebastian, “Improving the accuracy and efficiency of the k-means clustering algorithm,” in Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, July 1-3, 2009, London, U.K.