Mythili S et al, International Journal of Computer Science and Mobile Computing, Vol.3 Issue.1, January 2014, pg. 334-340
© 2014, IJCSMC. All Rights Reserved.
Available Online at www.ijcsmc.com
International Journal of Computer Science and Mobile Computing
A Monthly Journal of Computer Science and Information Technology
ISSN 2320–088X
IJCSMC, Vol. 3, Issue. 1, January 2014, pg. 334 – 340
RESEARCH ARTICLE
An Analysis on Clustering Algorithms in
Data Mining
Mythili S 1, Madhiya E 2
Assistant Professor, PSGR Krishnammal College for Women, Coimbatore1
Assistant Professor, PSGR Krishnammal College for Women, Coimbatore2
smythisri@gmail.com, emadhiya@gmail.com
Abstract:
Clustering is the grouping together of similar data items into clusters. Cluster analysis is one of the main analytical methods in data mining, and the choice of clustering algorithm directly influences the clustering results. This paper discusses the main types of clustering algorithms, such as the k-means family, and analyzes the advantages and shortcomings of each. In many of these algorithms, the distance between each data object and every cluster center must be computed in each iteration, which keeps the efficiency of clustering low. This paper provides a broad survey of the most basic techniques and also addresses issues of clustering algorithms such as time complexity and accuracy, with the aim of producing better results in various environments. The results are discussed in the context of large datasets.
Keywords: Clustering; datasets; machine-learning; deterministic
I INTRODUCTION
A large number of clustering definitions can be found in the literature, from simple to elaborate.
The simplest definition is shared among all and includes one fundamental concept: the grouping together
of similar data items into clusters. Cluster analysis is the organization of a collection of patterns (usually
represented as a vector of measurements, or a point in a multidimensional space) into clusters based on
similarity. It is important to understand the difference between clustering (unsupervised classification)
and discriminant analysis (supervised classification). In supervised classification, we are provided with a
collection of labeled (preclassified) patterns [1]; the problem is to label a newly encountered, yet
unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of
classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a
given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with
clusters also, but these category labels are data driven; that is, they are obtained solely from the data.
Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and
machine-learning situations, including data mining, document retrieval, image segmentation, and pattern
classification. However, in many such problems, there is little prior information (e.g., statistical models)
available about the data, and the decision-maker must make as few assumptions about the data as
possible. It is under these restrictions that clustering methodology is particularly appropriate for the
exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of
their structure. The term “clustering” is used in several research communities to describe methods for
grouping of unlabeled data [2]. These communities have different terminologies and assumptions for the
components of the clustering process and the contexts in which clustering is used. Thus, we face a
dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a
monumental task given the sheer mass of literature in this area. The accessibility of the survey might also
be questionable given the need to reconcile very different vocabularies and assumptions regarding
clustering in the various communities [3]. The goal of this paper is to survey the core concepts and
techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where
appropriate, references will be made to key concepts and techniques arising from clustering methodology
in the machine-learning and other communities.
Figure 1: Stages in Clustering
Typical pattern clustering activity involves the following steps [Jain and Dubes 1988] (a minimal end-to-end sketch in code follows the list):
(1) pattern representation (optionally including feature extraction and/or selection),
(2) definition of a pattern proximity measure appropriate to the data domain,
(3) clustering or grouping,
(4) data abstraction (if needed), and
(5) assessment of output (if needed).
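To make the five stages concrete, the following is a minimal sketch in Python, assuming scikit-learn and synthetic data; the feature scaling, the choice of k-means (which implies squared Euclidean proximity), and the silhouette score as the assessment index are illustrative choices, not prescribed by the paper.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                 # raw patterns (assumed data)

# (1) pattern representation: scale features so no single one dominates
X_rep = StandardScaler().fit_transform(X)

# (2) proximity measure: KMeans below uses squared Euclidean distance

# (3) clustering or grouping
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_rep)

# (4) data abstraction: summarize each cluster by its centroid
centroids = np.array([X_rep[labels == k].mean(axis=0) for k in range(3)])

# (5) assessment of output: silhouette score as one validity index
print("silhouette:", silhouette_score(X_rep, labels))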
II LITERATURE SURVEY
Different approaches to clustering data can be described with the help of a hierarchy (other
taxonometric representations of clustering methodology are possible; ours is based on the discussion in
Jain and Dubes[1988]). At the top level, there is a distinction between hierarchical and partitional
approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce
only one).
—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An
agglomerative approach[4] begins with each pattern in a distinct (singleton) cluster, and successively
merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns
in a single cluster and performs splitting until a stopping criterion is met.
—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the
clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between
patterns [5], and decisions are based on those distances. A simple monothetic algorithm reported in
Anderberg [1973] considers features sequentially to divide the given collection of patterns [14]. Here, the collection is divided into two groups using feature x1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x2, as depicted by the broken lines H1 and H2. The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated is so large that the data set is divided into uninterestingly small and fragmented clusters.
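The sequential splitting just described can be sketched in a few lines of Python. Splitting at each feature's median is an assumption for illustration (the method only requires some separating threshold per feature); the 2^d blow-up is visible in the number of leaf groups.

import numpy as np

def monothetic_split(X, features):
    groups = [np.arange(len(X))]
    for f in features:                        # use one feature at a time
        new_groups = []
        for idx in groups:
            thresh = np.median(X[idx, f])     # assumed split point
            new_groups.append(idx[X[idx, f] <= thresh])
            new_groups.append(idx[X[idx, f] > thresh])
        groups = [g for g in new_groups if len(g) > 0]
    return groups                             # up to 2**len(features) clusters

X = np.random.default_rng(1).normal(size=(50, 2))
print([len(g) for g in monothetic_split(X, features=[0, 1])])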
—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its
operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters
to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to
the cluster with the largest measure of membership.
—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize
a squared error function. This optimization can be accomplished using traditional techniques or through a
random search[6] of the state space consisting of all possible labelings.
Hierarchical Clustering Algorithms
A representative algorithm of this kind is hierarchical clustering, which is implemented in the popular
numerical software MATLAB [15]. This algorithm is an agglomerative algorithm that has several
variations depending on the metric used to measure the distances among the clusters. The Euclidean distance is usually used for individual points. There is no generally accepted criterion for which linkage distance should be used; the best choice appears to depend strongly on the dataset. Among the most widely used variations of hierarchical clustering, based on different distance measures, are the following [16] (a short code comparison follows the list):
1. Average linkage clustering
The dissimilarity between clusters is calculated using average values. The average distance is calculated from the distance between each point in a cluster and all other points in another cluster. The two clusters with the lowest average distance are joined together to form the new cluster.
2. Centroid linkage clustering
This variation uses the group centroid as the average. The centroid is defined as the center of a cloud of
points.
3. Complete linkage clustering (Maximum or Furthest-Neighbor Method): The dissimilarity between two groups is equal to the greatest dissimilarity between a member of cluster i and a member of cluster j. This method tends to produce very tight clusters of similar cases.
4. Single linkage clustering (Minimum or Nearest-Neighbor Method): The dissimilarity between two clusters is the minimum dissimilarity between members of the two clusters. This method produces long chains which form loose, straggly clusters.
5. Ward's Method: Cluster membership is assigned by calculating the total sum of squared deviations
from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase
in the error sum of squares.
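As a hedged illustration of these variants, the following sketch runs SciPy's agglomerative linkage with each of the five methods on synthetic data; the random data, the cluster count of three, and the default Euclidean metric are assumptions for demonstration only.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(2).normal(size=(40, 3))
for method in ["average", "centroid", "complete", "single", "ward"]:
    Z = linkage(X, method=method)             # Euclidean distances by default
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, np.bincount(labels)[1:])    # cluster sizes per method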
Figure 2: Centroid linkage clustering
Single linkage clustering algorithm: Let D(i, j) be the distance between clusters i and j, defined as described above, and let N(i) be the nearest neighbor of cluster i [7]. A direct rendering of these steps in code follows the list.
1. Initialize as many clusters as data points.
2. For each pair of clusters (i, j), compute D(i, j).
3. For each cluster i, compute N(i).
4. Repeat until the desired number of clusters is obtained:
(a) Determine i and j such that D(i, j) is minimized.
(b) Agglomerate clusters i and j.
(c) Update each D(i, j) and N(i) as necessary.
5. End of repeat.
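A direct, unoptimized NumPy rendering of these steps follows; the target cluster count k plays the role of the stopping rule in step 4, and the quadratic pair scan is kept deliberately naive to mirror the pseudocode.

import numpy as np

def single_linkage(X, k):
    clusters = [[i] for i in range(len(X))]   # step 1: one cluster per point
    def D(a, b):                              # min pairwise distance between members
        return min(np.linalg.norm(X[p] - X[q]) for p in a for q in b)
    while len(clusters) > k:                  # step 4: repeat until k clusters remain
        # steps 2-3 and 4(a): find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: D(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # step 4(b): agglomerate i and j
    return clusters

X = np.random.default_rng(3).normal(size=(20, 2))
print([len(c) for c in single_linkage(X, k=3)])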
Partitional Algorithms
A partitional clustering algorithm obtains a single partition of the data instead of a clustering
structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have
advantages in applications involving large data sets for which the construction of a dendrogram is
computationally prohibitive[8]. A problem accompanying the use of a partitional algorithm is the choice
of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key
design decision. The partitional techniques usually produce clusters by optimizing a criterion function
defined either locally (on a subset of the patterns) or globally (defined over all of the patterns).
Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly
computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with
different starting states, and the best configuration obtained from all of the runs is used as the output
clustering.
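A minimal sketch of this multi-restart practice, assuming scikit-learn's KMeans as the partitional algorithm: each run starts from one random initialization (n_init=1), and the configuration with the lowest within-cluster sum of squared errors (exposed as inertia_) is kept as the output clustering.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(4).normal(size=(200, 2))
best = min((KMeans(n_clusters=4, n_init=1, init="random", random_state=s).fit(X)
            for s in range(10)),              # ten different starting states
           key=lambda km: km.inertia_)        # inertia_ = within-cluster SSE
print("best squared error:", best.inertia_)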
Nearest Neighbor Clustering
Since proximity plays a key role in our intuitive notion of a cluster, nearest neighbor distances
can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978];
it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the
distance to that labeled neighbor is below a threshold [9]. The process continues until all patterns are
labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context
of distance computation) can also be used to grow clusters from near neighbors.
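A hedged sketch of such a threshold-based procedure, in the spirit of Lu and Fu [1978] as summarized above: each unlabeled pattern is attached to the cluster of its nearest labeled neighbor when that distance falls below a threshold t, and passes repeat until no further labelings occur. Seeding the first cluster with the first point, the scan order, and the threshold value are assumptions, not details from the source.

import numpy as np

def nn_cluster(X, t):
    labels = np.full(len(X), -1)              # -1 marks unlabeled patterns
    labels[0] = 0                             # assumed seed: first point starts cluster 0
    changed = True
    while changed and (labels == -1).any():
        changed = False
        for i in np.where(labels == -1)[0]:
            labeled = np.where(labels != -1)[0]
            d = np.linalg.norm(X[labeled] - X[i], axis=1)
            if d.min() < t:                   # attach to nearest labeled neighbor
                labels[i] = labels[labeled[np.argmin(d)]]
                changed = True
    return labels                             # -1 remains for patterns never labeled

X = np.random.default_rng(5).normal(size=(30, 2))
print(nn_cluster(X, t=1.0))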
Fuzzy Clustering
Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one
and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this
notion to associate each pattern[13] with every cluster using a membership function [Zadeh 1965]. The
output of such algorithms is a clustering, but not a partition.
Figure 4: Fuzzy Clustering
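A minimal fuzzy c-means sketch illustrating the membership idea: every pattern receives a degree of membership in every cluster, and a hard clustering can be recovered by assigning each pattern to its largest membership, as noted earlier. The fuzzifier m = 2 and the fixed iteration count are conventional assumptions; the paper does not fix a particular fuzzy algorithm.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, eps=1e-9):
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)         # memberships sum to 1 per pattern
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
        U = 1.0 / (d ** (2 / (m - 1)))        # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

X = np.random.default_rng(6).normal(size=(100, 2))
U, centers = fuzzy_c_means(X, c=3)
hard_labels = U.argmax(axis=1)                # fuzzy clustering -> hard clustering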
III A Comparison of Techniques
In this section we have examined various deterministic and stochastic search techniques to
approach the clustering problem as an optimization problem. A majority of these methods use the squared
error criterion function. Hence, the partitions generated by these approaches are not as versatile as those
generated by hierarchical algorithms. The clusters generated are typically hyperspherical in shape.
Evolutionary approaches are globalized search techniques, whereas the rest of the approaches are localized search techniques [10]. Artificial neural networks (ANNs) and genetic algorithms (GAs) are inherently parallel, so they can be implemented using
parallel hardware to improve their speed. Evolutionary approaches are population-based; that is, they
search using more than one solution at a time, and the rest are based on using a single solution at a time.
ANNs, GAs, simulated annealing (SA), and Tabu search (TS) are all sensitive to the selection of various learning/control
parameters. In theory, all four of these methods are weak methods [Rich 1983] in that they do not use
explicit domain knowledge. An important feature of the evolutionary approaches is that they can find the
optimal solution even when the criterion function is discontinuous [11].
K-means algorithm
The K-means algorithm, probably the earliest clustering algorithm proposed, is based on a very simple idea: given a set of initial clusters, assign each point to one of them, then replace each cluster center by the mean of the points in the respective cluster [17]. These two simple steps are repeated until convergence. A point is assigned to the cluster whose center is closest to it in Euclidean distance. Although K-means has the great advantage of being easy to implement, it has two big drawbacks [12]. First, it can be really slow, since in each step the distance between every point and every cluster center has to be calculated, which can be very expensive for a large dataset. Second, the method is very sensitive to the provided initial clusters; in recent years, however, this problem has been addressed with some degree of success.
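The two steps described above translate directly into NumPy: an assignment step that maps each point to its nearest center in Euclidean distance, and an update step that replaces each center with the mean of its assigned points. Sampling the initial centers from the data is an assumption, since the paper does not fix an initialization scheme, and it is exactly this choice that the second drawback concerns.

import numpy as np

def k_means(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # assumed init
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)             # assignment step
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):         # converged: centers stopped moving
            break
        centers = new                         # update step
    return labels, centers

X = np.random.default_rng(7).normal(size=(150, 2))
labels, centers = k_means(X, k=3)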
IV Conclusions and Future Directions
Given a data set, the ideal scenario would be to have a well-defined set of criteria for choosing a proper clustering algorithm to apply. Choosing a clustering algorithm, however, can be a difficult task. Even
finding just the most relevant approaches for a given data set is hard. Most of the algorithms generally
assume some implicit structure in the data set. The problem, however, is that usually you have little or no
information regarding the structure, which is, paradoxically, what you want to uncover. The worst case
would be one in which previous information about the data or the clusters is unknown, and a process of
trial and error is the best option. However, there are many elements that are usually known, and can be
helpful in choosing an algorithm. One of the most important elements is the nature of the data and the
nature of the desired cluster. Another issue to keep in mind is the kind of input and tools that the
algorithm requires. For example, some algorithms use numerical inputs, some use categorical inputs;
some require a definition of a distance or similarity measure for the data. The size of the data set is also important to keep in mind, because most clustering algorithms require multiple scans of the data to achieve convergence.
An additional issue related to selecting an algorithm is correctly choosing the initial set of
clusters. As was shown in the numerical results, an adequate choice of clusters can strongly influence
both the quality of and the time required to obtain a solution. Also important is that some clustering
methods, such as hierarchical clustering, need a distance matrix which contains all the distances between
every pair of elements in the data set. While these methods are appealingly simple, the size of this matrix grows as m^2 for m elements, which can be prohibitive due to memory constraints, as was shown in the experiments.
Recently this issue has been addressed, resulting in new variations of hierarchical and reciprocal nearest
neighbor clustering. This paper provides a broad survey of the most basic techniques.
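As a quick back-of-the-envelope check on the m^2 claim, assuming one million elements and 8-byte double-precision distances:

m = 1_000_000                       # assumed dataset size
bytes_needed = m * m * 8            # full m x m matrix of float64 distances
print(bytes_needed / 1e12, "TB")    # 8.0 TB: far beyond typical main memory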
REFERENCES
[1] L. Parsons, E. Haque, and H. Liu, "Subspace clustering for high dimensional data: a review," ACM SIGKDD Explorations Newsletter, vol. 6, pp. 90-105, 2004.
[2] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, "A framework for projected clustering of high dimensional data streams," in Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, 2004, p. 863.
[3] R. Agrawal, J. E. Gehrke, D. Gunopulos, and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications," Google Patents, 1999.
[4] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, "Subspace clustering of high dimensional data," 2004.
[5] X. Z. Fern and C. E. Brodley, "Random projection for high dimensional data clustering: A cluster ensemble approach," 2003, p. 186.
[6] H. P. Kriegel, P. Kröger, and A. Zimek, "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering," 2009.
[7] R. Xu and D. C. Wunsch II, Clustering. John Wiley & Sons, Inc., 2008.
[8] S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams: Theory and practice," IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 515-528, 2003.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Computing Surveys, vol. 31, pp. 264-323, 1999.
[10] P. C. Biswal, Discrete Mathematics and Graph Theory. New Delhi: Prentice Hall of India, 2005.
[11] M. Khalilian, Discrete Mathematics Structure. Karaj: Sarafraz, 2004.
[12] K. J. Cios, W. Pedrycz, and R. M. Swiniarski, "Data mining methods for knowledge discovery," IEEE Transactions on Neural Networks, vol. 9, pp. 1533-1534, 1998.
[13] V. V. Raghavan and K. Birchard, "A clustering strategy based on a formalism of the reproductive process in natural systems," 1979, pp. 10-22.
[14] G. Salton and C. Buckley, "Term weighting approaches in automatic text retrieval," Information Processing & Management, vol. 24, pp. 513-523, 1988.
[15] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[16] D. Isa, V. P. Kallimani, and L. H. Lee, "Using the self organizing map for clustering of text documents," Expert Systems With Applications, vol. 36, pp. 9584-9591, 2009.
[17] Shi Na and Liu Xumin, "Research on k-means clustering algorithm," IEEE Third International Symposium on Intelligent Information Technology and Security Informatics, 2010.
