International Journal of Engineering Science Invention
ISSN (Online): 2319 – 6734, ISSN (Print): 2319 – 6726
www.ijesi.org Volume 2 Issue 7 ǁ July. 2013 ǁ PP.01-08
Multidimensional Clustering Methods of Data Mining for
Industrial Applications
Ashok Kumar Thavani Andu¹, Dr. Antony Selvdoss Thanamani²
¹ Assistant Professor, CMS College, Coimbatore, Tamil Nadu, India
² Professor & Head, NGM College, Pollachi, Tamil Nadu, India
ABSTRACT: Clustering is the unsupervised classification of patterns (observations, data items, or feature
vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers
in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data
analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and
contexts in different communities have made the transfer of useful generic concepts and methodologies slow to
occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition
perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the
broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify
cross-cutting themes and recent advances. We also describe some important applications of clustering
algorithms such as image segmentation, object recognition, and information retrieval.
I. INTRODUCTION
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of
clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
We can show this with a simple graphical example:
In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion
is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in
this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
II. THE GOALS OF CLUSTERING
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs.
Possible Applications
Clustering algorithms can be applied in many fields, for instance:
- Marketing: finding groups of customers with similar behavior, given a large database of customer data containing their properties and past buying records;
- Biology: classification of plants and animals given their features;
- Libraries: book ordering;
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
- City-planning: identifying groups of houses according to their house type, value, and geographical location;
- Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
- WWW: document classification; clustering weblog data to discover groups of similar access patterns.
Requirements
The main requirements that a clustering algorithm should satisfy are:
- scalability;
- dealing with different types of attributes;
- discovering clusters with arbitrary shape;
- minimal requirements for domain knowledge to determine input parameters;
- ability to deal with noise and outliers;
- insensitivity to the order of input records;
- handling high dimensionality;
- interpretability and usability.
Problems
There are a number of problems with clustering. Among them:
- current clustering techniques do not address all the requirements adequately (and concurrently);
- dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity;
- the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
- if an obvious distance measure doesn't exist, we must “define” it, which is not always easy, especially in multi-dimensional spaces;
- the result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways.
Clustering Algorithms
Classification
Clustering algorithms may be classified as listed below:
- Exclusive Clustering
- Overlapping Clustering
- Hierarchical Clustering
- Probabilistic Clustering
In the first case, data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster it cannot be included in another cluster. A simple example is shown in the figure below, where the separation of points is achieved by a straight line. By contrast, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership; in this case, data are associated with an appropriate membership value.
A hierarchical clustering algorithm, instead, is based on the union of the two nearest clusters. The initial condition treats every datum as its own cluster; after a number of merging iterations, the desired final clusters are reached.
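To make this concrete, here is a naive Python sketch of the agglomerative procedure just described; the centroid-linkage merge rule, the function name, and the toy data are our own illustrative assumptions, since the text does not fix a linkage criterion.

```python
import numpy as np

def agglomerative(X, k):
    """Naive hierarchical (agglomerative) clustering down to k clusters.

    Starts with every datum as its own cluster and repeatedly merges the
    two clusters whose centroids are nearest (a centroid-linkage choice).
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(X[clusters[a]].mean(axis=0) -
                                   X[clusters[b]].mean(axis=0))
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters.pop(b)   # merge the two nearest clusters
    return clusters

X = np.array([[0.0], [0.1], [5.0], [5.2], [9.0]])
print(agglomerative(X, k=3))  # -> [[0, 1], [2, 3], [4]]
```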
Finally, the last kind of clustering uses a completely probabilistic approach.
In this tutorial we present four of the most widely used clustering algorithms:
- K-means
- Fuzzy C-means
- Hierarchical clustering
- Mixture of Gaussians
III. DISTANCE MEASURE
An important component of a clustering algorithm is the distance measure between data points. If the
components of the data instance vectors are all in the same physical units then it is possible that the simple
Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case
the Euclidean distance can sometimes be misleading. The figure below illustrates this with an example of the
width and height measurements of an object. Despite both measurements being taken in the same physical units,
an informed decision has to be made as to the relative scaling. As the figure shows, different scaling can lead to
different clustering.
Notice, however, that this is not only a graphical issue: the problem arises from the mathematical formula used to combine the distances between the individual components of the data feature vectors into a single distance measure for clustering purposes. Different formulas lead to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure for each
particular application.
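To make the scaling issue concrete, here is a small Python sketch; the width/height values and the rescaling factor are illustrative assumptions, not data from the figure. It shows how a change of relative scale flips which point is the nearest neighbour:

```python
import numpy as np

# Hypothetical width/height measurements in the same physical unit.
points = np.array([[1.0, 10.0],   # object 0
                   [1.2, 30.0],   # object 1
                   [5.0, 12.0]])  # object 2

def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: height differences dominate, object 2 is nearest to object 0.
print(euclidean(points[0], points[1]))  # ~20.0
print(euclidean(points[0], points[2]))  # ~4.5

# Rescale height by 0.1 -- an equally defensible choice of units.
scaled = points * np.array([1.0, 0.1])
print(euclidean(scaled[0], scaled[1]))  # ~2.0  -> object 1 is now nearest
print(euclidean(scaled[0], scaled[2]))  # ~4.0
```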
IV. MINKOWSKI METRIC
For higher-dimensional data, a popular measure is the Minkowski metric

$$d_p(x_i, x_j) = \left( \sum_{k=1}^{d} \lvert x_{i,k} - x_{j,k} \rvert^{p} \right)^{1/p},$$

where d is the dimensionality of the data. The Euclidean distance is the special case p = 2, while the Manhattan metric corresponds to p = 1. However, there are no general theoretical guidelines for selecting a measure for any given application.
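As an illustration, here is a minimal Python sketch of the Minkowski metric; the function name and test vectors are our own choices, not taken from the paper.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between vectors x and y.

    p = 2 recovers the Euclidean distance, p = 1 the Manhattan distance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
```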
It is often the case that the components of the data feature vectors are not immediately comparable. It
can be that the components are not continuous variables, like length, but nominal categories, such as the days of
the week. In these cases again, domain knowledge must be used to formulate an appropriate measure.
The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results; so, the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated: as a result of it, the k centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error function.
The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\lVert x_i^{(j)} - c_j \right\rVert^{2},$$

where $\lVert x_i^{(j)} - c_j \rVert^{2}$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$; J is an indicator of the distance of the n data points from their respective cluster centres.
The algorithm is composed of the following steps:
1. Place k points into the space represented by the objects being clustered; these points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the k centroids.
4. Repeat steps 2 and 3 until the centroids no longer move.
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, i.e. the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centres; it can be run multiple times to reduce this effect.
K-means is a simple algorithm that has been adapted to many problem domains. As we are going to
see, it is a good candidate for extension to work with fuzzy feature vectors.
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that
they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well
separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if
|| x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means:
- Make initial guesses for the means m1, m2, ..., mk
- Until there are no changes in any mean:
  - Use the estimated means to classify the samples into clusters
  - For i from 1 to k:
    - Replace mi with the mean of all of the samples for cluster i
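A compact NumPy sketch of this procedure is given below; the random initialization, the handling of empty clusters, and the convergence test are our own illustrative choices rather than part of the original description.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (means, labels) for data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initial guess: k distinct samples chosen at random.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Classify every sample by its nearest mean (minimum-distance rule).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_means = means.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):                  # empty clusters keep their old mean
                new_means[i] = members.mean(axis=0)
        if np.allclose(new_means, means):     # no mean changed: done
            break
        means = new_means
    return means, labels

# Two well-separated 2-D clusters around (0, 0) and (5, 5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
means, labels = k_means(X, k=2)
print(means)
```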
Run on well-separated data such as this, the means m1 and m2 move into the centres of the two clusters within a few iterations.
Remarks
This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for
partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster
centers. It does have some weaknesses:
- The way to initialize the means was not specified. One popular way to start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
- It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation, but that we shall ignore.
- The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
- The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. In the example shown above, the same algorithm applied to the same data with k = 3 produces a different, 3-means clustering. Is it better or worse than the 2-means clustering?
Unfortunately, there is no general theoretical solution for finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion (for instance the Schwarz Criterion; see Moore's slides), but we need to be careful: increasing k yields smaller error-function values by definition, but also an increasing risk of overfitting.
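As a sketch of this model-selection loop, the snippet below runs k-means for several values of k on synthetic data and prints the within-cluster squared error; it assumes scikit-learn is available, and in practice a penalized score such as the Schwarz/Bayesian information criterion would replace the raw inertia.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data with three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in (0.0, 3.0, 6.0)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ (within-cluster sum of squares) always shrinks as k grows,
    # so compare runs with an elbow plot or a penalized criterion (e.g. BIC).
    print(k, round(km.inertia_, 1))
```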
V. FUZZY C-MEANS CLUSTERING
The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:
$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{\,m} \left\lVert x_i - c_j \right\rVert^{2},$$

where m is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster j, $x_i$ is the i-th item of d-dimensional measured data, $c_j$ is the d-dimensional centre of the cluster, and $\lVert \cdot \rVert$ is any norm expressing the similarity between any measured datum and the centre.
Fuzzy partitioning is carried out through an iterative optimization of the objective function shown
above, with the update of membership uij and the cluster centers cj by:
$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{\,m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}}.$$

This iteration stops when $\max_{ij} \lvert u_{ij}^{(k+1)} - u_{ij}^{(k)} \rvert < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.
The algorithm is composed of the following steps:
1. Initialize the membership matrix $U^{(0)} = [u_{ij}]$.
2. At step k, calculate the centre vectors $C^{(k)} = [c_j]$ using $U^{(k)}$.
3. Update $U^{(k)}$ to $U^{(k+1)}$ with the membership formula above.
4. If $\lVert U^{(k+1)} - U^{(k)} \rVert < \varepsilon$, stop; otherwise return to step 2.
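A direct Python transcription of these steps follows; the random initialization, the numerical floor on distances, and the function name are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=0.3, max_iter=100, seed=0):
    """FCM on data X of shape (n, d) with c clusters; returns (U, centers)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)            # each row of U sums to one
    centers = None
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # c_j update
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)           # numerical floor, avoids /0
        # u_ij update: inverse of summed distance ratios raised to 2/(m-1)
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)
        if np.abs(U_new - U).max() < eps:        # termination criterion
            return U_new, centers
        U = U_new
    return U, centers

# Twenty 1-D data points and three clusters, as in the example below.
X = np.random.default_rng(2).uniform(0.0, 10.0, size=(20, 1))
U, centers = fuzzy_c_means(X, c=3, m=2.0, eps=0.01)
print(centers.ravel())
```

Tightening eps from 0.3 to 0.01 trades extra iterations for a sharper partition, mirroring the accuracy/effort trade-off discussed below.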
Remarks
As already mentioned, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of this algorithm. To that end, we simply have to build an appropriate matrix U whose entries are numbers between 0 and 1 and represent the degrees of membership between data and centres of clusters.
For a better understanding, consider this simple mono-dimensional example: given a certain data set, suppose we represent it as distributed along an axis. Looking at the picture, we can identify two clusters in the proximity of the two data concentrations; we will refer to them as 'A' and 'B'. In the first approach shown in this tutorial, the k-means algorithm, we associated each datum with a specific centroid; therefore, the membership function looked like a sharp step between the two clusters.
In the FCM approach, instead, the same given datum does not belong exclusively to one well-defined cluster: it can be placed in between. In this case, the membership function follows a smoother curve, indicating that every datum may belong to several clusters with different values of the membership coefficient.
In the figure above, the datum shown as a red marked spot belongs more to cluster B than to cluster A; the value 0.2 of 'm' indicates the degree of membership to A for that datum. Now, instead of using a graphical representation, we introduce a matrix U whose entries are taken from the membership functions:
[Matrix (a): k-means memberships; matrix (b): FCM memberships]
The number of rows and columns depends on how many data and clusters we are considering. More exactly, we have C = 2 columns (C = 2 clusters) and N rows, where C is the total number of clusters and N is the total number of data. The generic element is denoted $u_{ij}$.
In the examples above we have considered the k-means (a) and FCM (b) cases. Notice that in the first case (a) the coefficients are always unitary: each datum has membership 1 in exactly one cluster and 0 in the others, indicating that each datum can belong only to one cluster. Other properties are shown below:
- $u_{ij} \in [0, 1]$ for every datum i and every cluster j;
- $\sum_{j=1}^{C} u_{ij} = 1$ for every datum i (each row of U sums to one).
An Example
Here we consider the simple case of a mono-dimensional application of FCM. Twenty data points and three clusters are used to initialize the algorithm and to compute the U matrix. The figures below (taken from our interactive demo) show the membership value for each datum and for each cluster; the color of each datum is that of the nearest cluster according to the membership function.
In the simulation shown in the figure above we have used a fuzziness coefficient m = 2 and we have also imposed that the algorithm terminate when $\max_{ij} \lvert u_{ij}^{(k+1)} - u_{ij}^{(k)} \rvert < \varepsilon = 0.3$. The picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters: no step has been performed yet, so the clusters are not identified very well. Now we can run the algorithm until the stop condition is verified. The figure below shows the final condition, reached at the 8th step with m = 2 and $\varepsilon$ = 0.3:
Is it possible to do better? Certainly: we could use a higher accuracy, but we would also have to pay for it with a bigger computational effort. The next figure shows a better result obtained with the same initial conditions and $\varepsilon$ = 0.01, but 37 steps were needed!
It is also important to notice that different initializations cause different evolutions of the algorithm: it may converge to the same result, but probably with a different number of iteration steps.
REFERENCES
[1] Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering".
[2] J. B. MacQueen: "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1:281-297 (1967).
[3] Andrew Moore: "K-means and Hierarchical Clustering - Tutorial Slides".
[4] J. C. Dunn: "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57 (1973).
[5] J. C. Bezdek: "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York (1981).
[6] Tariq Rashid: "Clustering".
[7] Hans-Joachim Mucha and Hizir Sofyan: "Nonhierarchical Clustering".
[8] Karp, R. M., Rabin, M. O.: "Efficient Randomized Pattern-Matching Algorithms", IBM Journal of Research and Development, vol. 31(2), pp. 249-260 (1987).
[9] Li, B., Xu, S., Zhang, J.: "Enhancing Clustering of Blog Documents by Utilizing Author/Reader Comments", ACM Southeast Regional Conference, pp. 94-99 (2007).
[10] Sholom, M., Indurkhya, N., Zhang, T., Damerau, F.: "Text Mining: Predictive Methods for Analyzing Unstructured Information", Springer (2004).
[11] Zhong, S.: "Semi-supervised Model-based Document Clustering: A Comparative Study", Machine Learning, 65(1) (2006).

More Related Content

What's hot

Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clusteringDr Nisha Arora
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringIJERD Editor
 
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...IRJET Journal
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Alexander Decker
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Editor IJARCET
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysisguru_prasadg
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAEditor Jacotech
 
Clustering techniques final
Clustering techniques finalClustering techniques final
Clustering techniques finalBenard Maina
 
Improved probabilistic distance based locality preserving projections method ...
Improved probabilistic distance based locality preserving projections method ...Improved probabilistic distance based locality preserving projections method ...
Improved probabilistic distance based locality preserving projections method ...IJECEIAES
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detectionColleen Farrelly
 
Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsCSCJournals
 

What's hot (17)

Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Ensemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes ClusteringEnsemble based Distributed K-Modes Clustering
Ensemble based Distributed K-Modes Clustering
 
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
Machine Learning Algorithms for Image Classification of Hand Digits and Face ...
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
 
Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147Volume 2-issue-6-2143-2147
Volume 2-issue-6-2143-2147
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Chap8 basic cluster_analysis
Chap8 basic cluster_analysisChap8 basic cluster_analysis
Chap8 basic cluster_analysis
 
A Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCAA Novel Algorithm for Design Tree Classification with PCA
A Novel Algorithm for Design Tree Classification with PCA
 
Clustering techniques final
Clustering techniques finalClustering techniques final
Clustering techniques final
 
Clustering
ClusteringClustering
Clustering
 
Improved probabilistic distance based locality preserving projections method ...
Improved probabilistic distance based locality preserving projections method ...Improved probabilistic distance based locality preserving projections method ...
Improved probabilistic distance based locality preserving projections method ...
 
Quantum persistent k cores for community detection
Quantum persistent k cores for community detectionQuantum persistent k cores for community detection
Quantum persistent k cores for community detection
 
Finding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster ResultsFinding Relationships between the Our-NIR Cluster Results
Finding Relationships between the Our-NIR Cluster Results
 
My8clst
My8clstMy8clst
My8clst
 

Viewers also liked

R User Group Ian Simpson200110
R User Group Ian Simpson200110R User Group Ian Simpson200110
R User Group Ian Simpson200110Ian Simpson
 
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...Symeon Papadopoulos
 
Promotion schedule
Promotion schedulePromotion schedule
Promotion schedulevasu_goel
 
ENVOGUE - Energy Trending!
ENVOGUE - Energy Trending!ENVOGUE - Energy Trending!
ENVOGUE - Energy Trending!Umesh Bhutoria
 
Projections Update
Projections UpdateProjections Update
Projections UpdateFaheem Ahmed
 
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOW
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOWSwitch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOW
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOWRicardo Tomé
 
Kutu kutu buku
Kutu kutu bukuKutu kutu buku
Kutu kutu bukuknayanawaw
 
The Mechanics of Social Media
The Mechanics of Social MediaThe Mechanics of Social Media
The Mechanics of Social MediaMatthew Gerrior
 
Data Collection Using the Juniper Systems Archer 2
Data Collection Using the Juniper Systems Archer 2Data Collection Using the Juniper Systems Archer 2
Data Collection Using the Juniper Systems Archer 2COGS Presentations
 
Discharge report
Discharge reportDischarge report
Discharge reportVandit Jain
 
Cross Media Capabilities
Cross Media CapabilitiesCross Media Capabilities
Cross Media Capabilitiestpandolfo
 
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)Dr Arindam Basu
 
2014 jun online_game_a_dspending
2014 jun online_game_a_dspending2014 jun online_game_a_dspending
2014 jun online_game_a_dspendingTaboola
 
Dreams
DreamsDreams
DreamsR J
 
Metzger MS defense
Metzger MS defenseMetzger MS defense
Metzger MS defensedmetzger1
 

Viewers also liked (19)

R User Group Ian Simpson200110
R User Group Ian Simpson200110R User Group Ian Simpson200110
R User Group Ian Simpson200110
 
Agm expansions
Agm expansionsAgm expansions
Agm expansions
 
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...
EventSense: Capturing the Pulse of Large-scale Events by Mining Social Media ...
 
32 99-1-pb
32 99-1-pb32 99-1-pb
32 99-1-pb
 
Promotion schedule
Promotion schedulePromotion schedule
Promotion schedule
 
ENVOGUE - Energy Trending!
ENVOGUE - Energy Trending!ENVOGUE - Energy Trending!
ENVOGUE - Energy Trending!
 
Are you mdm aware
Are you mdm awareAre you mdm aware
Are you mdm aware
 
Projections Update
Projections UpdateProjections Update
Projections Update
 
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOW
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOWSwitch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOW
Switch 2010 - 5 Para a Meia-Noite / CROSSMEDIA SHOW
 
Kutu kutu buku
Kutu kutu bukuKutu kutu buku
Kutu kutu buku
 
The Mechanics of Social Media
The Mechanics of Social MediaThe Mechanics of Social Media
The Mechanics of Social Media
 
Data Collection Using the Juniper Systems Archer 2
Data Collection Using the Juniper Systems Archer 2Data Collection Using the Juniper Systems Archer 2
Data Collection Using the Juniper Systems Archer 2
 
Discharge report
Discharge reportDischarge report
Discharge report
 
Cross Media Capabilities
Cross Media CapabilitiesCross Media Capabilities
Cross Media Capabilities
 
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)
Presentation on UDHC for UC Tertiary Engagement Summit (Draft Slides)
 
Ipl 7 game of brands week 2
Ipl 7 game of brands  week 2Ipl 7 game of brands  week 2
Ipl 7 game of brands week 2
 
2014 jun online_game_a_dspending
2014 jun online_game_a_dspending2014 jun online_game_a_dspending
2014 jun online_game_a_dspending
 
Dreams
DreamsDreams
Dreams
 
Metzger MS defense
Metzger MS defenseMetzger MS defense
Metzger MS defense
 

Similar to Multidimensional Clustering Methods for Industrial Applications

Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...IRJET Journal
 
clustering ppt.pptx
clustering ppt.pptxclustering ppt.pptx
clustering ppt.pptxchmeghana1
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithmijsrd.com
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1bPRAWEEN KUMAR
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...IJCSIS Research Publications
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptionsrefedey275
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSZac Darcy
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersZac Darcy
 
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...IOSR Journals
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...IJECEIAES
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureIOSR Journals
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsIJMER
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithmLaura Petrosanu
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)Pratik Meshram
 
For iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxFor iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxSureshPolisetty2
 

Similar to Multidimensional Clustering Methods for Industrial Applications (20)

Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...Cancer data partitioning with data structure and difficulty independent clust...
Cancer data partitioning with data structure and difficulty independent clust...
 
clustering ppt.pptx
clustering ppt.pptxclustering ppt.pptx
clustering ppt.pptx
 
A survey on Efficient Enhanced K-Means Clustering Algorithm
 A survey on Efficient Enhanced K-Means Clustering Algorithm A survey on Efficient Enhanced K-Means Clustering Algorithm
A survey on Efficient Enhanced K-Means Clustering Algorithm
 
47 292-298
47 292-29847 292-298
47 292-298
 
84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b84cc04ff77007e457df6aa2b814d2346bf1b
84cc04ff77007e457df6aa2b814d2346bf1b
 
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
Improved K-mean Clustering Algorithm for Prediction Analysis using Classifica...
 
Unsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and AssumptionsUnsupervised learning Algorithms and Assumptions
Unsupervised learning Algorithms and Assumptions
 
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERSA MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
A MIXTURE MODEL OF HUBNESS AND PCA FOR DETECTION OF PROJECTED OUTLIERS
 
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected OutliersA Mixture Model of Hubness and PCA for Detection of Projected Outliers
A Mixture Model of Hubness and PCA for Detection of Projected Outliers
 
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
A Density Based Clustering Technique For Large Spatial Data Using Polygon App...
 
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
Clustering heterogeneous categorical data using enhanced mini  batch K-means ...Clustering heterogeneous categorical data using enhanced mini  batch K-means ...
Clustering heterogeneous categorical data using enhanced mini batch K-means ...
 
Clustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity MeasureClustering Algorithm with a Novel Similarity Measure
Clustering Algorithm with a Novel Similarity Measure
 
An Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data FragmentsAn Efficient Clustering Method for Aggregation on Data Fragments
An Efficient Clustering Method for Aggregation on Data Fragments
 
Du35687693
Du35687693Du35687693
Du35687693
 
8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm8.clustering algorithm.k means.em algorithm
8.clustering algorithm.k means.em algorithm
 
Dp33701704
Dp33701704Dp33701704
Dp33701704
 
Rohit 10103543
Rohit 10103543Rohit 10103543
Rohit 10103543
 
pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)pratik meshram-Unit 5 (contemporary mkt r sch)
pratik meshram-Unit 5 (contemporary mkt r sch)
 
Chapter 5.pdf
Chapter 5.pdfChapter 5.pdf
Chapter 5.pdf
 
For iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptxFor iiii year students of cse ML-UNIT-V.pptx
For iiii year students of cse ML-UNIT-V.pptx
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Multidimensional Clustering Methods for Industrial Applications

  • 1. International Journal of Engineering Science Invention ISSN (Online): 2319 – 6734, ISSN (Print): 2319 – 6726 www.ijesi.org Volume 2 Issue 7 ǁ July. 2013 ǁ PP.01-08 www.ijesi.org 1 | Page Multidimensional Clustering Methods of Data Mining for Industrial Applications Ashok Kumar Thavani Andu1 , Dr. Antony Selvdoss Thanamani2 1 Assistant Professor, CMS College, Coimabtore.Tamilnadu.INDIA 2 Professor & Head, NGM College, Pollachi. Tamilnadu.India ABSTRACT: Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a difficult problem combinatorially, and differences in assumptions and contexts in different communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with a goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques, and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms such as image segmentation, object recognition, and information retrieval. I. INTRODUCTION Clustering can be considered the most important unsupervised, earning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters. We can show this with a simple graphical example: In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this one defines a concept common to all that objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures. II. THE GOALS OF CLUSTERING The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs Possible Applications clustering algorithms can be applied in many fields, for instance:  Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;  Biology: classification of plants and animals given their features;  Libraries: book ordering;
  • 2. Multidimensional Clustering Methods... www.ijesi.org 2 | Page  Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;  City-planning: identifying groups of houses according to their house type, value and geographical location;  Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;  WWW: document classification; clustering weblog data to discover groups of similar access patterns. Requirements the main requirements that a clustering algorithm should satisfy are:  scalability;  dealing with different types of attributes;  discovering clusters with arbitrary shape;  minimal requirements for domain knowledge to determine input parameters;  ability to deal with noise and outliers;  insensitivity to order of input records;  high dimensionality;  Interpretability and usability. Problems There are a number of problems with clustering. Among them:  current clustering techniques do not address all the requirements adequately (and concurrently);  dealing with large number of dimensions and large number of data items can be problematic because of time complexity;  the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);  if an obvious distance measure doesn‟t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;  the result of the clustering algorithm (that in many cases can be arbitrary itself) can be interpreted in different ways. Clustering Algorithms Classification clustering algorithms may be classified as listed below:  Exclusive Clustering  Overlapping Clustering  Hierarchical Clustering  Probabilistic Clustering In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. A simple example of that is shown in the figure below, where the separation of points is achieved by a straight line. On contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, data will be associated to an appropriate membership value. Instead, a hierarchical clustering algorithm is based on the union between the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. After a few iterations it reaches the final clusters wanted.
  • 3. Multidimensional Clustering Methods... www.ijesi.org 3 | Page Finally, the last kind of clustering uses a completely probabilistic approach. In this tutorial we propose four of the most used clustering algorithms:  K-means  Fuzzy C-means  Hierarchical clustering  Mixture of Gaussians III. DISTANCE MEASURE An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. Figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scaling can lead to different clustering. Notice however that this is not only a graphic issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas leads to different clusterings. Again, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application. IV. MINKOWSKI METRIC For higher dimensional data, a popular measure is the Minkowski metric, Where d is the dimensionality of the data. The Euclidean distance is a special case where p=2, while Manhattan metric has p=1. However, there are no general theoretical guidelines for selecting a measure for any given application. It is often the case that the components of the data feature vectors are not immediately comparable. It can be that the components are not continuous variables, like length, but nominal categories, such as the days of the week. In these cases again, domain knowledge must be used to formulate an appropriate measure. The Algorithm K- means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way because of different location causes different result. So, the better choice is to place them as much as possible far away from each other. The next step is to take each point belonging to a given data set and associate it to the nearest centroid. When no point is pending, the first step is completed and an early groupage is done. At this point we need to re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this change their location step by step until no more changes are done. In other words centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function
  • 4. Multidimensional Clustering Methods... www.ijesi.org 4 | Page , Where is a chosen distance measure between a data point and the cluster centre , is an indicator of the distance of the n data points from their respective cluster centers. The algorithm is composed of the following steps: Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the most optimal configuration, corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centres. The k-means algorithm can be run multiple times to reduce this effect. K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is a good candidate for extension to work with fuzzy feature vectors. Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means:  Make initial guesses for the means m1, m2, ..., mk  Until there are no changes in any mean  Use the estimated means to classify the samples into clusters  For i from 1 to k  Replace mi with the mean of all of the samples for cluster i  end_for  end_until Here is an example showing how the means m1 and m2 move into the centers of two clusters. Remarks This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It does have some weaknesses:  The way to initialize the means was not specified. One popular way to start is to randomly choose k of the samples.  The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.  It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation, but that we shall ignore.  The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.  The results depend on the value of k. This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. In the example shown above, the same algorithm applied to the same data produces the following 3-means clustering. Is it better or worse than the 2-means clustering?
Unfortunately, there is no general theoretical solution to find the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different numbers of classes k and choose the best one according to a given criterion (for instance the Schwarz criterion; see Moore's slides [3]). We need to be careful, however: increasing k reduces the value of the error function by definition, but it also increases the risk of overfitting.

V. FUZZY C-MEANS CLUSTERING
The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. The method (developed by Dunn in 1973 [4] and improved by Bezdek in 1981 [5]) is frequently used in pattern recognition. It is based on the minimization of the following objective function:

    J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \| x_i - c_j \|^2, \qquad 1 \le m < \infty,

where m is any real number greater than 1, u_ij is the degree of membership of x_i in cluster j, x_i is the i-th of the d-dimensional measured data, c_j is the d-dimensional centre of cluster j, and ||·|| is any norm expressing the similarity between a measured datum and a centre.

Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership values u_ij and the cluster centres c_j updated by

    u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{2/(m-1)}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}.

The iteration stops when \max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \varepsilon, where ε is a termination criterion between 0 and 1 and k counts the iteration steps. This procedure converges to a local minimum or a saddle point of J_m. The algorithm is composed of the following steps:
 Initialize the matrix U = [u_ij] as U(0)
 At step k, calculate the centre vectors C(k) = [c_j] with U(k)
 Update U(k) to U(k+1)
 If ||U(k+1) - U(k)|| < ε then STOP; otherwise return to step 2.
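The update equations above translate almost line-for-line into code. The following Python sketch (an illustrative implementation written for this paper, not a library routine) iterates the two updates for a given fuzziness m and termination threshold ε:

    import numpy as np

    def fuzzy_c_means(X, C, m=2.0, eps=1e-2, max_iter=100, seed=0):
        # X: (N, d) data matrix; C: number of clusters; m: fuzziness (> 1).
        rng = np.random.default_rng(seed)
        # Step 1: initialize U(0) with random memberships, rows summing to 1.
        U = rng.random((len(X), C))
        U /= U.sum(axis=1, keepdims=True)
        for _ in range(max_iter):
            # Step 2: centres c_j as the u_ij^m-weighted means of the data.
            Um = U ** m
            centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
            # Step 3: update memberships from the distances to the centres.
            dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            dist = np.fmax(dist, 1e-12)        # guard against division by zero
            ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1))
            U_new = 1.0 / ratio.sum(axis=2)
            # Step 4: stop when the largest membership change is below eps.
            done = np.abs(U_new - U).max() < eps
            U = U_new
            if done:
                break
        return centres, U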
Remarks
As noted above, data are bound to each cluster by means of a membership function, which represents the fuzzy behaviour of this algorithm. To express this, we simply have to build an appropriate matrix U whose elements are numbers between 0 and 1 and represent the degree of membership between data and centres of clusters.

For a better understanding, we may consider a simple mono-dimensional example. Given a certain data set, suppose we represent it as distributed on an axis. Looking at the picture, we may identify two clusters in the proximity of the two data concentrations; we will refer to them as "A" and "B". In the first approach shown in this paper, the k-means algorithm, we associated each datum with a specific centroid, so the membership function is a step function taking only the values 0 and 1. In the FCM approach, instead, the same given datum does not belong exclusively to a well-defined cluster but can be placed in a middle way; in this case the membership function follows a smoother curve, to indicate that every datum may belong to several clusters with different values of the membership coefficient. In the figure above, the datum shown as a red spot belongs more to cluster B than to cluster A; the value 0.2 indicates its degree of membership to A. Now, instead of using a graphical representation, we introduce a matrix U whose elements are taken from the membership functions.
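As an illustration (the numbers below are invented for this example, not taken from any particular data set), a k-means membership matrix (a) and an FCM membership matrix (b) for four data and two clusters might look as follows:

    import numpy as np

    # (a) k-means: each row contains a single 1, so each datum
    #     belongs to exactly one cluster.
    U_hard = np.array([[1, 0],
                       [1, 0],
                       [0, 1],
                       [0, 1]])

    # (b) FCM: each row holds graded memberships in [0, 1] that sum to 1.
    U_fuzzy = np.array([[0.8, 0.2],
                        [0.7, 0.3],
                        [0.2, 0.8],
                        [0.1, 0.9]])

    print(U_fuzzy.sum(axis=1))   # every row sums to 1: [1. 1. 1. 1.]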
The number of rows and columns of U depends on how many data and how many clusters we are considering. More exactly, we have C = 2 columns (C = 2 clusters) and N rows, where C is the total number of clusters and N is the total number of data; the generic element is indicated u_ij. In the examples above we have considered the k-means (a) and FCM (b) cases. We can notice that in the k-means case (a) the coefficients take only the values 0 and 1, indicating that each datum can belong to one cluster only. The other properties of the membership coefficients are:
 u_{ij} \in [0, 1] for all i and j;
 \sum_{j=1}^{C} u_{ij} = 1 for every datum i.

An Example
Here we consider the simple case of a mono-dimensional application of FCM. Twenty data and three clusters are used to initialize the algorithm and to compute the U matrix. The figures below (taken from an interactive demonstration) show the membership value of each datum for each cluster; the colour of each datum is that of the nearest cluster according to the membership function. In the simulation we used a fuzziness coefficient m = 2 and imposed termination when the largest change in the membership values fell below ε = 0.3. The first picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters: no step has been performed yet, so the clusters are not identified very well. We can then run the algorithm until the stop condition is verified; the final condition is reached at the 8th step with m = 2 and ε = 0.3.
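Using the fuzzy_c_means sketch given earlier, a comparable mono-dimensional experiment can be reproduced; the synthetic data below merely stand in for those of the original demonstration:

    import numpy as np

    # Twenty mono-dimensional data around three concentrations (illustrative).
    rng = np.random.default_rng(42)
    x = np.concatenate([rng.normal(0.0, 0.3, 7),
                        rng.normal(3.0, 0.3, 7),
                        rng.normal(6.0, 0.3, 6)])
    X = x.reshape(-1, 1)                  # the sketch expects an (N, d) matrix

    centres, U = fuzzy_c_means(X, C=3, m=2.0, eps=0.3)
    print(centres.ravel())                # the three cluster centres
    print(U.sum(axis=1))                  # each row of U sums to 1
    labels = U.argmax(axis=1)             # hardened assignment, e.g. for colouring

Tightening eps in this sketch reproduces the trade-off discussed next: a sharper partition at the cost of more iteration steps.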
Is it possible to do better? Certainly: we could use a higher accuracy, but we would also have to pay with a greater computational effort. Using the same initial conditions and ε = 0.01 we obtain a better result, but 37 steps are needed instead of 8. It is also important to notice that different initializations cause different evolutions of the algorithm: it may well converge to the same result, but probably with a different number of iteration steps.

REFERENCES
[1] Osmar R. Zaïane: "Principles of Knowledge Discovery in Databases - Chapter 8: Data Clustering".
[2] J. B. MacQueen: "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1:281-297 (1967).
[3] Andrew Moore: "K-means and Hierarchical Clustering - Tutorial Slides".
[4] J. C. Dunn: "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics, 3:32-57 (1973).
[5] J. C. Bezdek: "Pattern Recognition with Fuzzy Objective Function Algorithms", Plenum Press, New York (1981).
[6] Tariq Rashid: "Clustering".
[7] Hans-Joachim Mucha and Hizir Sofyan: "Nonhierarchical Clustering".
[8] R. M. Karp and M. O. Rabin: "Efficient Randomized Pattern-Matching Algorithms", IBM Journal of Research and Development, 31(2):249-260 (1987).
[9] B. Li, S. Xu and J. Zhang: "Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments", ACM Southeast Regional Conference, pp. 94-99 (2007).
[10] M. Sholom, N. Indurkhya, T. Zhang and F. Damerau: "Text Mining: Predictive Methods for Analyzing Unstructured Information", Springer (2004).
[11] S. Zhong: "Semi-supervised Model-based Document Clustering: A Comparative Study", Machine Learning, 65(1) (2006).