A loose definition of clustering could be: the
process of organising objects into groups whose
members are similar in some way. Its task is to
group a set of objects in such a way that objects
in the same group are more similar to each other
than to objects in other groups.
The goal of clustering is to determine the
intrinsic grouping in a set of unlabeled data.
Thus, cluster analysis is sometimes referred to as
“unsupervised classification” and is distinct from
“supervised classification”, or more commonly
Hierarchical clustering- It is based on the core idea of objects
being more related to nearby objects than to objects farther away.
Centroid-based clustering- In this method clusters are
represented by a central vector, which may not necessarily be a
member of the data set.
Distribution-based clustering- The clustering model most
closely related to statistics is based on distribution models.
Density-based clustering- Clusters are defined as areas of
higher density than the remainder of the data set. Objects in the
sparse areas are usually considered to be noise or border points.
Market Research: Market researchers use cluster analysis to
partition the general population of consumers into market segments.
Social network analysis: In the study of social networks,
clustering may be used to recognize communities within large
groups of people.
Computer Science: Clustering is useful in software evolution
as it helps to reduce legacy properties in code by reforming
functionality that has become dispersed.
Contour tracing is used to extract boundaries; the border pixels
of the boundaries are extracted. Contour tracing is one of the
many pre-processing techniques performed on digital images in
order to extract information about their general shape.
Contour detection is used because contour pixels are generally
a small subset of the total number of pixels representing a
pattern. Thus, the amount of computation is reduced when a
feature extraction algorithm is run on the contour instead of
the whole pattern. Also, the contour shares many features with
the original pattern; hence, the feature extraction process becomes much
The Moore neighbourhood of a
pixel, P, is the set of 8 pixels
which share a vertex or edge with
that pixel.
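The Moore neighbourhood can be written down as eight (row, column) offsets around P. The sketch below is only illustrative: the names `MOORE_OFFSETS` and `moore_neighbourhood` are hypothetical, and it assumes the image is a grid indexed by row and column, with neighbours outside the grid simply dropped.

```python
# Offsets (d_row, d_col) of the 8 Moore neighbours, listed clockwise
# starting from the pixel directly above P (row 0 is the top row).
MOORE_OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
                 (1, 0), (1, -1), (0, -1), (-1, -1)]


def moore_neighbourhood(row, col, height, width):
    """Return the Moore neighbours of (row, col) that lie inside the grid."""
    neighbours = []
    for d_row, d_col in MOORE_OFFSETS:
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            neighbours.append((r, c))
    return neighbours
```

An interior pixel has all 8 neighbours, while a corner pixel of the grid keeps only 3 of them.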
Given a digital pattern, locate a
black pixel and declare it as your
"start" pixel. Locating a "start"
pixel can be done in a number of
ways; we'll start at the bottom left
corner of the grid and scan each
column of pixels from the bottom
going upwards, starting from the
leftmost column and proceeding
to the right, until we encounter a
black pixel.
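The column-by-column scan for a start pixel can be sketched as follows. This is a minimal illustration, assuming the pattern is stored as a list of rows with row 0 at the top and black pixels encoded as 1; the function name `find_start_pixel` is hypothetical.

```python
def find_start_pixel(grid):
    """Scan columns left to right, each from bottom to top, and return
    the (row, col) of the first black pixel, or None if there is none.

    Assumption: grid is a list of equal-length rows, row 0 is the top
    row, and a black pixel is stored as 1.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    for col in range(width):                    # leftmost column first
        for row in range(height - 1, -1, -1):   # bottom of the column first
            if grid[row][col] == 1:
                return (row, col)
    return None
```

With this scan order, the pixel examined just before the start pixel (the one directly below it) is known to be white, which is convenient as the first backtrack position for tracing.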
After reading books, research papers on clustering and its applications,
and reference material, I gathered that although clustering is widely used
in many fields, including contour detection, to make a data set more
understandable by removing noise and grouping useful information, it still
has many drawbacks. These include the choice of an effective clustering
technique, the selection of the data set, the number of clusters, and the
validation of results. In market segmentation especially, result validation
is neglected, and when it is done the procedure is usually ambiguous.
The clustering techniques used are very sensitive to the selection of the data
set, the number of clusters, the size of the data set, etc. Different techniques
vary accordingly in speed, time and space complexity, and accuracy of the final
clusters.
Though many algorithms and methods exist for contour extraction, these
methods still lack efficiency. Also, these methods are not a universal
solution; they need to be customized for each new data set. In addition,
a better clustering method that can be used with contour detection does not
exist.
Data Selection- Selecting the appropriate variables used in the clustering
process is one of the most fundamental steps because the inclusion of
irrelevant variables may distort and render useless an otherwise useful
Clustering algorithm selection- Cluster analysis (CA) encompasses a number of
different algorithms and methods for grouping objects or subjects. The increasing
number of CA methods available, combined with their specific properties,
has led some researchers to point out the bewildering problem of selecting
the best method in some sense. Because each technique is different and has
specific properties that lead to different segmentation solutions, it is very
important to carefully select the algorithm that will be used.
Inefficiency of contour extraction algorithm- In the original description of
the algorithm used in Moore-Neighbour tracing, the stopping criterion is
visiting the start pixel for a second time.
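As a rough sketch of Moore-Neighbour tracing with this original stopping criterion, the function below walks clockwise around the current pixel, starting from the backtrack position, and stops as soon as the start pixel is reached again. The names `trace_contour` and `is_black`, the 0/1 grid encoding, and the requirement that the caller supplies a white (or out-of-grid) backtrack position next to the start pixel are all assumptions made for illustration. Stopping on the first revisit can end the trace early on some patterns.

```python
# Moore neighbours as (d_row, d_col), clockwise from the pixel above.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
           (1, 0), (1, -1), (0, -1), (-1, -1)]


def is_black(grid, row, col):
    """True if (row, col) is inside the grid and holds a black pixel (1)."""
    return 0 <= row < len(grid) and 0 <= col < len(grid[0]) and grid[row][col] == 1


def trace_contour(grid, start, backtrack):
    """Trace a contour from `start`, stopping when `start` is visited again.

    Assumption: `backtrack` is a white or out-of-grid Moore neighbour of
    `start`, e.g. the pixel directly below it after a bottom-up scan.
    """
    contour = [start]
    current, back = start, backtrack
    while True:
        # Position of the backtrack pixel in the ring around `current`.
        i = OFFSETS.index((back[0] - current[0], back[1] - current[1]))
        found = None
        for step in range(1, 8):                 # the other 7 neighbours, clockwise
            d_row, d_col = OFFSETS[(i + step) % 8]
            candidate = (current[0] + d_row, current[1] + d_col)
            if is_black(grid, *candidate):
                found = candidate
                break
            back = candidate                     # last white pixel examined
        if found is None:                        # isolated pixel, nothing to trace
            return contour
        current = found
        if current == start:                     # original stopping criterion
            return contour
        contour.append(current)
```

For a 2x2 black square the trace visits each of the four pixels once and then returns to the start.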
The basic scenario is as follows: extract the coordinates of a region from a 2D
grid. The value in each cell is the intensity of the area represented by that
cell. If this value is zero, the cell represents an empty area. Each connected
set of cells with the same intensity represents a region of that intensity. A
region can have holes, meaning that the interior of a region can contain cells
of another intensity or of intensity zero. So the problem is to extract each
such region together with its set of hole cycles.
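The first half of this problem, grouping cells into connected regions of equal intensity, can be sketched with a breadth-first flood fill. This is an illustration under assumptions, not the method used in this work: the name `extract_regions` is hypothetical, connectivity is taken to be 4-neighbour (edge-sharing), and the hole cycles are not computed here.

```python
from collections import deque


def extract_regions(grid):
    """Return a list of (intensity, cells) for each connected region.

    Assumptions: grid is a non-empty list of equal-length rows, zero
    means empty, and two cells are connected when they share an edge
    and have the same intensity.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for row in range(height):
        for col in range(width):
            if grid[row][col] == 0 or seen[row][col]:
                continue
            intensity = grid[row][col]
            seen[row][col] = True
            queue = deque([(row, col)])
            cells = []
            while queue:
                r, c = queue.popleft()
                cells.append((r, c))
                for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + d_row, c + d_col
                    if (0 <= nr < height and 0 <= nc < width
                            and not seen[nr][nc] and grid[nr][nc] == intensity):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append((intensity, cells))
    return regions
```

A ring of cells with intensity 2 around a single zero cell comes out as one region of 8 cells; the enclosed zero cell is the hole that a later step would have to report as a hole cycle.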
Many approaches are available for the study of data; these include
representing data in a better-defined form, reducing noise, etc. While
various methods have been developed for these purposes,
some complications still exist. Sometimes these methods cannot
be applied to all kinds of data sets; data sets with varying noise, dimensions,
In contour detection, cluster analysis is used to study and
organize data obtained from a survey. In this case, however, the
clustering algorithm is applied to all the objects of the
data set, including the objects not belonging to any cluster.
The algorithm was run on a test data set and on a data set downloaded
from the UCI repository. Satisfactory results were obtained with the
test data set; however, the coordinates from the contour detection
data set show ambiguity. One of the advantages of the proposed
algorithm is that it is effectively applicable to large data sets
with few dimensions. The validation of clusters is also done
effectively. This makes the method highly robust against possible
attacks. Further attacks, such as clustering high-dimensional data
sets, can be carried out.
Experimentation with varied data sets and different algorithms will enable
a better understanding of the proposed clustering scheme.
In contour detection cluster analysis, I applied a clustering algorithm to the
test data set, forming clusters with distance as the similarity measure.
Variations of this approach can be considered. For example, instead of
applying my algorithm, any other existing method can be used.
The clustering technique for the test data set was extended to test the
validity and stability of clusters. Various types of attacks can
be carried out to test the robustness of the scheme.
Also, besides the proposed clustering method, another method can be used to
carry out the clustering, contour extraction and validation effectively.
Difficulty in comparing the quality of the clusters produced (e.g. different initial
partitions or values of K affect the outcome).
A fixed number of clusters can make it difficult to predict what K should be.
Different initial partitions can result in different final clusters. It is helpful to rerun
the program using the same as well as different K values, to compare the results.
Euclidean distance measures can unequally weight underlying factors. If two groups
of data overlap heavily, the algorithm will not be able to resolve that there are
two clusters.
Output file generated may contain mixed coordinates of holes and pixels of
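
The sensitivity to the initial partition listed above can be reproduced with a small K-means sketch. K-means is used here only as a familiar centroid-based example; it is not claimed to be the algorithm proposed in this work. The function names and the four-point data set are made up for illustration, and the initial centroids are passed in explicitly so that the runs are repeatable.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd iterations from the given initial centroids.

    Returns (labels, centroids, sse), where sse is the sum of squared
    distances from each point to its assigned centroid.
    """
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[k])     # keep an empty cluster's centroid
        if updated == centroids:                 # converged
            break
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[label]) for p, label in zip(points, labels))
    return labels, centroids, sse


# Hypothetical data: the corners of a 4-by-1 rectangle, clustered with K = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
labels_a, _, sse_a = kmeans(points, [(0, 0.5), (4, 0.5)])   # left/right split
labels_b, _, sse_b = kmeans(points, [(2, 0), (2, 1)])       # bottom/top split
```

Both runs converge, but to different partitions: the first separates left from right, the second separates bottom from top and has a much larger sum of squared errors. Rerunning from several initial partitions and comparing an error measure such as this one is a simple way to compare the results.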