Running Head: CLUSTERING TECHNIQUES
CLUSTERING TECHNIQUES
NAME:
INSTITUTION:
Introduction
Clustering and classification are fundamental activities in data mining. Classification is mostly used as a supervised learning process, while clustering is used for unsupervised learning, and some clustering models serve both purposes (Raymond & Jiawei, 2003). The main aim of clustering is descriptive; that of classification is predictive. Since the aim of clustering is to discover a new set of categories, the new categories are of interest in themselves and their assessment is intrinsic; in classification tasks, by contrast, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes (Jiawei & Michelle, 2001). Similarity, on the other hand, is a measure of how closely two or more items are related. It can be seen as a numerical score between data objects, typically represented as a value ranging from 0 (not similar) to 1 (completely similar). The triangle inequality between objects may or may not hold, depending on the similarity measure used, but two properties must always be maintained: the similarity score must fall between 0 and 1, and it must be symmetric (Dunn, 2004). Symmetry means that, for all x and y, the similarity of x and y must be the same as the similarity of y and x (Achtert et al., 2007).
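To make these two properties concrete, the following short Python sketch (illustrative only; the 1/(1 + distance) transformation is just one common choice, not mandated by any of the cited sources) builds a similarity score from a Euclidean distance so that it is bounded by 0 and 1 and symmetric in its arguments:

    import numpy as np

    def similarity(x, y):
        # Turn a Euclidean distance into a score in (0, 1]: identical objects score 1,
        # and the score shrinks towards 0 as the objects move further apart.
        d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
        return 1.0 / (1.0 + d)

    a, b = [1.0, 2.0], [2.0, 4.0]
    print(similarity(a, b))                      # a value strictly between 0 and 1
    print(similarity(a, b) == similarity(b, a))  # True: symmetry holds
    print(similarity(a, a))                      # 1.0: identical objects are maximally similar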
Advantages of the Clustering Techniques
Change requests can easily consolidate according to structured data which the value domain is
completely defined. Questions, e.g., how many modification requests have been submitted to
priority or severity level, a category, and which ones are still in the Open state can be responded
with a simple IBM® Rational Team Concert™ query (Raymond, & Jiawei 2003).
There are, however, advantages that can be obtained only by deriving intelligence from unstructured data, and grouping change requests based on unstructured data is not trivial. This section therefore describes an approach for investigating Rational Team Concert change-request patterns by tokenizing the text attributes and applying machine learning techniques, specifically clustering algorithms that group change requests by similarity. In using this type of analysis, software development teams benefit in the following areas (Achtert et al., 2007).
Quality improvement
If many change requests are associated with a general theme, there might be an opportunity to improve the process related to that area, which would reduce the number of future issues (Jiawei & Michelle, 2001).
Reuse
Change requests in the same group might be solved by a similar approach, following an overall framework or applying the same solution pattern.
Finding duplicates
Before submitting a new change request, it is more efficient to search for duplicates by checking similar existing requests.
Collaboration patterns
Understanding which team members contribute to solving related change requests can help substantiate organizational change decisions, refine career goals, and develop or improve skills (Achtert et al., 2007).
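The tokenize-and-cluster approach outlined above can be sketched roughly as follows. This is a hedged illustration only: the change-request summaries are invented, scikit-learn is assumed to be available, and TF-IDF followed by k-means merely stands in for whatever pipeline an actual Rational Team Concert integration would use.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Hypothetical change-request summaries exported from the tracking tool.
    summaries = [
        "Login page throws null pointer exception on submit",
        "NullPointerException when saving user profile",
        "Improve wording of the export dialog",
        "Export dialog labels are unclear and inconsistent",
    ]

    # Tokenize the free-text attributes into a TF-IDF matrix ...
    vectors = TfidfVectorizer(stop_words="english").fit_transform(summaries)

    # ... and group similar requests with a clustering algorithm (k-means here).
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for summary, label in zip(summaries, labels):
        print(label, summary)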
Clustering methods
Clustering and classification are fundamental activities in data mining. Classification is mostly used as a supervised learning procedure, while clustering is used for unsupervised learning, and some clustering models are even used for both (Raymond & Jiawei, 2003). The aim of clustering is descriptive, and that of classification is predictive. The new groups are therefore of interest in themselves, and their assessment is intrinsic; in classification tasks, an important part of the assessment is extrinsic, since the groups must reflect some reference set of classes (Dunn, 2004). Clustering groups data instances into subsets in such a manner that similar instances are grouped together, while different instances belong to different groups. The instances are thereby re-organized into an efficient representation that characterizes the population being sampled (Achtert et al., 2007).
Formally, the clustering structure is represented as a set of subsets C = C1, ..., Ck of S, such that S = C1 ∪ ... ∪ Ck and Ci ∩ Cj = ∅ for i ≠ j. Consequently, any instance in S belongs to exactly one subset. Clustering of objects is as old as the human need to describe the salient characteristics of people and objects and to identify them with a type (Raymond & Jiawei, 2003). It embraces various scientific disciplines, from mathematics and statistics to biology and genetics, each of which uses different terms to describe the topologies formed by this analysis. From biological taxonomies to medical syndromes, and from genetic genotypes to manufacturing group technology, the problem is identical: forming categories of entities and assigning individuals to the proper groups within them (Jiawei & Michelle, 2001).
Since clustering is the grouping of similar objects, some measure that can determine whether two objects are similar or dissimilar is required (Dunn, 2004). Two types of measures are used to estimate this relation: similarity measures and distance measures. Many clustering methods use distance measures to determine the similarity or dissimilarity between pairs of objects. It is useful to denote the distance between two points xi and xj as d(xi, xj). A valid distance measure should be symmetric and should obtain its minimum value, usually zero, for identical vectors (Raymond & Jiawei, 2003). The distance measure is called a metric distance measure if it also satisfies the triangle inequality, d(xi, xk) ≤ d(xi, xj) + d(xj, xk), and the property that d(xi, xj) = 0 implies xi = xj.
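A brief Python check of these properties for the ordinary Euclidean distance (a minimal sketch, assuming NumPy is available) looks like this:

    import numpy as np

    def euclidean(x, y):
        # Euclidean distance, one of the most common metric distance measures.
        return np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))

    xi, xj, xk = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([6.0, 8.0])

    print(euclidean(xi, xi))                       # 0.0 for identical vectors
    print(euclidean(xi, xj) == euclidean(xj, xi))  # True: symmetry
    # Triangle inequality: d(xi, xk) <= d(xi, xj) + d(xj, xk)
    print(euclidean(xi, xk) <= euclidean(xi, xj) + euclidean(xj, xk))  # True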
K-means clustering aims to divide n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem (Jiawei & Michelle, 2001). The procedure follows a simple way to classify a given data set through a certain number of clusters, say k clusters, fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because a different location causes a different result. The best choice is therefore to place them as far away from each other as possible.
The next step is to take each point belonging to the given data set and associate it with the nearest centroid (Achtert et al., 2007). When no point is pending, the first step is complete and an early grouping is done. At this point, k new centroids are recalculated as the barycenters of the clusters resulting from the previous step. After these k new centroids are obtained, a new binding has to be done between the same data points and the nearest new centroid. A loop has therefore been generated. As a result of this loop, the k centroids change their location step by step until no more changes occur; that is, the centroids do not move any further.
The algorithm aims at minimizing an objective function, in this case a squared error function:

    J = Σj=1..k Σi=1..n || xi(j) − cj ||²

where || xi(j) − cj ||² is a chosen distance measure between a data point xi(j) and the cluster centre cj; J is an indicator of the distance of the n data points from their respective cluster centres.
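As a small illustration, the objective J can be computed directly from a set of points, their cluster assignments, and the cluster centres; the data below are invented purely for the example.

    import numpy as np

    def kmeans_objective(points, labels, centers):
        # Sum of squared distances of each point to its assigned cluster centre
        # (the objective J that k-means tries to minimize).
        diffs = points - centers[labels]   # x_i(j) - c_j for each point's own centre
        return float(np.sum(diffs ** 2))

    points  = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9]])
    labels  = np.array([0, 0, 1, 1])       # assumed assignment of points to clusters
    centers = np.array([[1.1, 0.9], [5.05, 4.95]])
    print(kmeans_objective(points, labels, centers))  # small J: points sit near their centres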
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. Recalculate the positions of the K centroids after all the objects have been assigned.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a partition of the objects into groups from which the metric to be minimized can be calculated.
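A minimal NumPy sketch of these four steps is given below. It is illustrative rather than definitive: the initial centroids are simply k randomly chosen data points, Euclidean distance is assumed, and an empty cluster keeps its previous centroid.

    import numpy as np

    def kmeans(points, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: place k initial centroids (here: k randomly chosen data points).
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # Step 2: assign each object to the group with the closest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recalculate each centroid as the mean of its assigned objects.
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Step 4: stop once the centroids no longer move.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    data = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5], [0.5, 1.2]])
    labels, centroids = kmeans(data, k=2)
    print(labels, centroids, sep="\n")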
Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration corresponding to the global objective function minimum (Raymond & Jiawei, 2003). The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The k-means algorithm can be run several times to reduce this effect (Jiawei & Michelle, 2001).
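In practice, clustering libraries expose this multiple-runs strategy directly. For instance, assuming scikit-learn is available, its KMeans estimator restarts the algorithm n_init times from different random centres and keeps the solution with the lowest objective (inertia):

    from sklearn.cluster import KMeans
    import numpy as np

    data = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5], [0.5, 1.2]])

    # n_init controls how many times k-means is restarted from different random
    # centres; the run with the lowest objective (inertia) is kept.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    print(model.labels_, model.inertia_)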
K-means is a simple algorithm that has been adapted to many problem domains. As will be seen, it is also a good candidate for extension to work with fuzzy feature vectors.
For example, suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if || x − mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means: make initial guesses for the means m1, m2, ..., mk, then iterate until no change is noticed in any mean (Achtert et al., 2007).
The k-means procedure above is a simple version; it can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It has some weaknesses, as follows:
• The way to initialize the means is not specified. One popular way to start is to choose k of the samples at random.
• The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
• It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation.
• The results depend on the metric used to measure || x − mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
• The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. Unfortunately, there is no general theoretical solution to finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion, but care is needed: increasing k produces smaller error function values by definition, but it also increases the risk of overfitting (Raymond & Jiawei, 2003).
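A rough sketch of this comparison, sometimes called the elbow method, is shown below; the synthetic data and the use of scikit-learn are assumptions made only for the example:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three synthetic blobs, so the "right" answer here is k = 3.
    data = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
                      for loc in ([0, 0], [4, 4], [0, 4])])

    # Compare the error function (inertia) for several values of k; the "elbow"
    # where the improvement levels off is a common, if informal, way to pick k.
    for k in range(1, 7):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
        print(k, round(inertia, 2))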
Advantages of k-means clustering
Time Complexity
According to Shehroz Khan (2015), in terms of execution time K-means is linear in the number of data objects, i.e. O(n), where n is the number of data objects. The time complexity of most hierarchical clustering algorithms is quadratic, i.e. O(n²). For the same amount of data, hierarchical clustering therefore takes a quadratic amount of time (Jiawei & Michelle, 2001).
Shape of Clusters
K-means works well when the shape of the clusters is hyper-spherical (or circular in two dimensions). If the natural clusters in the dataset are not spherical, K-means is probably not the best option (Dunn, 2004).
Repeatability
K-means starts with a random choice of cluster centers; it may therefore yield different clustering results on different runs of the algorithm (Achtert et al., 2007). The results may thus lack consistency and not be repeatable. Hierarchical clustering, by contrast, produces the same clustering results on repeated runs.
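One common mitigation, sketched here under the assumption that scikit-learn is the implementation in use, is to fix the random seed so that repeated runs start from the same centres and therefore return the same partition:

    from sklearn.cluster import KMeans
    import numpy as np

    data = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 7.5]])

    # Fixing the random seed makes the randomly chosen starting centres, and hence
    # the resulting clustering, repeatable across runs.
    run1 = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(data)
    run2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(data)
    print((run1 == run2).all())  # True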
Cosine similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them (Dunn, 2004). The cosine of 0 degrees is 1, and it is less than 1 for any other angle. It is a judgment of orientation rather than magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of −1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1] (Raymond & Jiawei, 2003).
Note that these bounds apply for any number of dimensions, and cosine similarity is most commonly used in high-dimensional positive spaces (Achtert et al., 2007). For example, in text mining
and information retrieval, each term is notionally assigned a different dimension, and a document is characterized by a vector in which the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.
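A small worked example (with an invented three-term vocabulary) shows how term-count vectors with the same orientation score near 1, while documents with no shared terms score 0:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine of the angle between two vectors: dot product over the product of norms.
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical term-count vectors over the vocabulary ["cluster", "data", "football"].
    doc1 = [3, 2, 0]
    doc2 = [1, 1, 0]   # same topics as doc1, different length
    doc3 = [0, 0, 4]   # different subject matter
    print(cosine_similarity(doc1, doc2))  # close to 1: similar orientation
    print(cosine_similarity(doc1, doc3))  # 0.0: no shared terms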
The technique is also used to measure cohesion within clusters in the field of data mining. Cosine distance is the term commonly used for the complement of cosine similarity in positive space. It is important to note that this is not a proper distance metric, as it does not satisfy the triangle inequality property and it violates the coincidence axiom; to repair the triangle inequality property while maintaining the same ordering, it is necessary to convert to angular distance (Dunn, 2004). One reason for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered (Achtert et al., 2007).
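The conversion to angular distance mentioned above can be sketched as follows; dividing by π keeps the result in [0, 1], and for positive (term-count) vectors it stays within [0, 0.5]:

    import numpy as np

    def angular_distance(a, b):
        # Convert cosine similarity into angular distance, which does satisfy the
        # triangle inequality while preserving the ordering given by cosine similarity.
        a, b = np.asarray(a, float), np.asarray(b, float)
        cos = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0)
        return float(np.arccos(cos) / np.pi)   # in [0, 1]

    print(angular_distance([3, 2, 0], [1, 1, 0]))  # small: nearly parallel vectors
    print(angular_distance([3, 2, 0], [0, 0, 4]))  # 0.5: orthogonal term-count vectors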
References
Achtert, E., Böhm, C., Kriegel, H. P., Kröger, P., & Zimek, A. (2007). On exploring complex relationships of correlation clusters. 19th International Conference on Scientific and Statistical Database Management.
Dunn, J. (2004). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics.
Jiawei Han & Michelle Kamber (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann.
Raymond, T. N. & Jiawei, H. (2003). Efficient and effective clustering methods for spatial data mining. Santiago, Chile: Morgan Kaufmann.
