Cluster analysis
Presented by: Pushkar Kumar
Course: BCA 3rd Year Aft.
Presented to: Ms. Rupali Pandey
Contents
• Introduction
• Categorization of major clustering methods
• Partitioning methods
• Hierarchical methods
• Outlier analysis
Introduction
Clustering
• Clustering is a type of unsupervised learning method: we draw inferences from datasets consisting of input data without labeled responses.
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
• There is no single criterion for a good clustering; it depends on the user and on whatever criteria satisfy their needs.
Drawbacks of Traditional Clustering Algorithms
 They favor clusters approximating spherical shapes.
 They favor clusters of similar size.
 They are poor at handling outliers.
Distance Measures Used in Clustering
1. Centroid approach, using dmean:
   dmean(Ca, Cb) = || ma − mb ||, where ma and mb are the means of clusters Ca and Cb.
2. All-points approach, using dmin:
   dmin(Ca, Cb) = min || pa,i − pb,j ||, the minimum over all pairs of points pa,i ∈ Ca and pb,j ∈ Cb.
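The two distance measures above can be sketched directly. This is an illustrative pure-Python version; the sample clusters `ca` and `cb` are made-up data, not from the slides.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def d_mean(ca, cb):
    # Centroid approach: distance between the two cluster means.
    ma = [sum(xs) / len(ca) for xs in zip(*ca)]
    mb = [sum(xs) / len(cb) for xs in zip(*cb)]
    return euclidean(ma, mb)

def d_min(ca, cb):
    # All-points approach: smallest distance over every cross-cluster pair.
    return min(euclidean(p, q) for p in ca for q in cb)

ca = [(0.0, 0.0), (2.0, 0.0)]
cb = [(5.0, 0.0), (7.0, 0.0)]
print(d_mean(ca, cb))  # centroids are (1, 0) and (6, 0) -> 5.0
print(d_min(ca, cb))   # closest cross pair is (2, 0) and (5, 0) -> 3.0
```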
Applications of cluster analysis:
• It is widely used in many applications such as image processing, data analysis, and pattern recognition.
• It can be used in the field of biology for deriving animal and plant taxonomies and identifying genes with similar capabilities.
• It also helps in information discovery by classifying documents on the web.
• Clustering is used in outlier detection applications such as detecting credit card fraud.
• It also helps in identifying areas of similar land use in an earth observation database.
Categorization of Major Clustering Methods
Clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-Based Method
 Grid-based Method
 Model-Based Method
 Constraints-Based Method
Partitioning Method
 These methods partition the objects into k clusters, and each partition forms one cluster.
• Each group contains at least one object, and each object belongs to exactly one group.
• The method starts with one big cluster and, step by step, partitions the existing clusters until the desired number of clusters is reached.
• It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
• Many algorithms come under partitioning methods; some of the popular ones are k-means and CLARANS (Clustering Large Applications based upon Randomized Search).
K-Means (A Centroid-Based Technique)
• We are given a data set of items, with certain features and values for these features (like a vector).
• The task is to categorize those items into groups. To achieve this, we use the k-means algorithm, an unsupervised learning algorithm.
• The algorithm categorizes the items into k groups by similarity.
• To calculate similarity, we use the Euclidean distance as the measurement.
The algorithm works as follows:
1. First, we initialize k points, called means, randomly.
2. We assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to it so far.
3. We repeat the process for a given number of iterations; at the end, we have our clusters.
The points mentioned above are called means because they hold the mean values of the items assigned to them.
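The three steps above can be sketched in a few lines. This is a minimal illustrative version; the fixed seed, iteration count, and sample data are assumptions for the demo, not part of the slides.

```python
import random

def kmeans(items, k, iterations=10, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k random items as the initial means.
    means = [list(p) for p in rng.sample(items, k)]
    for _ in range(iterations):                      # Step 3: repeat.
        groups = [[] for _ in range(k)]
        for p in items:                              # Step 2a: assign each item
            j = min(range(k),                        # to its closest mean.
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, means[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):               # Step 2b: update each mean
            if g:                                    # to the average of its items.
                means[i] = [sum(xs) / len(g) for xs in zip(*g)]
    return means, groups

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
means, groups = kmeans(data, k=2)
```

On well-separated data like this, the algorithm converges to the two obvious groups regardless of which two points are sampled as initial means.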
Hierarchical Methods
 This method starts with single-point clusters and, step by step, merges clusters upward until the desired number of clusters is reached.
• It begins by treating every data point as a separate cluster.
• New clusters are formed from the previously formed ones.
• It is divided into two categories:
 Agglomerative (bottom-up approach)
 Divisive (top-down approach)
• Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
Basic Concept of the CURE Algorithm
CURE (Clustering Using Representatives)
 It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and all-points approaches.
 It can identify both spherical and non-spherical clusters.
 It uses a predefined number of representative points per cluster.
 It works in the presence of outliers.
 It shrinks the representative points toward the cluster centroid by a shrink factor.
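The shrinking step above can be sketched as moving each representative point toward the cluster centroid by a factor alpha. The function name, the value of alpha, and the sample points are illustrative assumptions.

```python
def shrink(representatives, alpha):
    # Compute the centroid of the representative points.
    dim = len(representatives[0])
    centroid = [sum(r[d] for r in representatives) / len(representatives)
                for d in range(dim)]
    # Move each representative a fraction alpha of the way to the centroid,
    # which dampens the effect of outliers on the cluster boundary.
    return [tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(dim))
            for r in representatives]

reps = [(0.0, 0.0), (4.0, 0.0)]   # two well-scattered representatives
print(shrink(reps, 0.5))          # each moves halfway to the centroid (2, 0)
```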
CURE Architecture
Random Sampling
 When the entire data set is used as input to the algorithm, execution time can be high due to I/O costs, so random samples are used as input instead.
 A random sample fits in main memory.
 Random samples are generated very quickly.
 The overhead of generating a random sample is very small compared to the time required to perform the clustering on the sample.
Partitioning the Sample
 Random samples are created.
 Partitioning helps speed up the CURE algorithm.
 The steps followed are:
 Partition the n sampled points into p partitions of about n/p points each.
 The advantage of partitioning the input is reduced execution time.
 Each group of n/p points fits in main memory, increasing the performance of partial clustering.
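The partitioning step above can be sketched as splitting the sample into p roughly equal chunks, each small enough for in-memory partial clustering. The sizes here are illustrative.

```python
def partition(sample, p):
    # Ceiling division: each partition holds about n/p points.
    size = (len(sample) + p - 1) // p
    # Slice the sample into consecutive chunks of that size.
    return [sample[i:i + size] for i in range(0, len(sample), size)]

sample = list(range(10))
parts = partition(sample, 3)
print([len(g) for g in parts])   # -> [4, 4, 2]
```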
Handling Outliers
 Random sampling filters out the majority of outliers.
 Because of their larger distance from other points, outliers tend to merge with other points less often and therefore grow more slowly.
 Clusters made of outliers contain fewer points than genuine clusters.
 So, first, the clusters that are growing very slowly are identified and eliminated.
 Second, at the end of the growing process, very small clusters are eliminated.
Labeling Data on Disk
 The sampling of the initial data set excludes the majority of data points. These points must be assigned to the clusters created in the earlier phases.
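This labeling phase can be sketched as assigning each left-out point to the cluster owning its nearest representative. The function name, cluster IDs, and sample representatives are illustrative assumptions.

```python
import math

def label(point, reps_by_cluster):
    # reps_by_cluster maps a cluster id to its representative points.
    # The point is assigned to the cluster whose closest representative
    # is nearest to the point.
    return min(reps_by_cluster,
               key=lambda c: min(math.dist(point, r)
                                 for r in reps_by_cluster[c]))

reps = {"A": [(0, 0), (1, 0)], "B": [(8, 8), (9, 8)]}
print(label((2, 1), reps))   # nearest representative is (1, 0) -> "A"
```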
Conclusion
 We have seen that CURE can detect clusters with non-spherical shapes and wide variance in size by using a set of representative points for each cluster.
 CURE also achieves good execution time on large databases by using random sampling and partitioning.
 CURE works well when the database contains outliers; these are detected and eliminated.