CLUSTERING IN DATA MINING
NAME - SHAIKH MUSKAN A.
SEAT NO - 740
GUIDE NAME - MR. VIJESH SHUKLA
Clustering In Data Mining
Overview
➢ Introduction
➢ What is Clustering?
➢ Requirements Of Clustering
➢ Application Of Clustering
➢ Clustering Types
➢ Clustering Methods
➢ K-means Algorithm
➢ Summary
➢ References
Introduction
 Clustering is the process of organising data into meaningful
groups, and these groups are called clusters.
 Clustering can be seen as a generalisation of classification. In
classification we already know both the objects and the
characteristics of their classes, so classification amounts to
finding “where to put the new object”.
 Clustering, on the other hand, analyses the data and discovers
the characteristics in it, generally without any responses
(unsupervised).
What Is Clustering?
➢ A cluster is a collection of data objects which are
▪ Similar (or related) to one another within the same group (i.e. cluster)
▪ Dissimilar (or unrelated) to the objects in other groups (i.e. clusters)
➢ Clustering:
Clustering is the process of partitioning a set of data (or objects) into a set of
meaningful sub-classes, called clusters.
➢ Unsupervised learning: no predefined classes
➢ While doing cluster analysis, we first partition the set of data into groups
based on data similarity and then assign labels to the groups.
Example Of Clustering
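A minimal sketch of the "similar within, dissimilar between" idea, assuming hypothetical 2-D points and Euclidean distance (neither is given in the slides). It compares the average distance between objects in the same cluster with the average distance between objects in different clusters.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D data objects, already grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)],
}

def mean_within_distance(clusters):
    """Average distance between pairs of objects in the same cluster."""
    pairs = [dist(p, q) for pts in clusters.values() for p, q in combinations(pts, 2)]
    return sum(pairs) / len(pairs)

def mean_between_distance(clusters):
    """Average distance between pairs of objects in different clusters."""
    pairs = [
        dist(p, q)
        for pts_a, pts_b in combinations(clusters.values(), 2)
        for p in pts_a
        for q in pts_b
    ]
    return sum(pairs) / len(pairs)

within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
print(f"within={within:.2f} between={between:.2f}")
```

For a good clustering the within-cluster figure should be clearly smaller than the between-cluster figure.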
Requirements Of Clustering
 Scalability
 Ability to deal with different kinds of attributes
 Discovery of clusters with arbitrary shape
 High dimensionality
 Ability to deal with noisy data
 Interpretability
Application Of Clustering
 Marketing: Helps marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs.
 Land use: Identification of areas of similar land use in an earth
observation database.
 Document classification: Helps in classifying documents on the web for
information discovery.
 City planning: Identifying groups of houses according to their house
type, value, and geographical location.
 Outlier detection: Clustering is also used in outlier detection applications such as
detection of credit card fraud.
Clustering Types
 Partitioning Clustering
✓ A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
 Hierarchical Clustering
✓ A set of nested clusters organized as a hierarchical tree.
Clustering Types
[Figure: Partitioning Clustering and Hierarchical Clustering]
Clustering Methods
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
Clustering Methods: Partitioning Method
 Partitioning Method
A partitioning method subdivides the data objects into a set of k clusters,
where k is the pre-specified number of groups.
 The partition must satisfy the following requirements:
➢ Each group contains at least one object.
➢ Each object must belong to exactly one group.
➢ Example:
k-means algorithm
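The two requirements above can be checked mechanically. This is a small sketch, assuming objects are distinct, sortable identifiers; the function name `is_valid_partition` is hypothetical.

```python
def is_valid_partition(objects, groups):
    """Check the two partitioning requirements for a list of groups."""
    # Requirement 1: each group contains at least one object.
    if any(len(g) == 0 for g in groups):
        return False
    # Requirement 2: each object belongs to exactly one group
    # (no object missing, no object repeated across groups).
    assigned = [obj for g in groups for obj in g]
    return sorted(assigned) == sorted(objects)

print(is_valid_partition([1, 2, 3, 4], [[1, 2], [3, 4]]))
print(is_valid_partition([1, 2, 3, 4], [[1, 2], [2, 3, 4]]))
```

The first call describes a valid partition; the second places object 2 in two groups and so violates the second requirement.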
Clustering Methods: Hierarchical Method
 Hierarchical Methods
▪ This method creates a hierarchical decomposition of the given set of data
objects. Hierarchical methods can be classified on the basis of how the hierarchical
decomposition is formed, as follows:
❑ Agglomerative Approach
• Bottom-up approach.
• Starts with each object forming a separate group, then successively merges the groups closest to one another.
❑ Divisive Approach
• Top-down approach.
• Starts with all objects in the same cluster, then successively splits it into smaller clusters.
Clustering Methods: Hierarchical Method
 Disadvantage
▪ This method is rigid, i.e. once a merge or split is done, it can never be undone.
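A minimal sketch of the agglomerative (bottom-up) approach. Single linkage, Euclidean distance, the stopping rule `target_clusters`, and the sample points are all assumptions made for illustration. Each merge is final, which shows the rigidity noted above.

```python
from math import dist

def agglomerative(points, target_clusters):
    """Bottom-up clustering with single linkage (an assumed linkage choice).

    Assumes 1 <= target_clusters <= len(points).
    """
    # Start with each object forming a separate group.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; this merge is never undone.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
result = agglomerative(points, target_clusters=2)
print(result)
```

Recording the order of merges instead of stopping at a fixed count would give the nested tree (dendrogram) view of the same process.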
Density-based Method
 Density-based Method
• This method is based on the notion of density. The basic idea is
to continue growing a given cluster as long as the density in
its neighbourhood exceeds some threshold.
❑ Major Features:
• Discovers clusters of arbitrary shape.
 Example:
• DBSCAN algorithm
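A simplified sketch of the DBSCAN idea in plain Python, not a reference implementation. The parameter names `eps` (neighbourhood radius) and `min_pts` (density threshold) follow common usage, and the sample points are hypothetical. A cluster keeps growing through every point whose neighbourhood is dense enough.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (-1 means noise)."""
    labels = [None] * len(points)  # None = not yet visited

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense enough; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                # j is a core point, so keep growing the cluster through it.
                queue.extend(nbrs)
    return labels

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not given in advance, and the isolated point is labelled as noise rather than forced into a cluster.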
Grid-based Method
 Grid-based Method
• In this method the object space is
quantized into a finite number of cells that form a grid
structure, and clustering operations are performed on that grid.
 Advantage
• The major advantage of this method is its fast processing time.
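A small sketch of the quantization step, assuming 2-D points, square cells of a hypothetical size `cell_size`, and an illustrative density threshold of two objects per cell. After this step the work is done on cells rather than on individual objects, which is typically where the speed comes from.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize 2-D points into square cells and count the objects per cell."""
    counts = Counter()
    for x, y in points:
        # Floor division maps a coordinate to the index of its cell.
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts

points = [(0.5, 0.7), (1.2, 0.3), (1.8, 1.9), (7.1, 7.4), (7.9, 6.2), (4.0, 9.5)]
counts = grid_cells(points, cell_size=2.0)

# Cells holding at least two objects are treated as dense (assumed threshold).
dense = {cell for cell, n in counts.items() if n >= 2}
print(counts, dense)
```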
K-means clustering algorithm
 k-means is one of the simplest unsupervised learning algorithms that
solves the well-known clustering problem.
 The procedure follows a simple and easy way to partition a given data
set into a certain number of clusters (assume k clusters).
K-means clustering algorithm
 How it works:
K-means clustering algorithm
 Algorithmic steps for k-means clustering
 Given k, the k-means algorithm is implemented in four
steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current
partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes.
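The four steps above can be sketched in plain Python. The round-robin initial partition, Euclidean distance, the `max_iter` safety cap, and the sample points are assumptions made for this illustration.

```python
from math import dist

def centroid(cluster):
    """Mean point of a nonempty cluster of equal-length tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, k, max_iter=100):
    """Assumes len(points) >= k >= 1; returns (assignment, centroids)."""
    # Step 1: partition the objects into k nonempty subsets (round-robin here).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise go back to step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

The result depends on the initial partition, which is one reason k-means can stop at a local optimum.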
K-means clustering algorithm
 Example:
K-means clustering
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick the number of clusters beforehand
• Often terminates at a local optimum
• All items are forced into a cluster
• Too sensitive to outliers
What Is the Problem of K-Means?
 The k-means algorithm is sensitive to outliers,
since an object with an extremely large value may substantially distort the
distribution of the data.
 K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.
[Figure: two plots, each with both axes running from 0 to 10]
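A short sketch of why a medoid is less sensitive to outliers than a mean, assuming hypothetical 2-D points with one extreme value and Euclidean distance. The mean is dragged toward the outlier, while the medoid must be one of the actual objects and stays inside the dense group.

```python
from math import dist

def mean_point(cluster):
    """Coordinate-wise mean of a nonempty cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def medoid(cluster):
    """The object whose total distance to all other objects is smallest."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

# Four nearby objects plus one outlier with extremely large values.
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (100, 100)]

m = mean_point(cluster)
med = medoid(cluster)
print("mean:", m)
print("medoid:", med)
```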
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 Outlier detection and analysis are very useful for fraud detection,
etc. and can be performed by statistical, distance-based or
deviation-based approaches
 There are still lots of research issues on cluster analysis, such as
constraint-based clustering
References
 Data Mining: Next Generation Challenges & Future
Directions, Hillol Kargupta, Anupam Joshi, Yelena Yesha,
Krishnamoorthy Sivakumar, PHI 9.
 Data Mining: Concepts & Techniques, Jiawei Han
 Data Mining Techniques and Trends, N.P. Gopalan, B. Sivaselvan, PHI
 https://sites.google.com/
 https://towardsdatascience.com/
 https://www.tutorialspoint.com
