DATA CLUSTERING
 DATA
 Data is any raw, unorganized information that has not yet been processed.
 CLUSTER
 A cluster is a group of objects that belong to the same class.
 In a database, a cluster is a set of tables physically stored together as one table that shares common columns.
Data Clustering
 Data clustering is a technique in which information that is logically similar is physically stored together.
 Clustering is “the process of organizing objects into groups whose members are similar in some way.”
 In clustering, objects with similar properties are placed in the same class of objects (e.g., NIC, library).
DATA CLUSTERING
Why clustering?
A few good reasons ...
 Simplification (e.g., a library catalogue)
 Pattern detection (e.g., grouping similar images on Facebook)
 Useful in constructing data concepts
 An unsupervised learning process
 A procedure that identifies groups in the data.
 Where do we use data clustering?
 Data Mining
 Pattern Recognition
 Speech Recognition
 Text Mining
 Web Analysis
 Marketing
 Medical Diagnostics
 Image Processing
Applications of Data Clustering
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity (see the silhouette-score sketch below)
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
What Is Good Clustering ?
Good Clustering
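To make these criteria concrete, one common quality measure is the silhouette score, which compares each point's average distance to its own cluster (intra-class) with its distance to the nearest other cluster (inter-class). Below is a minimal sketch assuming scikit-learn and NumPy are available; the toy data and the choice of k = 2 are illustrative, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy 2-D data: two visually separated groups (illustrative values only).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Partition into k = 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The silhouette score ranges from -1 to 1: values near 1 mean high
# intra-class similarity and low inter-class similarity, i.e. good clusters.
print(silhouette_score(X, labels))
```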
 Data mining is the process of discovering information from large amounts of data using pattern-recognition technologies and mathematical techniques.
 Data mining is widely used in many domains, such as retail, finance, telecommunications, and social media.
Data Clustering in Data Mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD)
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
Major Clustering Approaches
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find the partition into k clusters that optimizes the chosen partitioning criterion.
 Heuristic methods: the k-means and k-medoids algorithms
 k-means (MacQueen ’67): each cluster is represented by the center (mean) of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster (a minimal k-medoids sketch follows below)
Partitioning Methods
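To show how a medoid differs from a mean, here is a naive k-medoids sketch (a simplified stand-in for PAM, not the full algorithm): each cluster is represented by an actual data object. The function name and parameters below are hypothetical choices for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Naive k-medoids: alternate assignment and medoid-update steps."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all objects.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each object to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        # For each cluster, pick the member with the smallest total
        # distance to the other members (the new medoid).
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    # Final assignment against the final medoids.
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

Calling `k_medoids(X, k=2)` on a small NumPy array returns the indices of the medoid objects and the cluster label of each object.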
Given k, the k-means algorithm proceeds in four steps (sketched in code below):
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the mean point of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignments change.
The K-Means Clustering Method
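A minimal sketch of these four steps in plain NumPy; the random seeding and the convergence test are one common choice rather than the only one.

```python
import numpy as np

def k_means(X, k, seed=0):
    """Basic k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: start the partition from k randomly chosen seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 3: assign each object to the nearest seed point (centroid).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Step 2: recompute centroids as the mean point of each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the assignment no longer moves the centroids.
        if np.allclose(new_centroids, centroids):
            return centroids, labels
        centroids = new_centroids
```

On a handful of well-separated 2-D points, such as those in the example on the next slide, this loop typically converges after a few reassignments.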
[Figure: four scatter plots on a 0–10 by 0–10 grid illustrating successive iterations of the k-means example]
The K-Means Clustering Method EXAMPLE
 Creates a hierarchical decomposition of the set of data (or objects) using some criterion.
Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition (a SciPy sketch follows after the dendrogram).
 Agglomerative (AGNES): bottom-up
 Divisive (DIANA): top-down
[Figure: dendrogram over objects a, b, c, d, e; agglomerative merging forms ab, de, cde, then abcde, while divisive splitting proceeds in the reverse direction]
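A minimal sketch of agglomerative (AGNES-style, bottom-up) clustering using SciPy; the coordinates assigned to the objects a–e are hypothetical, chosen only so that the merge order resembles the dendrogram above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for objects a, b, c, d, e.
names = ["a", "b", "c", "d", "e"]
X = np.array([[0.0, 0.0], [0.3, 0.1],          # a and b close together
              [7.0, 7.0], [9.0, 9.0], [9.2, 8.8]])  # c near the d, e pair

# Bottom-up merging: every object starts as its own cluster, and the two
# closest clusters are merged repeatedly until one cluster remains.
Z = linkage(X, method="average")

# Instead of a fixed k, use a termination condition: cut the tree at a
# distance threshold. Here the cut yields the two clusters ab and cde.
clusters = fcluster(Z, t=5.0, criterion="distance")
print(dict(zip(names, clusters)))
```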
Density-based: based on connectivity and density functions (a DBSCAN sketch follows below)
Grid-based: based on a multiple-level granularity structure
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
Other Algorithms
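As a concrete instance of a density-based method, here is a minimal DBSCAN sketch assuming scikit-learn; the eps and min_samples values are illustrative, not tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative values only).
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.0, 15.0]])

# Points within eps of each other are connected, and a cluster grows from
# any dense core of at least min_samples neighbours; the sparse point
# [4, 15] is labelled -1 (noise) rather than forced into a cluster.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)
```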
 Scalability
 We need highly scalable clustering algorithms to deal with large databases, i.e., the ability to handle a growing amount of work in a capable manner (see the mini-batch sketch below).
 Ability to deal with different kinds of attributes
 Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
 High dimensionality
 The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
 Ability to deal with noisy data
 Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
 Interpretability
 The clustering results should be interpretable, comprehensible, and usable.
Requirements of Clustering in Data Mining
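For the scalability requirement in particular, one common tactic is to update the clusters from small random batches instead of the whole database at once. Below is a minimal sketch using scikit-learn's MiniBatchKMeans on synthetic data; the data size and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic "large" dataset: 100,000 points around 5 random centres.
rng = np.random.default_rng(0)
centres = rng.uniform(0, 100, size=(5, 2))
X = np.concatenate([c + rng.normal(scale=2.0, size=(20_000, 2)) for c in centres])

# Mini-batch k-means updates centroids from small random batches,
# trading a little accuracy for far lower memory and time per step.
model = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)
labels = model.fit_predict(X)
print(model.cluster_centers_)
```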
Conclusion
In this presentation, I tried to convey the basic concept of clustering, first by defining clustering and some related terms, and then by giving examples to elaborate the concept. I presented different approaches to data clustering and discussed some algorithms that implement those approaches. The partitioning and hierarchical methods of clustering were explained, and the applications of clustering were also discussed, with examples including medical image databases and data mining using data clustering.
Thank You…
