DATA CLUSTERING
 DATA
 Data is any raw, unorganized information that has not yet been processed.
 CLUSTER
 A cluster is a group of objects that belong to the same class.
 In a database, a cluster is a set of tables physically stored together as one table that shares common columns.
Data Clustering
 Data clustering is a technique in which information that is logically similar is physically stored together.
 Clustering is “the process of organizing objects into groups whose members are similar in some way.”
 In clustering, objects with similar properties are placed in the same class of objects (e.g., NIC, library).
DATA CLUSTERING
Why clustering?
A few good reasons ...
 Simplification (e.g., a library catalogue)
 Pattern detection (e.g., grouping similar images on Facebook)
 Useful in constructing data concepts
 An unsupervised learning process
 A procedure that identifies groups in the data.
 Where do we use data clustering?
 Data Mining
 Pattern Recognition
 Speech Recognition
 Text Mining
 Web Analysis
 Marketing
 Medical Diagnostics
 Image Processing
Applications of Data Clustering
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity (see the silhouette-score sketch below)
 The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
 The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
What Is Good Clustering ?
Good Clustering
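To make these criteria concrete, one common quality measure is the silhouette score, which compares each point's average distance to its own cluster (intra-class) with its distance to the nearest other cluster (inter-class). Below is a minimal sketch assuming scikit-learn and NumPy are available; the toy data and the choice of k = 2 are illustrative, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy 2-D data: two visually separated groups (illustrative values only).
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

# Partition into k = 2 clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The silhouette score ranges from -1 to 1: values near 1 mean high
# intra-class similarity and low inter-class similarity, i.e. good clusters.
print(silhouette_score(X, labels))
```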
 Data mining is the process of discovering information from large amounts of data using pattern-recognition technologies and mathematical techniques.
 Data mining is widely used in many domains, such as retail, finance, telecommunications, and social media.
Data Clustering in Data Mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD)
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Clustering Methods
Major Clustering Approaches
Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
Given k, find the partition into k clusters that optimizes the chosen partitioning criterion.
 Heuristic methods: the k-means and k-medoids algorithms
 k-means (MacQueen ’67): each cluster is represented by the center (mean) of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster (a minimal k-medoids sketch follows below)
Partitioning Methods
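To show how a medoid differs from a mean, here is a naive k-medoids sketch (a simplified stand-in for PAM, not the full algorithm): each cluster is represented by an actual data object. The function name and parameters below are hypothetical choices for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Naive k-medoids: alternate assignment and medoid-update steps."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between all objects.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each object to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        # For each cluster, pick the member with the smallest total
        # distance to the other members (the new medoid).
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) > 0:
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    # Final assignment against the final medoids.
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels
```

Calling `k_medoids(X, k=2)` on a small NumPy array returns the indices of the medoid objects and the cluster label of each object.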
Given k, the k-means algorithm proceeds in four steps (sketched in code below):
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the mean point of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignments change.
The K-Means Clustering Method
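A minimal sketch of these four steps in plain NumPy; the random seeding and the convergence test are one common choice rather than the only one.

```python
import numpy as np

def k_means(X, k, seed=0):
    """Basic k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: start the partition from k randomly chosen seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 3: assign each object to the nearest seed point (centroid).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Step 2: recompute centroids as the mean point of each cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the assignment no longer moves the centroids.
        if np.allclose(new_centroids, centroids):
            return centroids, labels
        centroids = new_centroids
```

On a handful of well-separated 2-D points, such as those in the example on the next slide, this loop typically converges after a few reassignments.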
[Figure: four scatter plots on a 0–10 by 0–10 grid illustrating successive iterations of the k-means example]
The K-Means Clustering Method EXAMPLE
 Creates a hierarchical decomposition of the set of data (or objects) using some criterion.
Hierarchical Clustering
 Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition (a SciPy sketch follows after the dendrogram).
 Agglomerative (AGNES): bottom-up
 Divisive (DIANA): top-down
[Figure: dendrogram over objects a, b, c, d, e; agglomerative merging forms ab, de, cde, then abcde, while divisive splitting proceeds in the reverse direction]
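A minimal sketch of agglomerative (AGNES-style, bottom-up) clustering using SciPy; the coordinates assigned to the objects a–e are hypothetical, chosen only so that the merge order resembles the dendrogram above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for objects a, b, c, d, e.
names = ["a", "b", "c", "d", "e"]
X = np.array([[0.0, 0.0], [0.3, 0.1],          # a and b close together
              [7.0, 7.0], [9.0, 9.0], [9.2, 8.8]])  # c near the d, e pair

# Bottom-up merging: every object starts as its own cluster, and the two
# closest clusters are merged repeatedly until one cluster remains.
Z = linkage(X, method="average")

# Instead of a fixed k, use a termination condition: cut the tree at a
# distance threshold. Here the cut yields the two clusters ab and cde.
clusters = fcluster(Z, t=5.0, criterion="distance")
print(dict(zip(names, clusters)))
```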
Density-based: based on connectivity and density functions (a DBSCAN sketch follows below)
Grid-based: based on a multiple-level granularity structure
Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
Other Algorithms
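As a concrete instance of a density-based method, here is a minimal DBSCAN sketch assuming scikit-learn; the eps and min_samples values are illustrative, not tuned.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative values only).
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.0, 15.0]])

# Points within eps of each other are connected, and a cluster grows from
# any dense core of at least min_samples neighbours; the sparse point
# [4, 15] is labelled -1 (noise) rather than forced into a cluster.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)
```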
 Scalability
 We need highly scalable clustering algorithms to deal with large databases, i.e., the ability to handle a growing amount of work in a capable manner (see the mini-batch sketch below).
 Ability to deal with different kinds of attributes
 Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.
 High dimensionality
 The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
 Ability to deal with noisy data
 Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may produce poor-quality clusters.
 Interpretability
 The clustering results should be interpretable, comprehensible, and usable.
Requirements of Clustering in Data Mining
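For the scalability requirement in particular, one common tactic is to update the clusters from small random batches instead of the whole database at once. Below is a minimal sketch using scikit-learn's MiniBatchKMeans on synthetic data; the data size and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic "large" dataset: 100,000 points around 5 random centres.
rng = np.random.default_rng(0)
centres = rng.uniform(0, 100, size=(5, 2))
X = np.concatenate([c + rng.normal(scale=2.0, size=(20_000, 2)) for c in centres])

# Mini-batch k-means updates centroids from small random batches,
# trading a little accuracy for far lower memory and time per step.
model = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)
labels = model.fit_predict(X)
print(model.cluster_centers_)
```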
Conclusion
In this presentation, I tried to convey the basic concept of clustering, first by defining clustering and some related terms, and then by giving examples to elaborate the concept. I presented different approaches to data clustering and discussed some algorithms that implement those approaches. The partitioning and hierarchical methods of clustering were explained, and the applications of clustering were also discussed, with examples including medical image databases and data mining using data clustering.
Thank You…
