The method of identifying similar groups of data in a data set is called clustering. Entities in each group are more similar to one another than to entities of the other groups.
2. OUTLINE
• What is Clustering?
• Types of Clustering
• Types of Clustering Algorithms
• K-Means Clustering
• CLARA
• The K-Medoids Clustering
• BIRCH
• Density-Based Clustering
• DBSCAN
• Grid-Based Clustering
• Hierarchical Clustering
• Dendrogram
• Conclusion
• References
3. What is Clustering?
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to one another than to data points in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
• The method of identifying similar groups of data in a data set is called clustering. Entities in each group are more similar to one another than to entities of the other groups.
4. Types of Clustering
• Hard Clustering: each data point either belongs to a cluster completely or not at all; every observation belongs to exactly one cluster.
• Soft Clustering: instead of assigning each data point to exactly one cluster, each data point can belong to more than one cluster to a certain degree (e.g. a likelihood of belonging to each cluster).
5. Types of Clustering Algorithms
• Connectivity models: these models are based on the notion that data points closer together in data space exhibit more similarity to each other than data points lying farther away. Hierarchical algorithms of this kind follow either an agglomerative (bottom-up) approach or a divisive (top-down) approach.
6. Types of Clustering Algorithms (Continued…)
• Distribution models: these clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian/normal distribution).
• Distance-based/centroid models: these are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm and CLARA are popular algorithms that fall into this category.
7. Density Models
• These models search the data space for regions of varying density of data points, isolate the different density regions, and assign the data points within each region to the same cluster.
• This method is based on the notion of density. The basic idea is to keep growing a given cluster as long as the density in its neighborhood exceeds some threshold, i.e. for each data point within a given cluster, a neighborhood of a given radius has to contain at least a minimum number of points.
8. K-Means Clustering
• K-means is an iterative clustering algorithm that improves the clustering in each iteration, converging to a local optimum of the within-cluster distances. The algorithm works in these 5 steps:
1. Specify the desired number of clusters K. Let us choose k=2 for these 5 data points in 2-D space.
2. Randomly assign each data point to a cluster (k=2).
9. 3. Compute cluster centroids: the centroid of the data points in the red cluster is shown using a red cross, and that of the grey cluster using a grey cross.
4. Re-assign each point to the closest cluster centroid (k=2).
10. 5. Re-compute cluster centroids.
Repeat steps 4 & 5 until no improvements are possible (k=2).
11. EXAMPLE

Height(x0)  Weight(y0)
185         72
170         56
168         60
179         68

Let k=2 for this problem.

Step 1: Randomly take two centroids for k=2, here (185,72) and (170,56):
K1 = (185,72) and K2 = (170,56)
Step 2: Calculate the distance of each point from the K centroids. Here we use the Euclidean distance = sqrt((x0 - x1)² + (y0 - y1)²). To check which cluster the point (168,60) goes to, first find the distances:
(168,60) => K1 = sqrt((168-185)² + (60-72)²) = 20.81
            K2 = sqrt((168-170)² + (60-56)²) = 4.47
Step 3: As 4.47 is smaller, (168,60) goes to the K2 cluster. As the cluster is updated, we now recompute K2:
K2 = ((170+168)/2, (56+60)/2) = (169,58). This is now the updated K2 value.
Step 4: Repeat steps 2 and 3 for the remaining data points.
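The worked example above can be sketched in Python. This is a minimal illustrative k-means (not an optimized implementation), run on the four (height, weight) points from the slide with the same initial centroids:

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Data from the slide: (height, weight), k = 2,
# initial centroids K1 = (185, 72) and K2 = (170, 56)
points = [(185, 72), (170, 56), (168, 60), (179, 68)]
centroids, clusters = kmeans(points, [(185, 72), (170, 56)])
```

On this data the algorithm converges after one re-assignment, with final centroids (182, 70) and (169, 58); the second centroid matches the updated K2 computed in the slide.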
12. PROBLEM OF THE K-MEANS METHOD
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
13. Weaknesses:
• Applicable only when the mean is defined; what about categorical data?
• Need to specify k, the number of clusters, in advance.
• Unable to handle noisy data.
14. CLARA (Clustering LARge Applications)
• It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output.
Strength:
• Deals with larger data sets.
Weaknesses:
• Efficiency depends on the sample size.
• A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
15. THE K-MEDOIDS CLUSTERING METHOD
K-Medoids clustering: find representative objects (medoids) in clusters.
PAM (Partitioning Around Medoids; Kaufmann & Rousseeuw, 1987):
Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering.
PAM works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity).
Efficiency improvements on PAM:
CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples.
CLARANS (Ng & Han, 1994): randomized re-sampling.
16. EXAMPLE

I    X  Y
X1   2  6
X2   3  4
X3   3  8
X4   8  5
X5   7  4

Let k=2 for this problem.

Step 1: Select two random medoids, here C1 = X2 = (3,4) and C2 = X5 = (7,4).
Step 2: Calculate the distance of every point to each medoid, using the Manhattan distance |a - cx| + |b - cy|:

I    X(a)  Y(b)  C1(3,4)  C2(7,4)
X1   2     6     3        7
X3   3     8     4        8
X4   8     5     6        2

Step 3: Compare the C1 and C2 costs for every point and select the minimum. So from the table above, C1 = {X2, X1, X3} and C2 = {X5, X4}.
Step 4: Now calculate the total cost = (3 + 4 + 2) = 9.
Step 5: Now select one of the non-medoids as a new medoid and repeat steps 2, 3, and 4.
Step 6: After recalculating the medoids, calculate the cost again. If the old cost is smaller than the new cost, we undo the swap and terminate; otherwise we keep the swap and repeat the steps.
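The PAM swap loop above can be sketched as follows, using the five points and the initial medoids X2 = (3,4) and X5 = (7,4) from the example. Note that this sketch runs swaps all the way to convergence, so the final cost it reports is lower than the initial cost of 9 computed in the slide:

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum over all non-medoid points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

def pam(points, medoids):
    """Greedy PAM: try swapping each medoid with each non-medoid;
    keep any swap that lowers total cost, stop when none does."""
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                c = total_cost(points, candidate)
                if c < cost:           # keep the swap only if it helps
                    medoids, cost, improved = candidate, c, True
    return medoids, cost

# Points X1..X5 from the slide; initial medoids X2 = (3, 4), X5 = (7, 4)
pts = [(2, 6), (3, 4), (3, 8), (8, 5), (7, 4)]
medoids, cost = pam(pts, [(3, 4), (7, 4)])
```

Starting from cost 9, the first beneficial swap replaces X2 with X1, giving medoids {(2,6), (7,4)} at total cost 8, after which no swap improves further.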
17. BIRCH (BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES)
It is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of the larger dataset.
Phase 1: scan the database to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure).
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
BIRCH has one major drawback: it can only process metric attributes (integers or reals). A metric attribute is any attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present.
Two important terms: Clustering Feature (CF) and CF-tree.
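The Clustering Feature mentioned above is the triple (N, LS, SS): the number of points summarized, their linear sum, and their squared sum. A minimal 2-D sketch of the CF summary (not the full CF-tree) shows why it is so compact: the centroid and radius of a sub-cluster can be recovered from the triple alone, and two summaries merge by simple addition:

```python
import math

class CF:
    """Clustering Feature: (N, LS, SS) for 2-D points: point count,
    linear sum, and squared sum of the points it summarizes."""
    def __init__(self, n=0, ls=(0.0, 0.0), ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    def add(self, point):
        # Absorb one point into the summary
        self.n += 1
        self.ls = tuple(a + b for a, b in zip(self.ls, point))
        self.ss += sum(x * x for x in point)

    def merge(self, other):
        # CF additivity: two sub-cluster summaries combine by addition
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Average spread around the centroid, computed from the summary alone
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

cf = CF()
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    cf.add(p)
```

Here the three points are summarized by just three numbers per dimension, with centroid (3, 4) recoverable exactly.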
18. DENSITY-BASED CLUSTERING METHODS
These methods consider clusters to be dense regions of mutually similar points, distinct from the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters.
Major features:
• Discover clusters of arbitrary shape
• Handle noise
Several interesting studies:
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
19. DBSCAN
Some definitions first:
• Epsilon: also called eps. This is the distance up to which we look for neighbouring points.
• Min_points: the minimum number of points specified by the user.
• Core point: if the number of points inside the eps radius of a point is greater than or equal to min_points, it is called a core point.
• Border point: if the number of points inside the eps radius of a point is less than min_points, but the point lies within the eps radius of a core point, it is called a border point.
• Noise: a point which is neither a core nor a border point is a noise point.
20. ALGORITHM STEPS FOR DBSCAN
1. First assign the min_points and eps values. Then pick a random point that has not yet been assigned to a cluster or designated as an outlier.
2. Determine whether the point's eps-neighbourhood contains at least min_points points; if so, the point becomes a core point, else label the point as an outlier.
3. Once a core point has been found, add all points directly reachable from it to its cluster. Then visit each newly reachable point and add its neighbours to the cluster as well. If a point previously labelled an outlier is added, relabel it as a border point.
4. Repeat the steps above until all points are classified into clusters or as noise.
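The steps above can be sketched as a simple O(n²) Python implementation. This is a teaching sketch; real implementations accelerate the neighbour queries with spatial indexes:

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch. Returns one label per point:
    0, 1, 2, ... for clusters, -1 for noise."""
    labels = {}  # point index -> cluster id (-1 = noise)

    def neighbors(i):
        # All points within eps of point i (includes i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:       # not a core point (for now): noise
            labels[i] = -1
            continue
        labels[i] = cluster_id        # start a new cluster at this core point
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels.get(j) == -1:   # former noise point becomes a border point
                labels[j] = cluster_id
            if j in labels:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:    # j is itself core: expand through it
                seeds.extend(jn)
        cluster_id += 1
    return [labels[i] for i in range(len(points))]

# Two dense blobs plus one isolated point, which ends up as noise
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1),
                 (10, 10), (10, 11), (11, 10), (11, 11),
                 (5, 5)], eps=1.5, min_pts=3)
```

With eps = 1.5 and min_pts = 3, the two 4-point blobs become clusters 0 and 1 and the isolated point (5, 5) is labelled -1 (noise).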
21. GRID-BASED CLUSTERING METHOD
In this method the data space is divided into a finite number of cells that form a grid-like structure. All the clustering operations are performed on these grids and are fast and independent of the number of data objects.
STING (STatistical INformation Grid approach):
• The area is divided into rectangular cells at different levels of resolution, and these form a tree-like structure.
• Each cell at a higher level contains a number of smaller cells at the next lower level.
• Statistical information about each cell is calculated and stored beforehand and is used to answer queries.
• Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells: count, mean, standard deviation, min, max, and the type of distribution (normal, uniform, etc.).
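The bottom-level idea, precomputing per-cell statistics, can be sketched as follows. This shows only a single grid level with count and mean; STING additionally stores standard deviation, min, max, and distribution type per cell, and organizes the cells into a multi-resolution tree:

```python
def grid_summary(points, cell_size):
    """Bucket 2-D points into square cells and precompute per-cell
    statistics (count, mean), in the spirit of grid methods like STING."""
    cells = {}
    for (x, y) in points:
        # Integer cell coordinates identify the cell a point falls into
        key = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(key, []).append((x, y))
    return {
        key: {"count": len(pts),
              "mean": (sum(p[0] for p in pts) / len(pts),
                       sum(p[1] for p in pts) / len(pts))}
        for key, pts in cells.items()
    }

stats = grid_summary([(0.5, 0.5), (0.6, 0.9), (3.2, 3.1)], cell_size=1.0)
```

Queries such as "which cells are dense?" can then be answered from the summaries alone, without touching the individual data points again.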
22. Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This algorithm starts with all the data points assigned to clusters of their own. Then the two nearest clusters are merged into the same cluster. In the end, the algorithm terminates when there is only a single cluster left.
Hierarchical methods:
• Agglomerative approach (bottom-up)
• Divisive approach (top-down)
23. Agglomerative Approach:
• Bottom-up approach.
• We start with each object forming a separate group, and keep merging the objects or groups that are close to one another. We keep doing so until all of the groups are merged into one or a termination condition holds.
Divisive Approach:
• Top-down approach.
• We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or a termination condition holds.
Disadvantage: this method is rigid, i.e. once a merge or split is done, it can never be undone.
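The agglomerative approach described above can be sketched with single linkage, i.e. the distance between two clusters is the distance between their closest members. The linkage choice is an assumption here, as the slides do not specify one:

```python
import math

def agglomerative(points, k):
    """Bottom-up single-linkage sketch: start with one cluster per point,
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters

clusters = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```

Letting the loop run down to k = 1 instead would record the full merge sequence, which is exactly what a dendrogram visualizes.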
25. CONCLUSION
Cluster analysis groups objects based on their similarity and has wide applications. A measure of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, and grid-based methods.
K-means and K-medoids are popular partitioning-based clustering algorithms.
BIRCH and Chameleon are interesting hierarchical clustering algorithms.
DBSCAN, OPTICS, and DENCLUE are interesting density-based algorithms.
STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace clustering algorithm.
The quality of clustering results can be evaluated in various ways, such as determining the number of clusters in a data set and measuring clustering quality.
26. REFERENCES
1. www.google.com
2. www.youtube.com
3. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications.
4. M. R. Anderberg. Cluster Analysis for Applications. Academic Press.
5. M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure.
6. F. Beil, M. Ester, X. Xu. "Frequent Term-Based Text Clustering".
7. M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers.