Clustering in Data Mining: K-Means Algorithm Explained

CLUSTERING IN DATA MINING
NAME: SHAIKH MUSKAN A.
SEAT NO: 740
GUIDE NAME: MR. VIJESH SHUKLA

Clustering In Data Mining
Overview
➢ Introduction
➢ What is Clustering?
➢ Requirements Of Clustering
➢ Application Of Clustering
➢ Clustering Types
➢ Clustering Methods
➢ K-means Algorithm
➢ Summary
➢ References

Introduction
➢ Clustering is the process of organising data into meaningful
groups; these groups are called clusters.
➢ Clustering can be seen as a generalisation of classification. In
classification we already know both the objects and the
characteristics that define the groups, so classification amounts to
deciding which existing group a new object belongs in.
➢ Clustering, on the other hand, analyses the data and discovers
the characteristics in it, either based on responses (supervised)
or, more generally, without any responses (unsupervised).

What Is Clustering?
➢ A cluster is a collection of data objects which are
▪ Similar (or related) to one another within the same group (i.e. cluster)
▪ Dissimilar (or unrelated) to the objects in other groups (i.e. clusters)
➢ Clustering:
Clustering is the process of partitioning a set of data (or objects) into a set of
meaningful sub-classes, called clusters.
➢ Unsupervised learning: no predefined classes
➢ In cluster analysis, we first partition the set of data into groups
based on data similarity, and then assign labels to the groups.
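
A common way to quantify the similarity mentioned above is a distance function. The slides do not name one, so the following is only a minimal sketch that assumes numeric attributes and Euclidean distance; the function name euclidean_distance and the sample points are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of attributes."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative objects: p and q lie close together, r lies far away.
p, q, r = (1.0, 1.0), (1.5, 1.0), (9.0, 8.0)
d_pq = euclidean_distance(p, q)  # small distance: candidates for the same cluster
d_pr = euclidean_distance(p, r)  # large distance: candidates for different clusters
```

Under this assumption, a smaller distance means greater similarity, so p and q would be grouped together before either is grouped with r.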

Requirements Of Clustering
➢ Scalability
➢ Ability to deal with different kinds of attributes
➢ Discovery of clusters with arbitrary shape
➢ High dimensionality
➢ Ability to deal with noisy data
➢ Interpretability

Application Of Clustering
➢ Marketing: Helps marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs.
➢ Land use: Identification of areas of similar land use in an earth
observation database.
➢ Document classification: Helps in classifying documents on the web for
information discovery.
➢ City planning: Identifying groups of houses according to their house
type, value, and geographical location.
➢ Outlier detection: Data clustering is also used in outlier detection
applications such as detection of credit card fraud.

Clustering Types
➢ Partitioning Clustering
✓ A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
➢ Hierarchical Clustering
✓ A set of nested clusters organized as a hierarchical tree.

Clustering Types
Figure: Partitioning Clustering vs. Hierarchical Clustering

Clustering Methods
➢ Partitioning Method
➢ Hierarchical Method
➢ Density-based Method
➢ Grid-based Method

Clustering Methods: Partitioning Method
➢ Partitioning Method
A partitioning method subdivides the data objects into a set of k clusters,
where k is the pre-specified number of groups.
➢ The partition must satisfy the following requirements:
▪ Each group contains at least one object.
▪ Each object must belong to exactly one group.
➢ Example:
k-means algorithm
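
The two requirements above can be checked mechanically. The following is a small sketch, not part of the original slides; the function name is_valid_partition is hypothetical, and it assumes every object has a unique, sortable identifier.

```python
def is_valid_partition(objects, groups):
    """Return True if `groups` satisfies both partitioning requirements for `objects`."""
    # Requirement 1: each group contains at least one object.
    if any(len(group) == 0 for group in groups):
        return False
    # Requirement 2: each object belongs to exactly one group.
    # Assumes the identifiers in `objects` are unique, so a duplicate or a
    # missing identifier makes the two sorted lists differ.
    assigned = [obj for group in groups for obj in group]
    return sorted(assigned) == sorted(objects)

object_ids = [1, 2, 3, 4]
ok = is_valid_partition(object_ids, [[1, 2], [3, 4]])          # valid partition
overlap = is_valid_partition(object_ids, [[1, 2], [2, 3, 4]])  # object 2 appears twice
empty = is_valid_partition(object_ids, [[1, 2, 3, 4], []])     # second group is empty
```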

➢ Hierarchical Methods
▪ This method creates a hierarchical decomposition of the given set of data
objects. Hierarchical methods can be classified by how the hierarchical
decomposition is formed:
❑ Agglomerative Approach
• Bottom-up approach.
• Starts with each object forming a separate group, then successively merges the closest groups.
❑ Divisive Approach
• Top-down approach.
• Starts with all objects in the same cluster, then successively splits it into smaller clusters.

➢ Disadvantage
▪ This method is rigid, i.e. once a merge or split is done, it can never be undone.
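
The agglomerative (bottom-up) approach can be sketched as follows. This is an illustration under stated assumptions rather than the method from the slides: the distance between two clusters is taken to be the distance between their closest members (single linkage), merging stops at a requested number of clusters, and the names agglomerative and target_clusters are hypothetical.

```python
import math

def agglomerative(points, target_clusters):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    Single linkage and the stopping rule are assumptions of this sketch.
    """
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    clusters = [[p] for p in points]  # each object starts as its own group
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge; this is never undone later
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
result = agglomerative(points, target_clusters=2)
```

The loop has no step that revisits an earlier merge, which is the rigidity noted above.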

Density-based Method
➢ Density-based Method
• This method is based on the notion of density. The basic idea is
to continue growing a given cluster as long as the density (the number of
objects) in the neighbourhood exceeds some threshold.
❑ Major Features:
• Discovers clusters of arbitrary shape.
➢ Example:
• DBSCAN Algorithm
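
The density test behind this idea can be sketched in a few lines. This is only the "is the neighbourhood dense enough" check, not a full DBSCAN implementation; the parameter names eps (neighbourhood radius) and min_pts (density threshold) follow common DBSCAN terminology, and the values used are illustrative assumptions.

```python
import math

def neighbours(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]

def is_core_point(points, index, eps, min_pts):
    """A cluster keeps growing through a point whose eps-neighbourhood
    holds at least min_pts points."""
    return len(neighbours(points, index, eps)) >= min_pts

# Four points packed together and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (8.0, 8.0)]
core_flags = [is_core_point(points, i, eps=0.6, min_pts=3) for i in range(len(points))]
```

With these values the four packed points pass the test and the isolated point does not, so it would not extend any cluster.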

Grid-based Method
➢ Grid-based Method
• In this method the object space is
quantized into a finite number of cells that form a grid
structure, and clustering operations are performed on the grid.
➢ Advantage
• The major advantage of this method is its fast processing time.
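
Quantizing the object space into cells can be sketched as below. The cell size, the density threshold of two objects per cell, and the function names are assumptions made for illustration; real grid-based algorithms add further steps on top of this.

```python
from collections import Counter

def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of the grid cell that contains it."""
    return tuple(int(coord // cell_size) for coord in point)

def cell_counts(points, cell_size):
    """Count how many objects fall into each occupied cell."""
    return Counter(grid_cell(p, cell_size) for p in points)

points = [(0.5, 0.7), (0.9, 0.2), (3.1, 3.8), (3.9, 3.3), (3.5, 3.5), (7.2, 0.4)]
counts = cell_counts(points, cell_size=2.0)
# Cells holding at least two objects are treated as dense in this sketch.
dense_cells = [cell for cell, n in counts.items() if n >= 2]
```

After this single pass, later steps work on the cell counts rather than on the individual objects, which is one way to see where the speed advantage comes from.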

K-means clustering algorithm
➢ k-means is one of the simplest unsupervised learning algorithms that
solve the well-known clustering problem.
➢ The procedure follows a simple and easy way to partition a given data
set into a certain number of clusters (assume k clusters) fixed in advance.

➢ How it works

➢ Algorithmic steps for k-means clustering
➢ Given k, the k-means algorithm is implemented in four
steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current
partition (the centroid is the center, i.e., mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when no assignment changes.
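
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the round-robin initial partition, Euclidean distance, the max_iterations safety limit, and the handling of a cluster that becomes empty are all assumptions of this sketch.

```python
import math

def mean_point(cluster):
    """Centroid (coordinate-wise mean) of a nonempty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def k_means(points, k, max_iterations=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: partition the objects into k nonempty subsets
    # (round-robin assignment is an arbitrary choice for this sketch).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iterations):
        # Step 2: compute the centroid (mean point) of each current cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # a cluster that became empty keeps its previous centroid
                centroids[c] = mean_point(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(points, k=2)
```

On this small data set the two well-separated groups end up in different clusters after two passes. A different initial partition can lead to a different final result on less clearly separated data.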

K-means clustering
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick the number of clusters beforehand
• Often terminates at a local optimum
• All items forced into a cluster
• Too sensitive to outliers

What Is the Problem of the K-Means?
➢ The k-means algorithm is sensitive to outliers!
➢ An object with an extremely large value may substantially distort the
distribution of the data, because it pulls the cluster mean toward itself.
➢ K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.
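
The difference between a mean and a medoid can be seen on a single cluster that contains one outlier. The data below is made up for illustration, and defining the medoid as the object with the smallest total Euclidean distance to the others is an assumption of this sketch.

```python
import math

def mean_point(cluster):
    """Coordinate-wise mean of the cluster; need not be an actual object."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def medoid(cluster):
    """The object in the cluster with the smallest total distance to all members."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

# Four nearby objects plus one extreme outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (100.0, 100.0)]
centre_by_mean = mean_point(cluster)  # (21.2, 21.2): dragged toward the outlier
centre_by_medoid = medoid(cluster)    # (2.0, 2.0): an actual object in the dense group
```

The mean lands far from every real object, while the medoid stays inside the dense group.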

Summary
➢ Cluster analysis groups objects based on their similarity and has
wide applications.
➢ Measures of similarity can be computed for various types of data.
➢ Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods.
➢ Outlier detection and analysis are very useful for fraud detection,
etc., and can be performed by statistical, distance-based or
deviation-based approaches.
➢ There are still many open research issues in cluster analysis, such as
constraint-based clustering.

