Clustering is an unsupervised machine learning technique that groups unlabeled data points into clusters based on similarities. It partitions data into meaningful subgroups without predefined labels. Common clustering algorithms include k-means, hierarchical, density-based, and grid-based methods. K-means clustering aims to partition data into k clusters where each data point belongs to the cluster with the nearest mean. It is sensitive to outliers but simple and fast.
Clustering in Data Mining: K-Means Algorithm Explained
1. CLUSTERING IN DATA MINING
NAME - SHAIKH MUSKAN A.
SEAT NO - 740
GUIDE NAME - MR. VIJESH SHUKLA
2. Clustering In Data Mining
Overview
➢ Introduction
➢ What is Clustering?
➢ Requirements Of Clustering
➢ Application Of Clustering
➢ Clustering Types
➢ Clustering Methods
➢ K-means Algorithm
➢ Summary
➢ References
3. Introduction
Clustering is the process of organising data into meaningful
groups, and these groups are called clusters.
Clustering is closely related to classification. In
classification, both the objects and the classes they can belong
to are known in advance, so the task amounts to
finding “where to put the new object”.
Clustering, on the other hand, analyses the data and discovers
the groups in it. It is normally unsupervised, i.e. it works
without any responses (labels).
4. What Is Clustering?
➢ A cluster is a collection of data objects which are
▪ Similar (or related) to one another within the same group (i.e., cluster)
▪ Dissimilar (or unrelated) to the objects in other groups (i.e., clusters)
➢ Clustering:
Clustering is the process of partitioning a set of data (or objects) into a set of
meaningful sub-classes, called clusters.
➢ Unsupervised learning: no predefined classes
➢ In cluster analysis, we first partition the set of data into groups
based on data similarity and then assign labels to the groups.
6. Requirements Of Clustering
Scalability
Ability to deal with different kinds of attributes
Discovery of clusters with arbitrary shape
High dimensionality
Ability to deal with noisy data
Interpretability
7. Application Of Clustering
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs.
Land use: Identification of areas of similar land use in an earth
observation database.
Document classification: Helps in classifying documents on the web for
information discovery.
City-planning: Identifying groups of houses according to their house
type, value, and geographical location.
Outlier detection: Data clustering is also used in outlier detection
applications, such as the detection of credit card fraud.
8. Clustering Types
Partitional Clustering
✓ A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
Hierarchical Clustering
✓ A set of nested clusters organized as a hierarchical tree.
11. Clustering Methods: Partitioning Method
Partitioning Method
A partitioning method subdivides the data objects into a set of k clusters,
where k is the pre-specified number of groups.
The partition must satisfy the following requirements:
➢ Each group contains at least one object.
➢ Each object must belong to exactly one group.
➢ Example:
k-means algorithm
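The two requirements above can be checked mechanically. The following is a minimal sketch, assuming the objects are distinct, sortable values; the function name is_valid_partition is hypothetical and only used for illustration.

```python
def is_valid_partition(objects, clusters):
    """Check the two partitioning requirements for a list of clusters."""
    # Requirement 1: each group contains at least one object.
    if any(len(cluster) == 0 for cluster in clusters):
        return False
    # Requirement 2: each object belongs to exactly one group, so the
    # clusters taken together must contain every object exactly once.
    assigned = [obj for cluster in clusters for obj in cluster]
    return sorted(assigned) == sorted(objects)


objects = [1, 2, 3, 4, 5]
ok = is_valid_partition(objects, [[1, 2], [3], [4, 5]])
overlapping = is_valid_partition(objects, [[1, 2], [2, 3], [4, 5]])
has_empty_group = is_valid_partition(objects, [[1, 2, 3, 4, 5], []])
print(ok, overlapping, has_empty_group)
```

Only the first grouping passes: the second places object 2 in two groups, and the third contains an empty group.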
12. Clustering Methods: Hierarchical Method
Hierarchical Methods
▪ This method creates a hierarchical decomposition of the given set of data
objects. Hierarchical methods can be classified by how the hierarchical
decomposition is formed:
❑ Agglomerative Approach
• bottom-up approach.
• starts with each object forming a separate group, then successively merges the closest groups.
❑ Divisive Approach
• top-down approach.
• starts with all objects in the same cluster, then successively splits it into smaller clusters.
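The agglomerative (bottom-up) approach can be sketched in a few lines. This is an illustrative sketch, not a definitive implementation: it assumes one-dimensional points, uses single linkage (distance between the closest pair of members) as the merge criterion, and stops at a target number of clusters. None of these choices come from the slides.

```python
def single_linkage(a, b):
    """Distance between two clusters: the closest pair of members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, target_clusters):
    if target_clusters < 1:
        raise ValueError("target_clusters must be at least 1")
    # Bottom-up: start with each object forming a separate group.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters that are closest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair; a merge is never revisited afterwards.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


result = agglomerative([1.0, 1.5, 2.0, 8.0, 8.5, 20.0], 3)
print(result)  # [[1.0, 1.5, 2.0], [8.0, 8.5], [20.0]]
```

A divisive method would run in the opposite direction, starting from one cluster that holds every point and splitting it repeatedly.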
13. Clustering Methods: Hierarchical Method
Disadvantage
▪ This method is rigid, i.e., once a merge or split is done, it can never be undone.
14. Density-based Method
Density-based Method
• This method is based on the notion of density. The basic idea is
to continue growing the given cluster as long as the density in
the neighbourhood exceeds some threshold.
❑ Major Features:
• Discovers clusters of arbitrary shape.
Example:
• DBSCAN Algorithm
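The idea of growing a cluster while the neighbourhood stays dense can be illustrated with a simplified DBSCAN-style procedure. This is a sketch under stated assumptions: the parameter names eps (neighbourhood radius) and min_pts (density threshold) are the conventional ones but do not appear in the slides, and the brute-force neighbour search is chosen for clarity, not speed.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not dense enough to start a cluster
            continue
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            # Keep growing only while the neighbourhood is dense enough.
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point (50, 50) is labelled as noise rather than being forced into a cluster, which is one practical difference from k-means.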
15. Grid-based Method
Grid-based Method
• In this method the object space is quantized into a finite
number of cells that form a grid structure, and clustering is
performed on the cells rather than on the individual objects.
Advantage
• The major advantage of this method is fast processing time.
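Quantizing the object space into cells can be sketched as follows. This is an illustrative sketch only: the cell size, the density threshold min_count, and the helper names are assumptions, and the slides do not specify how dense cells are subsequently combined into clusters.

```python
from collections import Counter


def grid_cell_counts(points, cell_size):
    """Map each 2-D point to its grid cell and count the points per cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts


def dense_cells(points, cell_size, min_count):
    """Cells holding at least min_count points (assumed density threshold)."""
    counts = grid_cell_counts(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}


points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9),
          (3.1, 3.2), (3.4, 3.9), (7.5, 1.0)]
counts = grid_cell_counts(points, cell_size=1.0)
dense = dense_cells(points, cell_size=1.0, min_count=2)
print(sorted(dense))  # [(0, 0), (3, 3)]
```

After this single pass over the data, later steps work only with the cell counts, which is why processing time depends mainly on the number of cells.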
16. K-means clustering algorithm
k-means is one of the simplest unsupervised learning algorithms for
solving the well-known clustering problem.
The procedure classifies a given data set into a fixed number of
clusters (k clusters) chosen in advance.
18. K-means clustering algorithm
Algorithmic steps for k-means clustering
Given k, the k-means algorithm is implemented in four
steps:
Step 1: Partition the objects into k nonempty subsets.
Step 2: Compute seed points as the centroids of the clusters of the current
partition (the centroid is the center, i.e., mean point, of the cluster).
Step 3: Assign each object to the cluster with the nearest seed point.
Step 4: Go back to Step 2; stop when no assignment changes.
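The four steps above can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the round-robin initial partition, the max_iter safety limit, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for illustration.

```python
import math


def mean_point(cluster):
    """Centroid of a cluster: the coordinate-wise mean of its points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: partition the objects into k nonempty subsets
    # (round-robin here; this initialisation is an assumption).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
assignment, centroids = kmeans(points, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

A different initial partition can lead to a different final result, since the procedure only converges to a local optimum.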
22. K-means clustering
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick the number of clusters beforehand
• Often terminates at a local optimum
• All items forced into a cluster
• Too sensitive to outliers
23. What Is the Problem with K-Means?
The k-means algorithm is sensitive to outliers!
An object with an extremely large value may substantially distort the
distribution of the data, because it pulls the cluster mean towards itself.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.
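The difference between a mean and a medoid as a reference point can be seen on a small one-dimensional example. This is a sketch with made-up values; the helper names mean and medoid are hypothetical, and ties between equally central objects are broken by taking the first one.

```python
def mean(values):
    return sum(values) / len(values)


def medoid(values):
    """The member with the smallest total distance to all other members."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))


cluster = [1, 2, 3, 4, 5]
with_outlier = cluster + [100]  # one extremely large value

print(mean(cluster), medoid(cluster))            # 3.0 3
print(mean(with_outlier), medoid(with_outlier))  # mean jumps to about 19.17, medoid stays 3
```

The single outlier drags the mean far away from the bulk of the data, while the medoid remains an actual object near the centre of the cluster.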
[Figure: two scatter plots, each with both axes running from 0 to 10]
24. Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
Outlier detection and analysis are very useful for fraud detection,
etc. and can be performed by statistical, distance-based or
deviation-based approaches
There are still lots of research issues on cluster analysis, such as
constraint-based clustering
25. References
Data Mining: Next Generation Challenges and Future Directions, Hillol Kargupta, Anupam Joshi, Yelena Yesha, Krishnamoorthy Sivakumar, PHI.
Data Mining: Concepts and Techniques, Jiawei Han.
Data Mining: Techniques and Trends, N.P. Gopalan, B. Sivaselvan, PHI.
https://sites.google.com/
https://towardsdatascience.com/
https://www.tutorialspoint.com