Cluster Analysis
•Unsupervised classification
•Aims to decompose or partition a data set into clusters
such that each cluster is similar within itself but is as
dissimilar as possible to other clusters.
•Inter-cluster (between-groups) distance is maximized
and intra-cluster (within-group) distance is minimized.
•Differs from classification in that there are no predefined
classes and no training data, and the number of clusters
may not even be known.
Dr. Amit Kumar, Dept of CSE,
JUET, Guna
Cluster Analysis
• Not a new field, a branch of statistics
• Many old algorithms are available
• Renewed interest due to data mining and machine
learning
• New algorithms are being developed, in particular to
deal with large amounts of data
• Evaluating a clustering method is difficult
Classification Example
(From text by Roiger and Geatz)
Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold
Clustering Example
(From text by Roiger and Geatz)
Sore Throat  Fever  Swollen Glands  Congestion  Headache
Yes          Yes    Yes             Yes         Yes
No           No     No              Yes         Yes
Yes          Yes    No              Yes         No
Yes          No     Yes             No          No
No           Yes    No              Yes         No
No           No     No              Yes         No
No           No     Yes             No          No
Yes          No     No              Yes         Yes
No           Yes    No              Yes         Yes
Yes          Yes    No              Yes         Yes
Cluster Analysis
•As in classification, each object has several attributes.
•Some methods require that the number of clusters is
specified by the user and a seed to start each cluster be
given.
•The objects are then clustered based on self-similarity.
Questions
• How to define a cluster?
• How to find if objects are similar or not?
• How to define some concept of distance between
individual objects and sets of objects?
Types of Data
•Interval-scaled data - continuous variables on a
roughly linear scale e.g. marks, age, weight. Units can
affect the analysis. Distance in metres is obviously a
larger number than in kilometres. May need to scale or
normalise data (how?).
•Binary data – many variables are binary e.g. gender,
married or not, u/g or p/g. How to compute distance
between binary variables?
Types of Data
•Nominal data - similar to binary but can take more
than two states e.g. colour, staff position. How to
compute distance between two objects with nominal
data?
•Ordinal data - similar to nominal but the different
values are ordered in a meaningful sequence
•Ratio-scaled data - positive measurements on a nonlinear
(for example, roughly exponential) scale
Distance
A simple, well-understood concept. Distance has the
following properties (assume x, y and z to be vectors):
• distance is never negative
• distance x to x is zero
• distance x to y cannot be greater than the sum of
distance x to z and z to y
• distance x to y is the same as from y to x.
Examples: the sum of absolute differences ∑|x_i - y_i| (the
Manhattan distance) or the square root of the sum of squared
differences ( ∑(x_i - y_i)^2 )^(1/2) (the Euclidean distance).
Distance
Three common distance measures are:
1. Manhattan distance, or the sum of the absolute values of
the differences
D(x, y) = ∑_i |x_i - y_i|
2. Euclidean distance
D(x, y) = ( ∑_i (x_i - y_i)^2 )^(1/2)
3. Maximum difference
D(x, y) = max_i |x_i - y_i|
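The three measures can be sketched in a few lines of Python. This is only an illustration: the function names and the sample vectors are assumptions made for the example, not part of the slides.

```python
import math

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_difference(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two made-up vectors, chosen so the results are easy to check by hand.
x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
print(manhattan(x, y))       # 7.0
print(euclidean(x, y))       # 5.0
print(max_difference(x, y))  # 4.0
```

The same pair of points gets three different distances, which is why the choice of measure matters later on.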
Distance between clusters
A more difficult concept.
•Single link
•Complete link
•Average link
We will return to this
Categorical Data
Difficult to find a sensible distance measure between
say China and Australia or between a male and a
female or between a student doing BIT and one doing
MIT.
One possibility is to convert all categorical data into
numeric data (makes sense?). Another possibility is to
treat distance between categorical values as a function
which only has a value 1 or 0.
Distance
Using the approach of treating distance between two
categorical values as a value 1 or 0, it is possible to
compare two objects that have all attribute values
categorical.
Essentially it is just counting how many attribute
values are the same.
Consider the example on the next page and compare one
patient with another.
Distance
(From text by Roiger and Geatz)
Sore Throat  Fever  Swollen Glands  Congestion  Headache
Yes          Yes    Yes             Yes         Yes
No           No     No              Yes         Yes
Yes          Yes    No              Yes         No
Yes          No     Yes             No          No
No           Yes    No              Yes         No
No           No     No              Yes         No
No           No     Yes             No          No
Yes          No     No              Yes         Yes
No           Yes    No              Yes         Yes
Yes          Yes    No              Yes         Yes
Types of Data
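One possibility for two objects with binary attributes:
q = number of attributes for which both objects are 1 (or yes)
t = number of attributes for which both objects are 0 (or no)
r, s = numbers of attributes for which the values do not match
(r: 1 in the first object and 0 in the second; s: the reverse)
d = (r + s)/(r + s + q + t)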
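As a small worked illustration of this formula, the sketch below compares the first two patients of the symptom table, coding Yes as 1 and No as 0. The function name and the 0/1 coding are assumptions made for the example.

```python
def binary_distance(a, b):
    """Mismatch-based distance d = (r + s) / (r + s + q + t) for 0/1 vectors."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)  # both yes
    t = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)  # both no
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)  # yes / no
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)  # no / yes
    return (r + s) / (r + s + q + t)

# Sore throat, fever, swollen glands, congestion, headache
patient1 = [1, 1, 1, 1, 1]
patient2 = [0, 0, 0, 1, 1]
print(binary_distance(patient1, patient2))  # 0.6
```

Three of the five symptoms differ, so the distance is 3/5.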
Distance Measure
Need to carefully look at scales of different attributes’
values. For example, the GPA varies from 0 to 4 while
the age attribute varies from say 20 to 50.
The total distance, the sum of distance between all
attribute values, is then dominated by the age attribute
and will be affected little by the variations in GPA. To
avoid such problems the data should be scaled or
normalized (perhaps so that all data varies between -1
and 1).
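One common way to do this is min-max scaling, sketched below for the range -1 to 1. The function name and the sample GPA and age values are made up for illustration; other normalizations (for example, by standard deviation) are equally possible.

```python
def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly map values so the smallest becomes `low` and the largest `high`."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant attribute carries no information; place it mid-range.
        return [(low + high) / 2 for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]

gpa = [0.0, 2.0, 4.0]   # hypothetical GPA values on a 0 to 4 scale
age = [20, 35, 50]      # hypothetical ages between 20 and 50
print(scale_to_range(gpa))  # [-1.0, 0.0, 1.0]
print(scale_to_range(age))  # [-1.0, 0.0, 1.0]
```

After scaling, a difference across the full GPA range counts as much as a difference across the full age range.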
Distance Measure
The distance measures we have discussed so far
assume equal importance being given to all attributes.
One may want to assign weights to all attributes
indicating their importance. The distance measure will
then need to use these weights in computing the total
distance.
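A weighted distance can be sketched as follows, here for the Manhattan measure. The weights, attribute values and function name are hypothetical and only show where the weights enter the sum.

```python
def weighted_manhattan(x, y, weights):
    """Manhattan distance in which each attribute difference is multiplied by its weight."""
    return sum(w * abs(a - b) for a, b, w in zip(x, y, weights))

# Attributes: (GPA, age). GPA is given four times the importance of age.
student_a = (3.0, 25)
student_b = (3.5, 35)
weights = (2.0, 0.5)
print(weighted_manhattan(student_a, student_b, weights))  # 6.0
```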
Clustering
It is not always easy to predict how many clusters to
expect from a set of data.
One may choose a number of clusters based on some
knowledge of data. The result may be an acceptable set
of clusters. This does not however mean that a
different number of clusters will not provide an even
better set of clusters. Choosing the number of clusters
is a difficult problem.
Types of Methods
•Partitioning methods - given n objects, make k (k ≤ n)
partitions (clusters) of the data and use an iterative relocation
method. It is assumed that each cluster has at least
one object and each object belongs to only one cluster.
•Hierarchical methods - start with one cluster and then
split it in smaller clusters or start with each object in
an individual cluster and then try to merge similar
clusters.
Types of Methods
•Density-based methods - for each data point in a cluster,
at least a minimum number of points must exist within a
given radius
•Grid-based methods - object space is divided into a grid
•Model-based methods - a model is assumed, perhaps
based on a probability distribution
The K-Means Method
Perhaps the most commonly used method.
The method involves choosing the number of clusters, k,
and then choosing k of the given n objects at random as
seeds, the starting samples of these clusters.
Once the seeds have been specified, each member is
assigned to a cluster that is closest to it based on some
distance measure. Once all the members have been
allocated, the mean value of each cluster is computed
and these means essentially become the new seeds.
The K-Means Method
Using the mean value of each cluster, all the members
are now re-allocated to the clusters. In most situations,
some members will change clusters unless the first
guesses of seeds were very good.
This process continues until no changes take place to
cluster memberships.
Different starting seeds obviously may lead to different
clusters.
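The loop described on these slides can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the explicit `seeds` argument, the `max_iter` cap and the four sample points are all assumptions made for the example.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    """Attribute-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def k_means(points, seeds, max_iter=100):
    """Assign points to the nearest centre, recompute means, repeat until stable."""
    centres = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centres)), key=lambda j: euclidean(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no membership changed, so the process has converged
        assignment = new_assignment
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = mean_point(members)
    return assignment, centres

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
seeds = [(1, 1), (9, 8)]
assignment, centres = k_means(points, seeds)
print(assignment)  # [0, 0, 1, 1]
print(centres)     # [(1.0, 1.5), (8.5, 8.0)]
```

Passing the seeds in explicitly makes it easy to rerun the sketch with different starting points and compare the resulting clusters.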
The K-Means Method
Essentially the algorithm tries to build clusters with a
high level of similarity within clusters and a low level of
similarity between clusters. Similarity is measured
against the cluster mean values, and the algorithm tries to
minimize the squared-error function.
This method requires means of variables to be
computed.
The method is scalable and efficient. The algorithm
is not guaranteed to find a global minimum; it usually
terminates at a local minimum.
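The squared-error function is usually written as below. The symbols are assumptions introduced here for illustration: k is the number of clusters, C_j is the set of objects in cluster j, and m_j is the mean of cluster j.

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^{2}
```

Each reassignment step and each recomputation of the means can only keep E the same or lower it, which is why the process settles, though possibly at a local rather than a global minimum.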
Example
Student Age Marks1 Marks2 Marks3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 65
S8 21 53 56 59
S9 19 82 82 60
S10 40 76 60 78
Example
Let the three seeds be the first three records:
Student  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S2       18   79     85     75
S3       23   70     70     52
Now we compute the distances between each object and the
seeds, based on the four attributes. K-Means requires
Euclidean distances, but we also use the sum of absolute
differences to show how the distance metric
can change the results. The table on the next page shows the
distances and the allocation of each object to the nearest seed.
Distances
Student  Age  Marks1  Marks2  Marks3  Manhattan distance  Cluster  Euclidean distance  Cluster
                                      (to seeds 1/2/3)             (to seeds 1/2/3)
S1       18   73      75      57      Seed 1              C1       Seed 1              C1
S2       18   79      85      75      Seed 2              C2       Seed 2              C2
S3       23   70      70      52      Seed 3              C3       Seed 3              C3
S4       20   55      55      55      42/76/36            C3       27/43/22            C3
S5       22   85      86      87      57/23/67            C2       34/14/41            C2
S6       19   91      90      89      66/32/82            C2       40/19/47            C2
S7       20   70      65      65      23/41/21            C3       13/24/14            C1
S8       21   53      56      59      44/74/40            C3       28/42/23            C3
S9       19   82      82      60      20/22/36            C1       12/16/19            C1
S10      40   76      60      78      60/52/58            C2       33/34/32            C3
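The effect of the metric on a single allocation can be checked by hand. The sketch below recomputes the distances for student S7, taking S7's record (20, 70, 65, 65) and the three seeds from the example data; the variable names are illustrative only.

```python
import math

seeds = {
    "C1": (18, 73, 75, 57),  # S1
    "C2": (18, 79, 85, 75),  # S2
    "C3": (23, 70, 70, 52),  # S3
}
s7 = (20, 70, 65, 65)

manhattan = {name: sum(abs(a - b) for a, b in zip(s7, seed))
             for name, seed in seeds.items()}
euclid = {name: math.sqrt(sum((a - b) ** 2 for a, b in zip(s7, seed)))
          for name, seed in seeds.items()}

nearest_manhattan = min(manhattan, key=manhattan.get)
nearest_euclid = min(euclid, key=euclid.get)

print(manhattan)                          # {'C1': 23, 'C2': 41, 'C3': 21}
print(nearest_manhattan, nearest_euclid)  # C3 C1
```

S7 is nearest to seed 3 under the Manhattan distance but nearest to seed 1 under the Euclidean distance, because squaring penalises the single large Marks3 gap to seed 3 more heavily.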
Example
Note the different allocations by the two different
distance metrics. K-Means uses the Euclidean distance.
The means of the new clusters are also different as
shown on the next slide. We now start with the new
means and compute Euclidean distances. The process
continues until there is no change.
Cluster Means
Metric     Cluster  Age   Mark1  Mark2  Mark3
Manhattan  C1       18.5  77.5   78.5   58.5
           C2       24.8  82.8   80.3   82
           C3       21    62     61.5   57.8
Seeds      Seed 1   18    73     75     57
           Seed 2   18    79     85     75
           Seed 3   23    70     70     52
Euclidean  C1       19.0  75.0   74.0   60.7
           C2       19.7  85.0   87.0   83.7
           C3       26.0  63.5   60.3   60.8
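Each row of cluster means is simply the attribute-wise average of the cluster's members. As a check, the sketch below recomputes two of the rows from the student records: Manhattan C1 (members S1 and S9) and Euclidean C2 (members S2, S5 and S6). The function name is an assumption made for the example.

```python
def cluster_mean(members):
    """Attribute-wise mean of the records in one cluster."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

s1 = (18, 73, 75, 57)
s2 = (18, 79, 85, 75)
s5 = (22, 85, 86, 87)
s6 = (19, 91, 90, 89)
s9 = (19, 82, 82, 60)

manhattan_c1 = cluster_mean([s1, s9])
euclidean_c2 = cluster_mean([s2, s5, s6])
print(manhattan_c1)                             # (18.5, 77.5, 78.5, 58.5)
print([round(v, 1) for v in euclidean_c2])      # [19.7, 85.0, 87.0, 83.7]
```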
Distances
Student  Age   Marks1  Marks2  Marks3  Distance (to C1/C2/C3)  Cluster
C1       18.5  77.5    78.5    58.5
C2       24.8  82.8    80.3    82
C3       21    62      61.5    57.8
S1       18    73      75      57      8/52/36                 C1
S2       18    79      85      75      30/18/63                C2
S3       23    70      70      52      22/67/28                C1
S4       20    55      55      55      46/91/26                C3
S5       22    85      86      87      51/7/78                 C2
S6       19    91      90      89      60/15/93                C2
S7       20    70      65      65      19/56/22                C1
S8       21    53      56      59      44/89/22                C3
S9       19    82      82      60      16/32/48                C1
S10      40    76      60      78      52/63/43                C3
Comments
•Euclidean Distance used in the last slide since that is
what K-Means requires.
•The results of using Manhattan distance and Euclidean
distance are quite different
•The last iteration changed the cluster membership of S3
from C3 to C1. New means are now computed followed
by computation of new distances. The process continues
until there is no change.
Distance Metric
Most important issue in methods like the K-Means
method is the distance metric although K-Means is
married to Euclidean distance.
There is no real definition of what metric is a good one
but the example shows that different metrics, used on the
same data, can produce different results making it very
difficult to decide which result is the best.
Whatever metric is used is somewhat arbitrary although
extremely important. Euclidean distance is used in the
next example.
Example
Example deals with data about countries.
Three seeds, France, Indonesia and Thailand were
chosen randomly. Euclidean distances were computed
and nearest neighbors were put together in clusters.
The result is shown on the second slide. Another iteration
did not change the clusters.
Another Example
Example
Notes
• The scales of the attributes in the examples are different.
This affects the results.
• The attributes are not independent. For example, there is a
strong correlation between Olympic medals and income.
• The examples are very simple and too small to illustrate
the methods fully.
• Number of clusters - can be difficult to choose since there is
no generally accepted procedure for determining the
number. One may use trial and error and then decide
based on within-group and between-group distances.
• How to interpret results?
Validating Clusters
A difficult problem.
Statistical testing may be possible: one could test the
hypothesis that there are no clusters.
Data may be available for which the clusters are known.
Hierarchical Clustering
Quite different from partitioning. Involves gradually
merging different objects into clusters (called
agglomerative) or dividing large clusters into smaller
ones (called divisive).
We consider one method.
Agglomerative Clustering
The algorithm normally starts with each cluster
consisting of a single data point. Using a measure of
distance, the algorithm merges two clusters that are
nearest, thus reducing the number of clusters. The
process continues until all the data points are in one
cluster.
The algorithm requires that we be able to compute
distances between two objects and also between two
clusters.
Distance between Clusters
Single Link method – nearest neighbour - distance
between two clusters is the distance between the two
closest points, one from each cluster. Chains can be
formed (why?) and a long string of points may be
assigned to the same cluster.
Complete Linkage method – furthest neighbour –
distance between two clusters is the distance between
the two furthest points, one from each cluster. Does not
allow chains.
Distance between Clusters
Centroid method – distance between two clusters is the
distance between the two centroids or the two centres of
gravity of the two clusters.
Unweighted pair-group average method - distance
between two clusters is the average distance between all
pairs of objects, one from each cluster. This means p*n
distances need to be computed if p and n are the numbers
of objects in the two clusters.
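The four between-cluster distances can be compared side by side in a short sketch. The function names, the choice of Manhattan distance as the underlying object distance, and the two tiny clusters are all assumptions made for illustration.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def single_link(A, B, dist=manhattan):
    """Distance between the two closest points, one from each cluster."""
    return min(dist(a, b) for a in A for b in B)

def complete_link(A, B, dist=manhattan):
    """Distance between the two furthest points, one from each cluster."""
    return max(dist(a, b) for a in A for b in B)

def average_link(A, B, dist=manhattan):
    """Average over all len(A) * len(B) cross-cluster pairs."""
    return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B, dist=manhattan):
    """Distance between the two cluster centroids."""
    return dist(centroid(A), centroid(B))

A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]
print(single_link(A, B))    # 3
print(complete_link(A, B))  # 7
print(average_link(A, B))   # 5.0
print(centroid_link(A, B))  # 5.0
```

Single link only looks at the closest pair, which is why a sequence of nearby points can pull distant objects into one cluster; complete link guards against this by looking at the furthest pair.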
Algorithm
• Each object is assigned to a cluster of its own
• Build a distance matrix by computing distances between
every pair of objects
• Find the two nearest clusters
• Merge them and remove them from the list
• Update matrix by including distance of all objects from the
new cluster
• Continue until all objects belong to one cluster
The next slide shows an example of the final result of
hierarchical clustering.
Hierarchical Clustering
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
Example
We now consider the simple example about students’
marks and use the distance between two objects as the
Manhattan distance.
We will use the centroid method for distance between
clusters.
We first calculate a matrix of distances.
Example
Student Age Marks1 Marks2 Marks3
971234 18 73 75 57
994321 18 79 85 75
965438 23 70 70 52
987654 20 55 55 55
968765 22 85 86 87
978765 19 91 90 89
984567 20 70 65 60
985555 21 53 56 59
998888 19 82 82 60
995544 47 75 76 77
Distance Matrix
Example
• The matrix gives the distance from each object to every
other object.
• The smallest distance of 8 is between object 4 and
object 8. They are combined and put where object 4
was.
• Compute distances from this cluster and update the
distance matrix.
Updated Matrix
Note
• The smallest distance now is 15 between the objects 5
and 6. They are combined in a cluster and 5 and 6 are
removed.
• Compute distances from this cluster and update the
distance matrix.
Updated Matrix
Next
• Look at shortest distance again.
• S3 and S7 are at a distance 16 apart. We merge them
and put C3 where S3 was.
• The updated distance matrix is given on the next slide. It
shows that C3 and S1 have the smallest distance and
are then merged in the next step.
• We stop short of finishing the example.
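The merge sequence of this example can be replayed with a short sketch of the agglomerative algorithm, using Manhattan distance between objects and the centroid method between clusters, as stated for the example. The labels S1 to S10 stand for the ten student records in table order; the function names and the tie-breaking rule (first pair found wins) are assumptions of this sketch.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def agglomerate(objects):
    """Repeatedly merge the two clusters with the nearest centroids.

    objects maps a label to its attribute tuple. Returns the list of
    merges as (labels_a, labels_b, distance), in the order performed.
    """
    clusters = {(label,): [point] for label, point in objects.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                d = manhattan(centroid(clusters[keys[i]]),
                              centroid(clusters[keys[j]]))
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
    return merges

students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75),
    "S3": (23, 70, 70, 52), "S4": (20, 55, 55, 55),
    "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59),
    "S9": (19, 82, 82, 60), "S10": (47, 75, 76, 77),
}
merges = agglomerate(students)
for a, b, d in merges[:3]:
    print(a, b, d)
# ('S4',) ('S8',) 8.0
# ('S5',) ('S6',) 15.0
# ('S3',) ('S7',) 16.0
```

Recomputing every centroid distance on each pass keeps the sketch short; a real implementation would update the distance matrix incrementally, as the algorithm slide describes.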
Updated matrix
Another Example
Distance Matrix (Euclidean Distance)
Next steps
• Malaysia and Philippines have the shortest distance
(France and UK are also very close).
• In the next step we combine Malaysia and Philippines
and replace the two countries by this cluster.
• Then update the distance matrix.
• Then choose the shortest distance and so on until all
countries are in one large cluster.
Divisive Clustering
Opposite to the last method. Less commonly used.
We start with one cluster that has all the objects and
seek to split the cluster into components which
themselves are then split into even smaller components.
The divisive method may be based on using one variable
at a time or using all the variables together.
The algorithm terminates when a termination condition
is met or when each cluster has only one object.
Hierarchical Clustering
• The ordering produced can be useful in gaining some
insight into the data.
• The major difficulty is that once an object is allocated to
a cluster it cannot be moved to another cluster even if
the initial allocation was incorrect
• Different distance metrics can produce different results
Challenges
•Scalability - would the method work with millions of
objects (e.g. Cluster analysis of Web pages)?
•Ability to deal with different types of attributes
•Discovery of clusters with arbitrary shape - often
algorithms discover spherical clusters which are not
always suitable
Challenges
•Minimal requirements for domain knowledge to
determine input parameters - often input
parameters are required e.g. how many clusters.
Algorithms can be sensitive to inputs.
•Ability to deal with noisy or missing data
•Insensitivity to the order of input records - order of
input should not influence the result of the
algorithm
Challenges
•High dimensionality - data can have many attributes
e.g. Web pages
•Constraint-based clustering - sometimes clusters may
need to satisfy constraints

More Related Content

What's hot

Feature Extraction and Principal Component Analysis
Feature Extraction and Principal Component AnalysisFeature Extraction and Principal Component Analysis
Feature Extraction and Principal Component AnalysisSayed Abulhasan Quadri
 
PhD Thesis Proposal
PhD Thesis Proposal PhD Thesis Proposal
PhD Thesis Proposal Ziqiang Feng
 
Quantum computing in machine learning
Quantum computing in machine learningQuantum computing in machine learning
Quantum computing in machine learningkhalidhassan105
 
Artificial Intelligence in Small Embedded System
Artificial Intelligence in Small Embedded SystemArtificial Intelligence in Small Embedded System
Artificial Intelligence in Small Embedded SystemGlobalLogic Ukraine
 
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio..."Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...Edge AI and Vision Alliance
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering methodrajshreemuthiah
 
Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest Rupak Roy
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 
Quantum Computing and Quantum Supremacy at Google
Quantum Computing and Quantum Supremacy at GoogleQuantum Computing and Quantum Supremacy at Google
Quantum Computing and Quantum Supremacy at Googleinside-BigData.com
 
[2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review][2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review]taeseon ryu
 
An introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxAn introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxColleen Farrelly
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision makingShankha Goswami
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reductionMarco Quartulli
 
K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxJebaRaj26
 
Let's build a quantum computer!
Let's build a quantum computer!Let's build a quantum computer!
Let's build a quantum computer!Andreas Dewes
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hakky St
 

What's hot (20)

Feature Extraction and Principal Component Analysis
Feature Extraction and Principal Component AnalysisFeature Extraction and Principal Component Analysis
Feature Extraction and Principal Component Analysis
 
PhD Thesis Proposal
PhD Thesis Proposal PhD Thesis Proposal
PhD Thesis Proposal
 
Yolov3
Yolov3Yolov3
Yolov3
 
Quantum computing in machine learning
Quantum computing in machine learningQuantum computing in machine learning
Quantum computing in machine learning
 
Artificial Intelligence in Small Embedded System
Artificial Intelligence in Small Embedded SystemArtificial Intelligence in Small Embedded System
Artificial Intelligence in Small Embedded System
 
Deep learning
Deep learningDeep learning
Deep learning
 
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio..."Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
"Quantizing Deep Networks for Efficient Inference at the Edge," a Presentatio...
 
Grid based method & model based clustering method
Grid based method & model based clustering methodGrid based method & model based clustering method
Grid based method & model based clustering method
 
Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest Machine Learning Feature Selection - Random Forest
Machine Learning Feature Selection - Random Forest
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
Quantum Computing and Quantum Supremacy at Google
Quantum Computing and Quantum Supremacy at GoogleQuantum Computing and Quantum Supremacy at Google
Quantum Computing and Quantum Supremacy at Google
 
[2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review][2020 CVPR Efficient DET paper review]
[2020 CVPR Efficient DET paper review]
 
An introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptxAn introduction to quantum machine learning.pptx
An introduction to quantum machine learning.pptx
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
 
KNN
KNN KNN
KNN
 
07 dimensionality reduction
07 dimensionality reduction07 dimensionality reduction
07 dimensionality reduction
 
K-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptxK-Means Clustering Algorithm.pptx
K-Means Clustering Algorithm.pptx
 
Let's build a quantum computer!
Let's build a quantum computer!Let's build a quantum computer!
Let's build a quantum computer!
 
3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
Hands-On Machine Learning with Scikit-Learn and TensorFlow - Chapter8
 

Similar to Cluster

Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptxHarsha Patel
 
Predicting Students Performance using K-Median Clustering
Predicting Students Performance using  K-Median ClusteringPredicting Students Performance using  K-Median Clustering
Predicting Students Performance using K-Median ClusteringIIRindia
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysisKritika Jain
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01deepti gupta
 
2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysisAshish965416
 
3Data summarization.pptx
3Data summarization.pptx3Data summarization.pptx
3Data summarization.pptxAmanuelMerga
 
Marketing Research Project on T test
Marketing Research Project on T test Marketing Research Project on T test
Marketing Research Project on T test Meghna Baid
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxnikshaikh786
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdfAmanuelDina
 
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma Detection
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma DetectionIntelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma Detection
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma DetectionAditya pavan kumar
 
cluster analysis(1).pptxbfdhdhhthjhfghhj
cluster analysis(1).pptxbfdhdhhthjhfghhjcluster analysis(1).pptxbfdhdhhthjhfghhj
cluster analysis(1).pptxbfdhdhhthjhfghhjKaranSingh784447
 
An Analysis On Clustering Algorithms In Data Mining
An Analysis On Clustering Algorithms In Data MiningAn Analysis On Clustering Algorithms In Data Mining
An Analysis On Clustering Algorithms In Data MiningGina Rizzo
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 

Similar to Cluster (20)

Classification
ClassificationClassification
Classification
 
Introduction to Clustering . pptx
Introduction    to     Clustering . pptxIntroduction    to     Clustering . pptx
Introduction to Clustering . pptx
 
Predicting Students Performance using K-Median Clustering
Predicting Students Performance using  K-Median ClusteringPredicting Students Performance using  K-Median Clustering
Predicting Students Performance using K-Median Clustering
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysis
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01
 
Clusteranalysis
Clusteranalysis Clusteranalysis
Clusteranalysis
 
Unit 5-1.pdf
Unit 5-1.pdfUnit 5-1.pdf
Unit 5-1.pdf
 
2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis2.7.21 sampling methods data analysis
2.7.21 sampling methods data analysis
 
3Data summarization.pptx
3Data summarization.pptx3Data summarization.pptx
3Data summarization.pptx
 
Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
 
Marketing Research Project on T test
Marketing Research Project on T test Marketing Research Project on T test
Marketing Research Project on T test
 
MODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptxMODULE 4_ CLUSTERING.pptx
MODULE 4_ CLUSTERING.pptx
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Rohit 10103543
Rohit 10103543Rohit 10103543
Rohit 10103543
 
3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf3Measurements of health and disease_MCTD.pdf
3Measurements of health and disease_MCTD.pdf
 
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma Detection
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma DetectionIntelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma Detection
Intelligent Fuzzy System Based Dermoscopic Segmentation for Melanoma Detection
 
Unit 2 - Statistics
Unit 2 - StatisticsUnit 2 - Statistics
Unit 2 - Statistics
 
cluster analysis(1).pptxbfdhdhhthjhfghhj
cluster analysis(1).pptxbfdhdhhthjhfghhjcluster analysis(1).pptxbfdhdhhthjhfghhj
cluster analysis(1).pptxbfdhdhhthjhfghhj
 
An Analysis On Clustering Algorithms In Data Mining
An Analysis On Clustering Algorithms In Data MiningAn Analysis On Clustering Algorithms In Data Mining
An Analysis On Clustering Algorithms In Data Mining
 
Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 

Cluster

  • 2. Cluster Analysis • Unsupervised classification. • Aims to decompose or partition a data set into clusters such that each cluster is similar within itself but as dissimilar as possible to other clusters. • Inter-cluster (between-groups) distance is maximized and intra-cluster (within-group) distance is minimized. • Differs from classification in that there are no predefined classes and no training data, and the number of clusters may not even be known. Dr. Amit Kumar, Dept of CSE, JUET, Guna
  • 3. Cluster Analysis • Not a new field, a branch of statistics • Many old algorithms are available • Renewed interest due to data mining and machine learning • New algorithms are being developed, in particular to deal with large amounts of data • Evaluating a clustering method is difficult
  • 4. Classification Example (From text by Roiger and Geatz)
      Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
      Yes          Yes    Yes             Yes         Yes       Strep throat
      No           No     No              Yes         Yes       Allergy
      Yes          Yes    No              Yes         No        Cold
      Yes          No     Yes             No          No        Strep throat
      No           Yes    No              Yes         No        Cold
      No           No     No              Yes         No        Allergy
      No           No     Yes             No          No        Strep throat
      Yes          No     No              Yes         Yes       Allergy
      No           Yes    No              Yes         Yes       Cold
      Yes          Yes    No              Yes         Yes       Cold
  • 5. Clustering Example (From text by Roiger and Geatz)
      Sore Throat  Fever  Swollen Glands  Congestion  Headache
      Yes          Yes    Yes             Yes         Yes
      No           No     No              Yes         Yes
      Yes          Yes    No              Yes         No
      Yes          No     Yes             No          No
      No           Yes    No              Yes         No
      No           No     No              Yes         No
      No           No     Yes             No          No
      Yes          No     No              Yes         Yes
      No           Yes    No              Yes         Yes
      Yes          Yes    No              Yes         Yes
  • 6. Cluster Analysis • As in classification, each object has several attributes. • Some methods require the user to specify the number of clusters and to supply a seed to start each cluster. • The objects are then clustered based on self-similarity.
  • 7. Questions • How to define a cluster? • How to find if objects are similar or not? • How to define some concept of distance between individual objects and sets of objects?
  • 8. Types of Data • Interval-scaled data - continuous variables on a roughly linear scale, e.g. marks, age, weight. Units can affect the analysis: a distance expressed in metres is a larger number than the same distance in kilometres. May need to scale or normalise data (how?). • Binary data - many variables are binary, e.g. gender, married or not, undergraduate or postgraduate. How to compute distance between binary variables?
  • 9. Types of Data • Nominal data - similar to binary but can take more than two states, e.g. colour, staff position. How to compute distance between two objects with nominal data? • Ordinal data - similar to nominal but the different values are ordered in a meaningful sequence. • Ratio-scaled data - data on a nonlinear scale.
  • 10. Distance A simple, well-understood concept. Distance has the following properties (assume x, y and z to be vectors): • distance is never negative • distance x to x is zero • distance x to y cannot be greater than the sum of distance x to z and z to y • distance x to y is the same as from y to x. Examples: the sum of absolute differences ∑|x_i - y_i| (the Manhattan distance) or the square root of the sum of squared differences (∑(x_i - y_i)^2)^(1/2) (the Euclidean distance).
  • 11. Distance Three common distance measures are: 1. Manhattan distance, or the sum of absolute differences: D(x, y) = ∑_i |x_i - y_i| 2. Euclidean distance: D(x, y) = (∑_i (x_i - y_i)^2)^(1/2) 3. Maximum difference: D(x, y) = max_i |x_i - y_i|
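The three measures listed on this slide can be sketched in a few lines of Python. This is only an illustration; the function names and the two sample vectors (the records for S1 and S4 from the later student example) are choices made here, not something the slides prescribe.

```python
import math


def manhattan(x, y):
    """Sum of absolute differences between corresponding attributes."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def max_difference(x, y):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))


# Illustrative vectors: (Age, Marks1, Marks2, Marks3)
x = (18, 73, 75, 57)
y = (20, 55, 55, 55)

print("Manhattan:", manhattan(x, y))
print("Euclidean:", round(euclidean(x, y), 2))
print("Maximum difference:", max_difference(x, y))
```

All three satisfy the properties from the previous slide (non-negativity, zero self-distance, symmetry, triangle inequality), but they generally rank pairs of objects differently.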
  • 12. Distance between clusters A more difficult concept. • Single link • Complete link • Average link We will return to this.
  • 13. Categorical Data It is difficult to find a sensible distance measure between, say, China and Australia, or between a male and a female, or between a student doing BIT and one doing MIT. One possibility is to convert all categorical data into numeric data (does that make sense?). Another possibility is to treat the distance between categorical values as a function that only takes the value 1 or 0.
  • 14. Distance Using the approach of treating the distance between two categorical values as 1 or 0, it is possible to compare two objects whose attribute values are all categorical. Essentially it just counts how many attribute values are the same. Consider the example on the next slide and compare one patient with another.
  • 15. Distance (From text by Roiger and Geatz)
      Sore Throat  Fever  Swollen Glands  Congestion  Headache
      Yes          Yes    Yes             Yes         Yes
      No           No     No              Yes         Yes
      Yes          Yes    No              Yes         No
      Yes          No     Yes             No          No
      No           Yes    No              Yes         No
      No           No     No              Yes         No
      No           No     Yes             No          No
      Yes          No     No              Yes         Yes
      No           Yes    No              Yes         Yes
      Yes          Yes    No              Yes         Yes
  • 16. Types of Data One possibility for binary attributes: q = number of attributes for which both objects are 1 (or yes); t = number of attributes for which both objects are 0 (or no); r and s = numbers of attributes for which the values do not match (1 against 0, and 0 against 1). Then d = (r + s)/(r + s + q + t).
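As a rough sketch of the formula d = (r + s)/(r + s + q + t), the snippet below compares the first two patients from the Roiger and Geatz table, encoding Yes as 1 and No as 0. The encoding and the function name `binary_distance` are assumptions made for this illustration.

```python
def binary_distance(a, b):
    """Fraction of attributes on which two binary objects disagree."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)  # both yes
    t = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)  # both no
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)  # mismatch: yes/no
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)  # mismatch: no/yes
    return (r + s) / (r + s + q + t)


# Sore throat, Fever, Swollen glands, Congestion, Headache
patient_1 = [1, 1, 1, 1, 1]
patient_2 = [0, 0, 0, 1, 1]

print(binary_distance(patient_1, patient_2))  # three of five attributes differ
```

The two patients disagree on three of the five attributes, so the distance is 3/5.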
  • 17. Distance Measure The scales of the different attributes need to be looked at carefully. For example, GPA varies from 0 to 4 while the age attribute varies from, say, 20 to 50. The total distance, which is the sum of the distances over all attribute values, is then dominated by the age attribute and is affected little by variations in GPA. To avoid such problems the data should be scaled or normalized (perhaps so that all data varies between -1 and 1).
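One common way to do this is min-max scaling, sketched below. The slide does not say which normalisation to use, so this is an assumed choice; the target range [-1, 1] follows the slide's suggestion, and the sample ages and GPAs are made up for illustration.

```python
def min_max_scale(values, lo=-1.0, hi=1.0):
    """Linearly rescale values so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant attribute carries no information; map it to the midpoint.
        return [(lo + hi) / 2 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]


ages = [20, 35, 50]      # hypothetical ages
gpas = [0.0, 2.0, 4.0]   # hypothetical GPAs

print(min_max_scale(ages))
print(min_max_scale(gpas))
```

After scaling, both attributes span the same range, so neither dominates the total distance merely because of its units.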
  • 18. Distance Measure The distance measures discussed so far give equal importance to all attributes. One may want to assign a weight to each attribute to indicate its importance. The distance measure then needs to use these weights in computing the total distance.
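A weighted variant of the Manhattan distance is one simple way to do this. The sketch below is an assumption about how the weights might enter the sum; the slide does not give a formula, and the weights shown are arbitrary.

```python
def weighted_manhattan(x, y, weights):
    """Manhattan distance in which each attribute's difference is multiplied by its weight."""
    return sum(w * abs(a - b) for a, b, w in zip(x, y, weights))


# (Age, Marks1, Marks2, Marks3) for two illustrative records
x = (18, 73, 75, 57)
y = (20, 55, 55, 55)

print(weighted_manhattan(x, y, [1, 1, 1, 1]))  # all attributes equally important
print(weighted_manhattan(x, y, [0, 1, 1, 1]))  # age ignored entirely
```

Setting a weight to zero removes that attribute from the comparison, while a larger weight makes differences in that attribute count for more.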
  • 19. Clustering It is not always easy to predict how many clusters to expect from a set of data. One may choose a number of clusters based on some knowledge of the data. The result may be an acceptable set of clusters. This does not, however, mean that a different number of clusters would not provide an even better set of clusters. Choosing the number of clusters is a difficult problem.
  • 20. Types of Methods • Partitioning methods - given n objects, make k (≤ n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster. • Hierarchical methods - start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and then try to merge similar clusters.
  • 21. Types of Methods • Density-based methods - for each data point in a cluster, at least a minimum number of points must exist within a given radius. • Grid-based methods - the object space is divided into a grid. • Model-based methods - a model is assumed, perhaps based on a probability distribution.
  • 22. The K-Means Method Perhaps the most commonly used method. The method involves choosing the number of clusters k and then choosing k seeds randomly from the given n objects as starting samples of these clusters. Once the seeds have been specified, each member is assigned to the cluster whose seed is closest to it based on some distance measure. Once all the members have been allocated, the mean value of each cluster is computed and these means essentially become the new seeds.
  • 23. The K-Means Method Using the mean value of each cluster, all the members are now re-allocated to the clusters. In most situations, some members will change clusters unless the first guesses of seeds were very good. This process continues until no changes take place to cluster memberships. Different starting seeds may lead to different clusters.
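The procedure described on these two slides can be sketched as follows. This is a minimal illustration, not a production implementation: the seeds are passed in by the caller rather than drawn randomly, ties go to the lowest-numbered cluster, a cluster that loses all its members keeps its previous centre, and the `max_iter` cap is an added safeguard. All names are chosen here for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points, seeds, max_iter=100):
    """Return (assignment, centres) once cluster memberships stop changing."""
    centres = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Allocate every object to its nearest current centre.
        new_assignment = [
            min(range(len(centres)), key=lambda j: euclidean(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no membership changed, so the method has converged
        assignment = new_assignment
        # Recompute each centre as the mean of its members.
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centres[j] = mean(members)
    return assignment, centres


# Tiny illustrative data set with two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, centres = k_means(points, seeds=[points[0], points[2]])
print(assignment, centres)
```

Running the function with different seeds on less clearly separated data is an easy way to see that the starting points can change the final clusters.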
  • 24. The K-Means Method Essentially the algorithm tries to build clusters with a high level of similarity within clusters and a low level of similarity between clusters. Similarity is measured against the cluster mean values, and the algorithm tries to minimize the squared-error function. This method requires means of variables to be computed. The method is scalable and efficient. The algorithm is not guaranteed to find a global minimum; it terminates at a local minimum.
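The slide does not write the squared-error function out. In its usual form, with notation assumed here ($k$ clusters $C_1, \dots, C_k$, and $m_j$ the mean of cluster $C_j$), it is the total squared Euclidean distance of every object from the mean of its own cluster:

```latex
E = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - m_j \rVert^{2},
\qquad
m_j = \frac{1}{\lvert C_j \rvert} \sum_{x \in C_j} x
```

Neither the re-allocation step nor the mean-update step can increase $E$, which is why the iteration settles, though possibly at a local rather than global minimum.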
  • 25. Example
      Student  Age  Marks1  Marks2  Marks3
      S1       18   73      75      57
      S2       18   79      85      75
      S3       23   70      70      52
      S4       20   55      55      55
      S5       22   85      86      87
      S6       19   91      90      89
      S7       20   70      65      65
      S8       21   53      56      59
      S9       19   82      82      60
      S10      40   76      60      78
  • 26. Example Let the three seeds be the first three records (listed below). We now compute the distances between the objects and the seeds based on the four attributes. K-Means requires Euclidean distances, but we also use the sum of absolute differences to show how the distance metric can change the results. The table on the next slide shows the distances and the allocation of each object to the nearest seed.
      Student  Age  Mark1  Mark2  Mark3
      S1       18   73     75     57
      S2       18   79     85     75
      S3       23   70     70     52
  • 27. Distances (each distance entry gives the distance to seed 1 / seed 2 / seed 3)
      Student  Age  Marks1  Marks2  Marks3  Manhattan distance  Cluster  Euclidean distance  Cluster
      S1       18   73      75      57      Seed 1              C1       Seed 1              C1
      S2       18   79      85      75      Seed 2              C2       Seed 2              C2
      S3       23   70      70      52      Seed 3              C3       Seed 3              C3
      S4       20   55      55      55      42/76/36            C3       27/43/22            C3
      S5       22   85      86      87      57/23/67            C2       34/14/41            C2
      S6       19   91      90      89      66/32/82            C2       40/19/47            C2
      S7       20   70      65      60      23/41/21            C3       13/24/14            C1
      S8       21   53      56      59      44/74/40            C3       28/42/23            C3
      S9       19   82      82      60      20/22/36            C1       12/16/19            C1
      S10      47   75      76      77      60/52/58            C2       33/34/32            C3
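The Manhattan column of this table can be reproduced with a short script. The sketch below recomputes the distances and allocations for S4, S5, S6, S8 and S9 only; S7 and S10 are left out because the slides list their attribute values differently in different tables. The dictionary layout and function names are illustrative choices.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


# Seeds are the first three records: (Age, Marks1, Marks2, Marks3)
seeds = {
    "C1": (18, 73, 75, 57),
    "C2": (18, 79, 85, 75),
    "C3": (23, 70, 70, 52),
}

students = {
    "S4": (20, 55, 55, 55),
    "S5": (22, 85, 86, 87),
    "S6": (19, 91, 90, 89),
    "S8": (21, 53, 56, 59),
    "S9": (19, 82, 82, 60),
}


def allocate(record):
    """Return the distance to every seed and the name of the nearest one."""
    distances = {name: manhattan(record, seed) for name, seed in seeds.items()}
    return distances, min(distances, key=distances.get)


for name, record in students.items():
    distances, cluster = allocate(record)
    print(name, "/".join(str(d) for d in distances.values()), cluster)
```

Swapping `manhattan` for a Euclidean distance function reproduces the other column in the same way.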
  • 28. Example Note the different allocations by the two different distance metrics. K-Means uses the Euclidean distance. The means of the new clusters are also different, as shown on the next slide. We now start with the new means and compute Euclidean distances. The process continues until there is no change.
  • 29. Cluster Means
      Cluster  Age   Mark1  Mark2  Mark3
      Manhattan
      C1       18.5  77.5   78.5   58.5
      C2       24.8  82.8   80.3   82
      C3       21    62     61.5   57.8
      Seeds
      Seed 1   18    73     75     57
      Seed 2   18    79     85     75
      Seed 3   23    70     70     52
      Euclidean
      C1       19.0  75.0   74.0   60.7
      C2       19.7  85.0   87.0   83.7
      C3       26.0  63.5   60.3   60.8
  • 30. Distances (each distance entry gives the distance to C1 / C2 / C3)
      Student  Age   Marks1  Marks2  Marks3  Distance  Cluster
      C1       18.5  77.5    78.5    58.5
      C2       24.8  82.8    80.3    82
      C3       21    62      61.5    57.8
      S1       18    73      75      57      8/52/36   C1
      S2       18    79      85      75      30/18/63  C2
      S3       23    70      70      52      22/67/28  C1
      S4       20    55      55      55      46/91/26  C3
      S5       22    85      86      87      51/7/78   C2
      S6       19    91      90      89      60/15/93  C2
      S7       20    70      65      60      19/56/22  C1
      S8       21    53      56      59      44/89/22  C3
      S9       19    82      82      60      16/32/48  C1
      S10      47    75      76      77      52/63/43  C3
  • 31. Comments • Euclidean distance was used in the last slide since that is what K-Means requires. • The results of using Manhattan distance and Euclidean distance are quite different. • The last iteration changed the cluster membership of S3 from C3 to C1. New means are now computed, followed by computation of new distances. The process continues until there is no change.
  • 32. Distance Metric The most important issue in methods like K-Means is the distance metric, although K-Means itself is tied to Euclidean distance. There is no real definition of what makes a metric a good one, but the example shows that different metrics, used on the same data, can produce different results, making it very difficult to decide which result is the best. Whatever metric is used is somewhat arbitrary although extremely important. Euclidean distance is used in the next example.
  • 33. Example This example deals with data about countries. Three seeds, France, Indonesia and Thailand, were chosen randomly. Euclidean distances were computed and nearest neighbors were put together in clusters. The result is shown on the second slide. Another iteration did not change the clusters.
  • 34. Another Example
  • 35. Example
  • 36. Notes • The scales of the attributes in the examples are different. This impacts the results. • The attributes are not independent. For example, there is a strong correlation between Olympic medals and income. • The examples are very simple and too small to illustrate the methods fully. • Number of clusters - choosing it can be difficult since there is no generally accepted procedure for determining the number. One may use trial and error and then decide based on within-group and between-group distances. • How to interpret results?
  • 37. Validating Clusters A difficult problem. Statistical testing may be possible - one could test a hypothesis that there are no clusters. Data for which the clusters are already known may be available.
  • 38. Hierarchical Clustering Quite different from partitioning. Involves gradually merging different objects into clusters (called agglomerative) or dividing large clusters into smaller ones (called divisive). We consider one method.
  • 39. Agglomerative Clustering The algorithm normally starts with each cluster consisting of a single data point. Using a measure of distance, the algorithm merges the two clusters that are nearest, thus reducing the number of clusters. The process continues until all the data points are in one cluster. The algorithm requires that we be able to compute distances between two objects and also between two clusters.
  • 40. Distance between Clusters Single link method - nearest neighbour - the distance between two clusters is the distance between the two closest points, one from each cluster. Chains can be formed (why?) and a long string of points may be assigned to the same cluster. Complete linkage method - furthest neighbour - the distance between two clusters is the distance between the two furthest points, one from each cluster. Does not allow chains.
  • 41. Distance between Clusters Centroid method - the distance between two clusters is the distance between the two centroids, or centres of gravity, of the two clusters. Unweighted pair-group average method - the distance between two clusters is the average distance between all pairs of objects, one from each cluster. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
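The four cluster-to-cluster distances described on this slide and the previous one can be sketched side by side. The function names and the two small sample clusters are illustrative assumptions; any object-to-object distance function can be passed in as `dist`.

```python
from itertools import product


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def centroid(cluster):
    """Component-wise mean (centre of gravity) of a cluster."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


def single_link(A, B, dist):
    """Distance between the two closest points, one from each cluster."""
    return min(dist(a, b) for a, b in product(A, B))


def complete_link(A, B, dist):
    """Distance between the two furthest points, one from each cluster."""
    return max(dist(a, b) for a, b in product(A, B))


def average_link(A, B, dist):
    """Average over all len(A) * len(B) pairs of objects."""
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))


def centroid_link(A, B, dist):
    """Distance between the two cluster centroids."""
    return dist(centroid(A), centroid(B))


A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 2)]
for linkage in (single_link, complete_link, average_link, centroid_link):
    print(linkage.__name__, linkage(A, B, manhattan))
```

On the same pair of clusters the four definitions give different numbers, which is why the choice of linkage can change which clusters get merged.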
  • 42. Algorithm • Assign each object to a cluster of its own. • Build a distance matrix by computing distances between every pair of objects. • Find the two nearest clusters. • Merge them and remove the two original clusters from the list. • Update the matrix by including the distances of all remaining clusters from the new cluster. • Continue until all objects belong to one cluster. The next slide shows an example of the final result of hierarchical clustering.
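The steps above can be sketched as a short function. For brevity this version recomputes cluster distances on every pass instead of maintaining and updating a distance matrix, so it illustrates the logic rather than an efficient implementation. Single link with Manhattan distance is used only as an example; the function names and the sample points are assumptions.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def single_link(A, B, dist):
    return min(dist(a, b) for a in A for b in B)


def agglomerative(points, dist, linkage):
    """Merge the two nearest clusters repeatedly; return the list of merges."""
    clusters = [[i] for i in range(len(points))]  # each object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the two nearest clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(
                    [points[a] for a in clusters[i]],
                    [points[b] for b in clusters[j]],
                    dist,
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Replace the two clusters by their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


points = [(0,), (1,), (5,), (6,), (20,)]
for left, right, d in agglomerative(points, manhattan, single_link):
    print(left, "+", right, "at distance", d)
```

The sequence of merges, together with the distance at which each happened, is exactly the information a dendrogram displays.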
  • 43. Hierarchical Clustering http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
  • 44. Example We now consider the simple example about students' marks and use the Manhattan distance as the distance between two objects. We will use the centroid method for the distance between clusters. We first calculate a matrix of distances.
  • 45. Example
      Student  Age  Marks1  Marks2  Marks3
      971234   18   73      75      57
      994321   18   79      85      75
      965438   23   70      70      52
      987654   20   55      55      55
      968765   22   85      86      87
      978765   19   91      90      89
      984567   20   70      65      60
      985555   21   53      56      59
      998888   19   82      82      60
      995544   47   75      76      77
  • 46. Distance Matrix
  • 47. Example • The matrix gives the distance of each object from every other object. • The smallest distance, 8, is between object 4 and object 8. They are combined and put where object 4 was. • Compute distances from this cluster and update the distance matrix.
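The distance matrix itself is not reproduced in this transcript, but it can be rebuilt from the table of student records with Manhattan distance. The sketch below does that and looks up the closest pair; objects are numbered 1 to 10 in table order, which is an assumption about how the slides index them.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


# (Age, Marks1, Marks2, Marks3) in the order the students are listed
students = [
    (18, 73, 75, 57),
    (18, 79, 85, 75),
    (23, 70, 70, 52),
    (20, 55, 55, 55),
    (22, 85, 86, 87),
    (19, 91, 90, 89),
    (20, 70, 65, 60),
    (21, 53, 56, 59),
    (19, 82, 82, 60),
    (47, 75, 76, 77),
]

n = len(students)
matrix = [[manhattan(students[i], students[j]) for j in range(n)] for i in range(n)]

# Smallest off-diagonal entry, reported with 1-based object numbers.
closest = min((matrix[i][j], i + 1, j + 1) for i in range(n) for j in range(i + 1, n))
print("distance %d between objects %d and %d" % closest)
```

This agrees with the slide: the closest pair is objects 4 and 8 at distance 8. The same matrix also gives 15 for objects 5 and 6 and 16 for objects 3 and 7, the values used in the following steps.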
  • 48. Updated Matrix
  • 49. Note • The smallest distance is now 15, between objects 5 and 6. They are combined in a cluster and 5 and 6 are removed. • Compute distances from this cluster and update the distance matrix.
  • 50. Updated Matrix
  • 51. Next • Look at the shortest distance again. • S3 and S7 are at a distance of 16 apart. We merge them and put C3 where S3 was. • The updated distance matrix is given on the next slide. It shows that C2 and S1 have the smallest distance and are then merged in the next step. • We stop short of finishing the example.
  • 52. Updated Matrix
  • 53. Another Example
  • 54. Distance Matrix (Euclidean Distance)
  • 55. Next steps • Malaysia and Philippines have the shortest distance (France and UK are also very close). • In the next step we combine Malaysia and Philippines and replace the two countries by this cluster. • Then update the distance matrix. • Then choose the shortest distance, and so on until all countries are in one large cluster.
  • 56. Divisive Clustering The opposite of the last method. Less commonly used. We start with one cluster that has all the objects and seek to split the cluster into components, which themselves are then split into even smaller components. The divisive method may be based on using one variable at a time or using all the variables together. The algorithm terminates when a termination condition is met or when each cluster has only one object.
  • 57. Hierarchical Clustering • The ordering produced can be useful in gaining some insight into the data. • The major difficulty is that once an object is allocated to a cluster it cannot be moved to another cluster, even if the initial allocation was incorrect. • Different distance metrics can produce different results.
  • 58. Challenges • Scalability - would the method work with millions of objects (e.g. cluster analysis of Web pages)? • Ability to deal with different types of attributes. • Discovery of clusters with arbitrary shape - algorithms often discover spherical clusters, which are not always suitable.
  • 59. Challenges • Minimal requirements for domain knowledge to determine input parameters - input parameters are often required, e.g. how many clusters. Algorithms can be sensitive to inputs. • Ability to deal with noisy or missing data. • Insensitivity to the order of input records - the order of input should not influence the result of the algorithm.
  • 60. Challenges • High dimensionality - data can have many attributes, e.g. Web pages. • Constraint-based clustering - sometimes clusters may need to satisfy constraints.