CLUSTER ANALYSIS
DR ATHAR KHAN
LIAQUAT COLLEGE OF MEDICINE & DENTISTRY
matharm@yahoo.com
4/17/2020 DR ATHAR KHAN 2
DEFINITION
• Cluster Analysis is a way of grouping cases of data
based on the similarity of responses to several
variables.
▪ The fundamental problem clustering addresses is to
divide the data into meaningful groups (clusters).
▪ Grouping variables together → Factor Analysis
▪ Grouping cases together → Cluster Analysis
(Figure: scatter plot of cases partitioned into Cluster 1, Cluster 2 and Cluster 3.)
Unsupervised learning is a machine learning technique in which you do not
supervise the model. Instead, the model works on its own to discover structure
in the data: there is only input data (X) and no corresponding output variable.
Types of Data
▪ The data used in cluster analysis can be interval,
ordinal or categorical.
▪ However, having a mixture of different types of
variables makes the analysis more complicated.
▪ This is because in cluster analysis you need to have
some way of measuring the distance between
observations and the type of measure used will
depend on what type of data you have.
Measures of Distance
▪ A number of different measures have been proposed
to measure 'distance' for categorical data:
▪ the k-means algorithm adapted for categorical data, ROCK, LIMBO,
CLICKS, and Ward's agglomerative algorithm.
▪ Among hierarchical clustering algorithms, Ward's method is the most used.
▪ The most widely used method for measuring the distance
between objects for interval data is the
Euclidean distance.
Euclidean Distance, d
Euclidean distance is the geometric distance
between two objects (or cases). Therefore, if we
were to call George subject i and Zippy subject j,
then we could express their Euclidean distance in
terms of the following equation:

d(i, j) = sqrt( Σ_k (x_ik − x_jk)² )

where k indexes the variables measured on each case. With
Euclidean distances, the smaller the distance, the
more similar the cases.
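The equation above can be sketched in a few lines of Python. The subject names and questionnaire scores below are hypothetical, purely for illustration:

```python
import math

def euclidean_distance(case_i, case_j):
    """Geometric distance between two cases measured on the same variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(case_i, case_j)))

# Hypothetical scores for two subjects on four variables
george = [22, 15, 30, 10]   # subject i
zippy  = [25, 18, 28, 12]   # subject j

d = euclidean_distance(george, zippy)
print(round(d, 3))  # 5.099 — the smaller this value, the more similar the cases
```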
Measures of Distance
▪ When using a measure such as the Euclidean
distance, the scale of measurement of the variables
under consideration is an issue, as changing the scale
will obviously affect the distance between subjects
(e.g. a difference of 10 cm becomes a difference of
100 mm).
▪ To get around this problem each variable can be
standardized (converted to z-scores).
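A minimal sketch of this standardization, assuming NumPy is available; the heights below are made up. Note that the same measurements in cm and in mm give identical z-scores, so the unit no longer matters:

```python
import numpy as np

def standardize(x):
    """Convert a variable to z-scores: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# The same lengths expressed in cm and in mm...
cm = np.array([170.0, 180.0, 165.0])
mm = cm * 10

# ...standardize to exactly the same z-scores
print(np.allclose(standardize(cm), standardize(mm)))  # True
```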
Approaches to Cluster Analysis
▪ There are a number of different methods that can be
used to carry out a cluster analysis:
▪ Hierarchical methods
– Agglomerative methods
– Divisive methods
▪ Non-hierarchical methods (often known as k-means
clustering methods)
Agglomerative Methods
▪ Agglomerative clustering is a bottom-up technique: it starts
by considering each data point as its own cluster and
merges the closest clusters into larger groups, from the
bottom up, until a single cluster remains.
Divisive Clustering
▪ Divisive clustering is the opposite: it starts with one
cluster, which is then divided in two as a function of the
similarities or distances in the data. These new clusters
are then divided, and so on, until each case is a cluster.
▪ Agglomerative methods are used more often than divisive methods.
Hierarchical agglomerative methods
Within this approach to cluster analysis there are a number of different
methods used to determine which clusters should be joined at each stage.
Linkage Function/Creating the Clusters
Nearest neighbour method (single linkage method)
In this method the distance between two clusters is defined to be the distance
between the two closest members, or neighbours.
Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum
distance between members — i.e. the distance between the two subjects that
are furthest apart.
Average (between groups) linkage method (sometimes referred to as
UPGMA)
The distance between two clusters is calculated as the average distance
between all pairs of subjects in the two clusters.
Centroid Method
Here the centroid (mean value for each variable) of each cluster is calculated
and the distance between centroids is used. Clusters whose centroids are
closest together are merged.
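The four linkage functions above can be compared with a short sketch, assuming SciPy is available; the five two-variable cases are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five hypothetical cases measured on two variables
X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Each linkage method defines "distance between clusters" differently,
# but the very first merge is always the globally closest pair of cases
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    print(method, int(Z[0, 0]), int(Z[0, 1]))  # indices of the first pair merged
```

Here cases 2 and 3 are closest, so every method merges them first; the methods diverge later, once the clusters contain several members.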
Ward’s Method
▪ In this method all possible pairs of clusters are combined and
the sum of the squared distances within each cluster is
calculated.
▪ This is then summed over all clusters.
▪ The combination that gives the lowest sum of squares is
chosen.
▪ The aim in Ward’s method is to join cases into clusters such
that the variance within a cluster is minimised.
▪ To be more precise, two clusters are merged if this merger
results in the minimum increase in the error sum of squares.
▪ It is the most popular method.
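Ward's criterion, the minimum increase in the error sum of squares, can be illustrated directly. This is a sketch using hypothetical 2-D clusters, not SPSS's implementation:

```python
import numpy as np

def ess(cluster):
    """Error sum of squares: squared distances from the cluster centroid."""
    c = np.asarray(cluster, dtype=float)
    return ((c - c.mean(axis=0)) ** 2).sum()

def ward_increase(a, b):
    """Increase in total ESS if clusters a and b were merged."""
    return ess(np.vstack([a, b])) - ess(a) - ess(b)

a = np.array([[1.0, 1.0], [1.2, 1.1]])   # a tight cluster
b = np.array([[5.0, 5.0], [5.1, 4.9]])   # a distant cluster
c = np.array([[1.1, 0.9]])               # a single case near cluster a

# Ward's method merges the pair whose merger adds the least ESS,
# so merging a with the nearby case c beats merging a with b
print(ward_increase(a, c) < ward_increase(a, b))  # True
```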
Selecting the optimum number of clusters
▪ Once the cluster analysis has been carried out it is then necessary to
select the 'best' cluster solution.
▪ This involves a trade-off between the number of clusters and the
within-cluster variances.
Dendrogram
(Figure: dendrogram of four cases, numbered 1 to 4.)
In the dendrogram, the height at which branches join
indicates the order in which the clusters were joined.
Dendrograms cannot tell you how many clusters
you should have.
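A dendrogram of this kind can be built programmatically. A sketch assuming SciPy, with five hypothetical one-variable cases (`no_plot=True` returns the tree structure instead of drawing it; pass the result to Matplotlib to plot):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Five hypothetical cases measured on one variable
X = np.array([[1.0], [1.1], [5.0], [5.2], [9.0]])
Z = linkage(X, method="ward")

tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in display order, similar cases adjacent
```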
Data Preparation
• To perform a cluster analysis, generally, the data
should be prepared as follows:
• Any missing values in the data must be removed or
estimated.
• The data must be standardized (converted to z-scores).
Limitations of Cluster Analysis
• There are several things to be aware of when conducting
cluster analysis:
– The different methods of clustering usually give very different results.
This occurs because they use different criteria for merging clusters
(including cases). It is important to think carefully about which method
is best for what you are interested in looking at.
– With the exception of single linkage, the results will be affected by
the way in which the variables are ordered.
– The analysis is not stable when cases are dropped: this occurs because
selection of a case (or merger of clusters) depends on the similarity of
one case to the cluster.
An Example
• Imagine we wanted to look at clusters of cases
referred for psychiatric treatment.
• We measured each subject on four questionnaires:
Spielberger Trait Anxiety Inventory (STAI), the Beck
Depression Inventory (BDI), a measure of Intrusive
Thoughts and Rumination (IT) and a measure of
Impulsive Thoughts and Actions (Impulse).
• The rationale behind this analysis is that people with
the same disorder should report a similar pattern of
scores across the measures (so the profiles of their
responses should be similar).
Video: Hierarchical Clustering (Agglomerative Clustering and
Divisive Clustering)
https://www.youtube.com/watch?v=7enWesSofhg
Agglomeration schedule: Shows how the clusters are combined at each stage.
Stage 1: Cases 1 and 4 have the smallest distance ("Coefficients" = .168) => first
cluster {1,4}
Stage 2: Cases 10 and 12 have the second smallest distance => second cluster
{10,12}
(Figure: dendrogram annotated with stages 1 to 7 of the agglomeration schedule.)
Agglomeration schedule: Shows how the clusters are combined at each stage.
The next part of the table shows the stage at which each cluster first appears.
Agglomeration schedule: Shows how the clusters are combined at each stage.
In stage 6, cluster 1 is the cluster that was formed in stage 1...
Agglomeration schedule: Shows how the clusters are combined at each stage.
Stage 1: Cases 1 and 4 have the smallest distance ("Coefficients" = .168) => first cluster
{1,4}
First cluster {1,4} is merged with case 13 in stage 6 ("Next Stage") => Cluster {1,4,13}
A value of 0 in the table means a case or cluster appears for the first time at that stage.
(Figure: dendrogram annotated with stages 1, 2 and 5.)
▪ The Coefficients column indicates the distance between the two clusters (or
cases) joined at each stage.
▪ The values here depend on the proximity measure and linkage method used
in the analysis.
▪ For a good cluster solution, you will see a sudden jump in the distance
coefficient as you read down the table.
▪ The stage before the sudden change indicates the optimal stopping point for
merging clusters.
▪ Reading up the table from the final stage gives solutions with 1 cluster,
2 clusters, 3 clusters, and so on.
NUMBER OF CLUSTERS
▪ Number of cases: 15
▪ Stage of the 'elbow': 12
▪ Number of clusters = 15 - 12 = 3
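The rule above (clusters = cases minus elbow stage) can be sketched by locating the elbow as the largest jump between successive distance coefficients. The coefficient values below are made up to mimic a 15-case agglomeration schedule:

```python
# Hypothetical agglomeration coefficients for 15 cases (14 merge stages)
coefficients = [0.17, 0.20, 0.25, 0.31, 0.40, 0.52, 0.66, 0.84,
                1.10, 1.45, 1.90, 2.50, 12.00, 16.00]

# The elbow is the stage just before the largest jump in distance
jumps = [b - a for a, b in zip(coefficients, coefficients[1:])]
elbow_stage = jumps.index(max(jumps)) + 1  # 1-based stage before the jump
n_cases = len(coefficients) + 1

print(n_cases - elbow_stage)  # 3 clusters
```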
(Screenshot: selecting the Coefficients column to plot against the stage.)
Scree Plot
(Figure: scree plot of the distance coefficient, y-axis 0 to 20, against the
agglomeration stage, x-axis 1 to 14.)
The dendrogram (or "tree diagram") shows relative similarities between cases.
▪ Notice how the "branches" merge together as you look from left to right in the
dendrogram.
▪ Cases or clusters that are joined by lines "further down" the tree (near the left side
of the dendrogram) are very similar.
▪ Cases or clusters that are joined by lines "further up" the tree (near the right side)
are dissimilar.
▪ Cluster distances are rescaled so that they range from 0 to 25 in this plot.
▪ This would identify 3 clusters, one for each point where a branch intersects
our cut line.
▪ By considering different cut points for our line, we can get solutions with different
numbers of clusters.
▪ A good cluster solution is one with small within-cluster distances but large
between-cluster distances.
(Figure: dendrogram with a green cut line crossing three branches, labelled 1 to 3.)
▪ Choose the number of clusters at the point of the largest increase in
heterogeneity (standardized distance).
(Figure: dendrogram with clusters 1 to 3 marked on the standardized-distance axis.)
▪ This table shows cluster membership for each case, according to the
number of clusters you requested.
▪ You can attempt to interpret the clusters by observing which cases are
grouped together.
▪ Having eyeballed the dendrogram and decided how many
clusters are present it is possible to re-run the analysis asking
SPSS to save a new variable in which cluster codes are assigned
to cases (with the researcher specifying the number of clusters
in the data).
▪ For these data, we saw three clear clusters and so we could
re-run the analysis asking for cluster group codings for three
clusters (in fact, I told you to do this as part of the original
analysis).
▪ The output below shows the resulting codes for each case in this
analysis. It’s pretty clear that these codes map exactly onto the
DSM-IV classifications.
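The step described above, cutting the tree at a chosen number of clusters and saving a membership code for each case, has a direct equivalent outside SPSS. A sketch assuming SciPy, with six hypothetical subjects and made-up standardized scores on the four questionnaires:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical standardized scores (STAI, BDI, IT, Impulse) for 6 subjects
X = np.array([[ 1.5,  1.4, -0.2, -0.3],
              [ 1.4,  1.6, -0.1, -0.4],
              [-0.3, -0.2,  1.5,  1.4],
              [-0.2, -0.1,  1.6,  1.3],
              [-1.2, -1.3, -1.4, -1.1],
              [-1.1, -1.4, -1.3, -1.2]])

Z = linkage(X, method="ward")

# Save a "new variable" of cluster codes, requesting 3 clusters
codes = fcluster(Z, t=3, criterion="maxclust")
print(codes)  # one cluster code per case, e.g. pairs of subjects sharing a code
```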
DR ATHAR KHAN
MBBS, MCPS, DPH, DCPS-HCSM, DCPS-HPE, MBA, PGD-
STATISTICS, CCRP
ASSOCIATE PROFESSOR
DEPARTMENT OF COMMUNITY MEDICINE
LIAQUAT COLLEGE OF MEDICINE & DENTISTRY
KARACHI, PAKISTAN
0092-3232135932
