CLUSTER ANALYSIS
Wei-Jiun, Shen Ph. D.
When you were still young and naïve…
 Classification
You may classify them by
 Shape
You may classify them by
 Color
You may classify them by
 Sum of internal angles
Classification
 Shape
 Color
 Sum of internal angles
 Similarity of characteristics
Purpose of cluster analysis
 Grouping objects based on the similarity of
characteristics they possess.
 Homogeneity
 Heterogeneity
 Geometrically, the objects within a cluster will be
close together, while different clusters will be
far apart.
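A minimal sketch of the geometric idea above, assuming two clusters of made-up 2-D points (the coordinates and the helper names `mean_within` and `mean_between` are illustrative only). It compares the average distance inside each cluster with the average distance between the clusters.

```python
from itertools import combinations, product
from math import dist

# Hypothetical 2-D observations, already assigned to two clusters.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

def mean_within(cluster):
    """Average distance over all pairs inside one cluster (homogeneity)."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_between(c1, c2):
    """Average distance over all cross-cluster pairs (heterogeneity)."""
    pairs = list(product(c1, c2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

within_a = mean_within(cluster_a)
within_b = mean_within(cluster_b)
between = mean_between(cluster_a, cluster_b)
print(within_a, within_b, between)
```

For a good solution the between-cluster value should clearly exceed both within-cluster values.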
Major role that cluster analysis can play
 Data reduction
 Classify a large number of observations into
manageable groups
 Taxonomy description
 Exploratory
 Confirmatory
 Examining the influence of cluster membership on
dependent variables
 Whether different motivational constructs are
differentially associated with effort and enjoyment
How does cluster analysis work?
 The primary objective of cluster analysis is to
define the structure of the data by placing the
most similar observations into groups.
 What clustering variables can be used?
 How do we measure similarity?
 How do we form clusters?
 How many clusters do we form?
Selecting clustering variables
 Statistically,
 Any quantitative variable
 Theoretically, conceptually, practically,
 Theoretical foundation corresponding to the research question
Measuring similarity
 Similarity
 The degree of correspondence among objects across
all of the characteristics.
 Correlational measures
 Distance measures
Similarity measure
 Correlation measure
 Grouping cases based on response pattern
 Distance measure
 Grouping cases based on distance
[Figure: profiles of case1, case2, and case3 across variables X1 to X4; vertical axis from 0 to 5]
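A small sketch of how the two kinds of similarity measure can disagree. The three score profiles below are invented for illustration (they are not the values plotted on the slide), and `pearson` is a hand-written helper.

```python
from math import dist, sqrt

def pearson(x, y):
    """Pearson correlation between two score profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores on X1..X4.
case1 = [1, 2, 1, 2]
case2 = [4, 5, 4, 5]  # same pattern as case1, higher level
case3 = [2, 1, 2, 1]  # similar level to case1, opposite pattern

print(pearson(case1, case2), dist(case1, case2))
print(pearson(case1, case3), dist(case1, case3))
```

A correlational measure pairs case1 with case2 (same pattern), whereas a distance measure pairs case1 with case3 (closer in level).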
Distance measures
 Euclidean distance
 Squared Euclidean distance
 City-block (Manhattan) distance
 Chebychev distance
 Mahalanobis D² (standardization / variance-covariance)
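The first four measures can be sketched in a few lines. The function names and the two example points are assumptions for illustration. Mahalanobis D² is left out because it also needs the inverse of the variance-covariance matrix.

```python
from math import sqrt

def squared_euclidean(x, y):
    """Sum of squared differences across variables."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    """Straight-line distance: square root of the squared Euclidean distance."""
    return sqrt(squared_euclidean(x, y))

def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Chebychev distance: largest absolute difference on any one variable."""
    return max(abs(a - b) for a, b in zip(x, y))

p, q = (1, 2), (4, 6)  # hypothetical observations on two variables
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25
print(city_block(p, q))         # 7
print(chebychev(p, q))          # 4
```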
Illustration
Forming clusters
 Similarity
 Method
 Hierarchical
 Agglomerative / divisive
 Non-hierarchical (quick)
Number of clusters
 Theoretically specified
 Statistical stopping rule
 Measures of heterogeneity change
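One possible reading of a stopping rule based on heterogeneity change, sketched with invented numbers. The `heterogeneity` values below are hypothetical (for example, an average within-cluster distance recorded for each solution), and the rule shown is only one of several in use.

```python
# Hypothetical heterogeneity for solutions with 5, 4, 3, 2, and 1 clusters.
heterogeneity = {5: 1.0, 4: 1.2, 3: 1.5, 2: 3.0, 1: 3.6}

ks = sorted(heterogeneity, reverse=True)  # [5, 4, 3, 2, 1]

# changes[k] = increase in heterogeneity when going from k to k - 1 clusters.
changes = {
    k: heterogeneity[k_next] - heterogeneity[k]
    for k, k_next in zip(ks, ks[1:])
}

# Stop just before the merge that causes the largest jump.
suggested_k = max(changes, key=changes.get)
print(changes)
print("suggested number of clusters:", suggested_k)
```

Here the largest jump occurs when three clusters are merged into two, so the three-cluster solution is the candidate to retain.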
Example
 N=7
 2 variables (scale ranging from 0 to 10)
 Hierarchical cluster analysis with agglomerative
method
Example
 Scatterplot
Example - measure similarity
 Euclidean distance
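A sketch of the similarity step for seven observations on two variables. The coordinates below are made up for illustration and are not the data from the slide's scatterplot.

```python
from itertools import combinations
from math import dist

# Hypothetical scores (0-10 scale) for seven observations on two variables.
points = {
    "A": (1, 2), "B": (2.5, 2), "C": (2, 4), "D": (5, 6),
    "E": (6, 6), "F": (7, 8), "G": (9, 3),
}

# Euclidean distance for every pair of observations (7 * 6 / 2 = 21 pairs).
distances = {
    (a, b): dist(points[a], points[b])
    for a, b in combinations(points, 2)
}

# The most similar pair is the first candidate to be joined.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```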
Example - similarity & number of clusters
 Procedure
[Table: clustering steps; values shown: 0.778, -0.048, 0.090, 0.662, 0.524]
(|E-F| + |E-G| + |F-G|) / 3
(|E-F| + |E-G| + |F-G|) + |F-G| / 4
Example – graphing
 Graphical portrayal
Example
 Dendrogram
Standardizing the data
 Clustering variables that have scales using widely
differing numbers of scale points or that exhibit
large differences in standard deviations should
be standardized.
 Z-score
 Standardized distance (e.g., Mahalanobis distance)
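A minimal z-score sketch, assuming two hypothetical variables measured on very different scales (`income` and `satisfaction` are invented names and values). It uses the sample standard deviation.

```python
from statistics import mean, stdev

def z_scores(values):
    """Convert raw scores to z-scores using the sample SD (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical variables with widely differing scales.
income = [20000, 35000, 50000, 65000, 80000]
satisfaction = [1, 2, 3, 4, 5]

z_income = z_scores(income)
z_satisfaction = z_scores(satisfaction)
print(z_income)
print(z_satisfaction)
```

After standardization both variables have mean 0 and standard deviation 1, so neither dominates the distance calculation.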
Deriving clusters
 Hierarchical cluster analysis
 Hierarchical
 Non-hierarchical cluster analysis
 K-means
 Combination of both methods
 Two Step
HIERARCHICAL
CLUSTER ANALYSIS
Hierarchical cluster analysis (HCA)
 The stepwise procedure
 Agglomerate or divide groups step by step
 Agglomerative (SPSS selected)
 Aggregate object with object / cluster with cluster
 N clusters to 1 cluster
 Divisive
 Separate clusters into objects
 1 cluster to n clusters
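The agglomerative procedure (N clusters down to 1) can be sketched as follows. The four points are hypothetical, and the distance between clusters is taken as the shortest member-to-member distance, which is an assumption made to keep the example short.

```python
from itertools import combinations
from math import dist

def agglomerate(points):
    """Return the merge history from N one-object clusters down to 1 cluster."""
    def gap(pair):
        # Assumed rule: shortest distance between members of the two clusters.
        c1, c2 = pair
        return min(dist(points[a], points[b]) for a in c1 for b in c2)

    clusters = [[name] for name in points]  # start with N clusters
    history = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=gap)
        history.append((c1, c2, gap((c1, c2))))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(c1 + c2)  # the merged cluster replaces the pair
    return history

points = {"A": (0, 0), "B": (1, 0), "C": (5, 0), "D": (6.5, 0)}  # hypothetical
history = agglomerate(points)
for left, right, distance in history:
    print(left, "+", right, "at distance", distance)
```

With N objects the loop always performs N - 1 merges, which is the sequence a dendrogram displays.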
Dendrogram / tree graph
Agglomerative algorithms
 Single linkage
 Complete linkage
 Average linkage
 Centroid method
 Ward’s method
 Mahalanobis distance
Agglomerative algorithms
 Single linkage / nearest-neighbor method
 Defines similarity between clusters as the shortest
distance from any object in one cluster to any object
in the other.
Pics:
Retrieved from: http://ppt.cc/uKm0
Agglomerative algorithms
 Complete linkage / farthest-neighbor method
 Defines similarity between two clusters as the maximum
distance between any two members, one from each cluster.
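Single and complete linkage differ only in taking the minimum or the maximum over the same set of cross-cluster distances. A sketch with two hypothetical clusters (`left` and `right` are invented):

```python
from math import dist

def single_linkage(c1, c2):
    """Shortest distance from any object in c1 to any object in c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Longest distance from any object in c1 to any object in c2."""
    return max(dist(p, q) for p in c1 for q in c2)

left = [(0, 0), (1, 0)]   # hypothetical cluster
right = [(4, 0), (6, 0)]  # hypothetical cluster

print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))  # 6.0
```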
Agglomerative algorithms
 Centroid method
 Cluster centroids
 Are the mean values of the cluster's observations on
the clustering variables
 The distance between the two clusters equals the
distance between the two centroids
Agglomerative algorithms
 Average linkage
 The distance between two clusters is defined as the
average distance between all pairs of the two clusters’
members.
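The centroid method and average linkage can give different values for the same pair of clusters. A sketch with hypothetical clusters (function names are illustrative):

```python
from math import dist

def centroid(cluster):
    """Mean value of the cluster's observations on each variable."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def centroid_distance(c1, c2):
    """Centroid method: distance between the two cluster centroids."""
    return dist(centroid(c1), centroid(c2))

def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs of members."""
    total = sum(dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

left = [(0, 0), (0, 2)]   # hypothetical cluster
right = [(4, 0), (4, 2)]  # hypothetical cluster

print(centroid_distance(left, right))  # 4.0
print(average_linkage(left, right))
```

In this example the average-linkage distance is slightly larger than the centroid distance, because the diagonal pairs are farther apart than the centroids.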
Agglomerative algorithms
 Ward’s method
 The similarity between two clusters is the sum of
squares within the clusters summed over all variables.
 Least variance within cluster
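A sketch of the idea behind Ward's method, assuming it is implemented as "join the pair of clusters whose merger adds the least within-cluster sum of squares". The clusters `a`, `b`, `c` and the helper names are hypothetical.

```python
def within_ss(cluster):
    """Sum of squared deviations from the centroid, summed over all variables."""
    n = len(cluster)
    center = [sum(coord) / n for coord in zip(*cluster)]
    return sum(
        (x - c) ** 2
        for point in cluster
        for x, c in zip(point, center)
    )

def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    return within_ss(c1 + c2) - within_ss(c1) - within_ss(c2)

a = [(0, 0), (2, 0)]    # hypothetical clusters
b = [(3, 0), (5, 0)]
c = [(10, 0), (12, 0)]

print(ward_increase(a, b))  # 9.0
print(ward_increase(a, c))  # 100.0
```

Merging `a` with `b` adds far less within-cluster variation than merging `a` with `c`, so `a` and `b` would be joined first.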
Number of clusters
 Theoretically specified
 Statistical stopping rule
 Measures of heterogeneity change
Hierarchical cluster analysis
 The hierarchical cluster analysis provides an
excellent framework with which to compare any
set of cluster solutions.
 This method helps in judging how many clusters
should be retained or considered.
NON-HIERARCHICAL
CLUSTER ANALYSIS
Non-hierarchical cluster analysis (non-HCA)
 Non-hierarchical cluster analysis assigns objects
to clusters once the number of clusters is
specified.
 Two steps in non-HCA
 Specify cluster seeds: identify starting points
 Assignment: assign each observation to one of the
cluster seeds.
Non-hierarchical cluster analysis: algorithm
 Aims to partition n observations into k clusters in
which each observation belongs to the cluster
with the nearest mean.
 Cluster seed assignment
 Sequential (1 by 1)
 Parallel (simultaneously)
 Optimization
 K-means method
[Figure: scatterplot of cases]
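A bare-bones k-means sketch: assign every observation to the nearest cluster mean, recompute the means, and repeat until nothing changes. The data and seed points are hypothetical, and statistical packages add refinements that are not shown here.

```python
from math import dist

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and mean-update steps until the centers stop moving."""
    centers = list(seeds)
    for _ in range(max_iter):
        # Assignment: each observation goes to the cluster with the nearest mean.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update: move each center to the mean of its members (keep it if empty).
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]  # hypothetical cases
seeds = [(1, 1), (6, 5)]                                   # hypothetical seeds
centers, groups = k_means(points, seeds)
print(centers)
print(groups)
```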
Pros and Cons of HCA
 Advantage
 Comprehensive information
 A wide range of alternative clustering solutions
 Disadvantage
 Sensitive to outliers
 Impractical for large samples / large numbers of variables
Pros and Cons of non-HCA
 Advantage
 Less susceptible to outliers
 Can handle extremely large data sets
 Disadvantage
 Less information
 Sensitive to the initial seed points
Combination of both methods
 Two step
 A hierarchical technique is used to select the number
of clusters and to profile the cluster centers that serve
as initial cluster seeds in the nonhierarchical procedure.
 A nonhierarchical method then clusters all
observations using the seed points to provide more
accurate cluster memberships.
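A sketch of the combined approach, assuming the hierarchical step has already produced cluster labels (the `hier_labels` mapping below is invented). Cluster centroids become the seeds, and a single nonhierarchical pass reassigns each observation to its nearest seed. A full procedure would normally iterate further.

```python
from math import dist

# Hypothetical output of a hierarchical run: observation -> cluster label.
hier_labels = {
    (1, 1): 0, (1, 2): 0, (2, 1): 0,
    (4, 4): 1, (8, 8): 1, (8, 9): 1, (9, 8): 1,
}

def centroids(labels):
    """Centroid of each cluster, used as the initial seed points."""
    members = {}
    for point, label in labels.items():
        members.setdefault(label, []).append(point)
    return {
        label: tuple(sum(c) / len(pts) for c in zip(*pts))
        for label, pts in members.items()
    }

# Step 1: seeds taken from the hierarchical solution.
seeds = centroids(hier_labels)

# Step 2: nonhierarchical pass, each observation goes to its nearest seed.
final_labels = {
    p: min(seeds, key=lambda label: dist(p, seeds[label]))
    for p in hier_labels
}
print(final_labels)
```

In this example the observation (4, 4) moves from cluster 1 to cluster 0 because it lies closer to that seed.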
Interpretation of clusters
 Mean profile of cluster
 Name the clusters
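A mean profile is simply the average of each clustering variable within each cluster. The rows below are hypothetical standardized scores on two unnamed variables.

```python
from statistics import mean

# Hypothetical rows: (cluster label, standardized score on x1, on x2).
rows = [(1, 0.9, -0.8), (1, 1.1, -1.0), (2, -1.0, 0.7), (2, -0.8, 1.1)]

def mean_profiles(rows):
    """Mean of each clustering variable within each cluster."""
    by_cluster = {}
    for label, *scores in rows:
        by_cluster.setdefault(label, []).append(scores)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in by_cluster.items()
    }

profiles = mean_profiles(rows)
for label, profile in profiles.items():
    print("cluster", label, "mean profile:", profile)
```

Cluster 1 is high on x1 and low on x2, and cluster 2 shows the reverse. Contrasts of this kind are what cluster names are usually based on.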
Validation of clusters
 Cross validation
 Two sub-samples
 Confirmatory
 Discriminant analysis
 Predictive validity
 Differences on variables
 Profile analysis
 (Multivariate) analysis of variance, (M)ANOVA
Assumptions of cluster analysis
 Inferential statistics?
 Representativeness
 Multicollinearity
 Factor analysis
 Cluster analysis
Comparison with other multivariate analyses
 Cluster analysis (CA) vs. Factor analysis (FA)
 CA: grouping cases based on distance (proximity)
 FA: grouping variables based on patterns of
variation (correlation)
 Cluster analysis vs. Discriminant analysis (DA)
 CA: group is NOT given (exploratory)
 DA: group is given (confirmatory)
Summary
 Research question
 Assumption confirmation
 Multicollinearity
 Cluster analysis
 Selecting clustering variables
 Conducting analysis
 Interpreting clusters
 Validating clusters
 Main analysis
 It is just a beginning…
Practice
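 According to evidence from related empirical studies, a range of
coaching behaviors, including autonomy support, intimidation,
excessive control, conditional regard, and controlling use of rewards,
are important factors affecting athletes. A graduate student, 戴平台
(Dai Ping-Tai), wants to understand how coaches influence athletes'
cognitive, affective, and behavioral outcomes. Dai believes that
coaches can be classified into different types according to the
combinations of coaching behaviors that athletes perceive, and that
examining athletes' cognitive, affective, and behavioral responses
under each type of coach should give a better understanding of the
influence coaches have on athletes within sports teams. Based on the
coaching behaviors perceived by athletes, use cluster analysis to help
Dai classify the perceived coaches into different types.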

Cluster analysis