Cluster analysis

CLUSTER ANALYSIS
Wei-Jiun, Shen Ph. D.

When you were still young and naïve…
 Classification

You may classify them by
 Shape

 Color

 Sum of internal angles

Classification
 Shape
 Color
 Sum of internal angles
 Similarity of characteristics

Purpose of cluster analysis
 Grouping objects based on the similarity of
characteristics they possess.
 Homogeneity
 Heterogeneity
 Geometrically, the objects within clusters will be
close together, while the distance between
clusters will be farther apart.

Major role that cluster analysis can play
 Data reduction
 Classify a large number of observations into
manageable groups
 Taxonomy description
 Exploratory
 Confirmatory
 Examining the influence of cluster on dependent
variables
 Whether different motivational constructs are
differentially associated with effort and enjoyment

How does cluster analysis work?
 The primary objective of cluster analysis is to
define the structure of the data by placing the
most similar observations into groups.
 What clustering variables can be used?
 How do we measure similarity?
 How do we form clusters?
 How many clusters do we form?

Selecting clustering variables
 Statistically,
 Any quantitative variable
 Theoretically, conceptually, practically,
 Theoretical foundation corresponding to the research question

Measuring similarity
 Similarity
 The degree of correspondence among objects across
all of the characteristics.
 Correlational measures
 Distance measures

Similarity measure
 Correlation measure
 Grouping cases based on response pattern
 Distance measure
 Grouping cases based on distance
[Figure: profiles of case1, case2, and case3 across variables X1 to X4; vertical axis from 0 to 5]
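
The contrast between the two kinds of similarity measure can be sketched with a small example. The scores below are hypothetical (they are not the values plotted on the slide), and the helper names `pearson` and `euclidean` are introduced only for illustration. The second case has the same pattern as the first at a higher level, while the third case is close in level but has the opposite pattern.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def euclidean(a, b):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical scores on X1..X4 (assumed for illustration only).
case1 = [1, 2, 1, 2]
case2 = [4, 5, 4, 5]  # same pattern as case1, shifted up by 3
case3 = [2, 1, 2, 1]  # similar level to case1, opposite pattern

print(pearson(case1, case2), euclidean(case1, case2))
print(pearson(case1, case3), euclidean(case1, case3))
```

A correlation measure would group case1 with case2, whereas a distance measure would group case1 with case3.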

Distance measures
 Euclidean distance
 Squared Euclidean distance
 City-block (Manhattan) distance
 Chebychev distance
 Mahalanobis D² (standardization / variance-covariance)
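
As a minimal sketch, the first four distance measures can be written in a few lines. The function names and the two sample points are assumptions made for illustration. Mahalanobis D² is left out because it also needs the variance-covariance matrix of the data.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Summed squared differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def city_block(a, b):
    """Summed absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

# Two hypothetical objects measured on two variables.
p, q = (1, 2), (4, 6)
print(euclidean(p, q), squared_euclidean(p, q), city_block(p, q), chebyshev(p, q))
```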

Forming clusters
 Similarity
 Method
 Hierarchical
 Agglomerative / divisive
 non-hierarchical (quick)

Number of clusters
 Theoretically specified
 Statistical stopping rule
 Measures of heterogeneity change
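
One possible stopping rule based on heterogeneity change is sketched below. The heterogeneity values are made up for illustration, and `percent_increase` is a hypothetical helper. The idea is that a large jump in heterogeneity when going from k to k-1 clusters suggests that two dissimilar clusters were merged, so the k-cluster solution is a candidate to retain.

```python
# Hypothetical heterogeneity (e.g. average within-cluster distance) for the
# solutions with 6, 5, 4, 3, 2, and 1 clusters. Illustrative numbers only.
heterogeneity = {6: 1.0, 5: 1.2, 4: 1.5, 3: 1.8, 2: 3.6, 1: 5.0}

def percent_increase(h):
    """Percent change in heterogeneity caused by merging from k to k-1 clusters."""
    changes = {}
    for k in sorted(h, reverse=True):
        if k - 1 in h:
            changes[k] = 100.0 * (h[k - 1] - h[k]) / h[k]
    return changes

changes = percent_increase(heterogeneity)
# The solution just before the largest jump is a candidate to keep.
suggested_k = max(changes, key=changes.get)
print(changes)
print(suggested_k)
```

This is only a heuristic; the final choice should still be checked against theory and interpretability.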

Example
 N=7
 2 variables (scale ranges from 0 to 10)
 Hierarchical cluster analysis with agglomerative
method

Example - measure similarity
 Euclidean distance

Example - similarity & number of clusters
 Procedure
[Table residue: 0.778, -0.048, 0.090, 0.662, 0.524]
(|E-F| + |E-G| + |F-G|) / 3
(|E-F| + |E-G| + |F-G|) + |F-G| / 4

Example – graphing
 Graphical portrayal

Standardizing the data
 Clustering variables that have scales using widely
differing numbers of scale points or that exhibit
large differences in standard deviations should
be standardized.
 Z-score
 Standardized distance (e.g., Mahalanobis distance)
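
A minimal sketch of z-score standardization follows. The two variables are hypothetical and were chosen to have very different scales; `z_scores` is a name introduced here, and the sample standard deviation (n - 1) is assumed.

```python
import math

def z_scores(values):
    """Convert raw scores to z-scores using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical clustering variables on very different scales.
income = [20000, 30000, 40000, 50000, 60000]
rating = [1, 2, 3, 4, 5]

z_income = z_scores(income)
z_rating = z_scores(rating)
print(z_income)
print(z_rating)
```

After standardization both variables contribute equally to a distance measure, instead of the one with the larger raw scale dominating.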

Deriving clusters
 Hierarchical cluster analysis
 Hierarchical
 Non-hierarchical cluster analysis
 K-means
 Combination of both methods
 Two Step

Hierarchical cluster analysis (HCA)
 The stepwise procedure
 Agglomerate or divide group step by step
 Agglomerative (SPSS selected)
 Aggregate object with object / cluster with cluster
 N clusters to 1 cluster
 Divisive
 Separate cluster to object
 1 cluster to n clusters
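
The agglomerative procedure (from N clusters down to 1) can be sketched as a simple loop. This is an illustrative, unoptimized version that assumes Euclidean distance and a nearest-member rule for comparing clusters; the five labeled points are hypothetical, and it is not meant to reproduce what SPSS does internally.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_member_distance(c1, c2, points):
    """Distance between two clusters: the closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[name] for name in points]  # start: every object is a cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = nearest_member_distance(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        # Record the merged cluster, the merge distance, and clusters remaining.
        history.append((merged, d, len(clusters) - 1))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical objects measured on two variables.
points = {"A": (1, 1), "B": (2, 1), "C": (6, 5), "D": (8, 5), "E": (7, 8)}
history = agglomerate(points)
for merged, distance, remaining in history:
    print(remaining, merged, round(distance, 3))
```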

Agglomerative algorithms
 Single linkage
 Complete linkage
 Average linkage
 Centroid method
 Ward’s method
 Mahalanobis distance

 Single linkage / nearest-neighbor method
 Defines similarity between clusters as the shortest
distance from any object in one cluster to any object
in the other.
Pictures retrieved from: http://ppt.cc/uKm0

 Complete linkage / farthest-neighbor method
 Defines similarity between clusters as the maximum
distance from any object in one cluster to any object
in the other.

 Centroid method
 Cluster centroids
 Are the mean values of the cluster's observations on
each clustering variable
 The distance between the two clusters equals the
distance between the two centroids

 Average linkage
 The distance between two clusters is defined as the
average distance between all pairs of the two clusters’
members.
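
The single, complete, centroid, and average linkage definitions differ only in how they summarize the distances between two clusters. The sketch below assumes Euclidean distance, and the two small clusters are hypothetical.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Longest distance between any member of c1 and any member of c2."""
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Mean distance over all between-cluster pairs of members."""
    return sum(euclidean(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

def centroid(cluster):
    """Mean of the members on each variable."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))

# Two hypothetical clusters with two members each.
left = [(0, 0), (0, 2)]
right = [(3, 0), (3, 2)]
print(single_linkage(left, right), complete_linkage(left, right))
print(average_linkage(left, right), centroid_distance(left, right))
```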

 Ward’s method
 At each step, merges the two clusters whose combination
gives the smallest increase in the within-cluster sum
of squares, summed over all variables.
 Least variance within cluster
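
The quantity behind Ward's method can be sketched directly from the within-cluster sum of squares. The helper names `sse` and `ward_cost` and the three small clusters are assumptions for illustration; the merge cost is computed as the increase in total within-cluster sum of squares that a merge would cause.

```python
def sse(cluster):
    """Sum of squared deviations from the cluster centroid, over all variables."""
    n = len(cluster)
    center = [sum(col) / n for col in zip(*cluster)]
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, center))

def ward_cost(c1, c2):
    """Increase in total within-cluster sum of squares if c1 and c2 are merged."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

# Three hypothetical clusters of two objects each.
a = [(0, 0), (0, 2)]
b = [(1, 0), (1, 2)]
c = [(6, 0), (6, 2)]

print(ward_cost(a, b))  # cost of merging two nearby clusters
print(ward_cost(a, c))  # cost of merging two distant clusters
```

Under this criterion, the pair with the smaller cost (here a and b) would be merged first.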

Hierarchical cluster analysis
 The hierarchical cluster analysis provides an
excellent framework with which to compare any
set of cluster solutions.
 This method helps in judging how many clusters
should be retained or considered.

NON-HIERARCHICAL
CLUSTER ANALYSIS

Non-hierarchical cluster analysis (non-HCA)
 Non-hierarchical cluster analysis assigns objects
to clusters once the number of clusters is
specified.
 Two steps in non-HCA
 Specify cluster seeds: identify starting points
 Assignment: assign each observation to one of the
cluster seeds.

Non-hierarchical cluster analysis-algorithm
 Aims to partition n observations into k clusters in
which each observation belongs to the cluster
with the nearest mean.
 Cluster seed assignment
 Sequential (1 by 1)
 Parallel (simultaneously)
 Optimization
 K-means method
[Figure: scatterplot of cases; horizontal axis from 0 to 5, vertical axis from 0 to 6]
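
A minimal sketch of the k-means idea (assign each observation to the nearest mean, then recompute the means) follows. The data points and the two seed points are hypothetical, the seeds are simply supplied by hand, and Euclidean distance is assumed.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and mean update until memberships stop changing."""
    centers = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each observation joins the cluster with the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = mean(members)
    return assignment, centers

# Hypothetical observations and seed points.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5)]
seeds = [(1, 1), (5, 5)]
assignment, centers = k_means(points, seeds)
print(assignment)
print(centers)
```

Different seeds can lead to different solutions, so running the procedure from several starting points is a common precaution.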

Pros and Cons of HCA
 Advantage
 Comprehensive information
 A wide range of alternative clustering solutions
 Disadvantage
 Outliers
 Large samples / large numbers of variables

Pros and Cons of non-HCA
 Advantage
 Less susceptible to outliers
 Extremely large data sets
 Disadvantage
 Less information
 Sensitive to the initial seed points

Combination of each method
 Two step
 A hierarchical technique is used to select the number
of clusters and to profile the cluster centers that serve
as initial cluster seeds in the nonhierarchical procedure.
 A nonhierarchical method then clusters all
observations using the seed points to provide more
accurate cluster memberships.
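
The two-stage idea can be sketched as follows. This is a simplified illustration, not the SPSS TwoStep procedure: stage 1 assumes the centroid method and takes the number of clusters k as given rather than selecting it, stage 2 is a plain k-means style reassignment, and the data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def hierarchical_seeds(points, k):
    """Stage 1: merge the clusters with the closest centroids until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: euclidean(mean(clusters[ij[0]]), mean(clusters[ij[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]  # cluster centers used as seeds

def refine(points, seeds, max_iter=100):
    """Stage 2: nonhierarchical reassignment starting from the stage 1 seeds."""
    centers = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda c: euclidean(p, centers[c])) for p in points]
        if new == assignment:
            break
        assignment = new
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return assignment, centers

# Hypothetical observations on two variables.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
seeds = hierarchical_seeds(points, k=2)
assignment, centers = refine(points, seeds)
print(seeds)
print(assignment)
```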

Interpretation of clusters
 Mean profile of cluster
 Name the clusters
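
A mean profile is simply the mean of each clustering variable within each cluster. The sketch below uses hypothetical variable names, scores, and cluster labels; `cluster_profiles` is a helper introduced only for illustration.

```python
def cluster_profiles(rows, labels, variables):
    """Mean of each clustering variable within each cluster."""
    profiles = {}
    for label in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == label]
        profiles[label] = {
            name: sum(m[i] for m in members) / len(members)
            for i, name in enumerate(variables)
        }
    return profiles

# Hypothetical data: two clustering variables and a cluster label per case.
variables = ["autonomy_support", "control"]
rows = [(6, 2), (5, 1), (2, 6), (1, 5)]
labels = [1, 1, 2, 2]

profiles = cluster_profiles(rows, labels, variables)
for label, profile in profiles.items():
    print(label, profile)
```

With profiles like these, cluster 1 might be named for its high first variable and cluster 2 for its high second variable.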

Validation of clusters
 Cross validation
 Two sub-samples
 Confirmatory
 Discriminant analysis
 Predictive validity
 Differences on variables
 Profile analysis
 (M)Analysis of variance
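
As one illustration of checking differences on variables, a one-way ANOVA F statistic can be computed for a variable that was not used to form the clusters. The scores below are hypothetical, and `one_way_f` is a hand-written helper that returns only the F ratio; a statistics package would also be needed to obtain a p-value.

```python
def one_way_f(groups):
    """F ratio: between-cluster mean square over within-cluster mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical scores on an outcome variable, split by cluster membership.
cluster_1 = [2, 3, 4]
cluster_2 = [6, 7, 8]

f_ratio = one_way_f([cluster_1, cluster_2])
print(f_ratio)
```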

Assumptions of cluster analysis
 Inferential statistics?
 Representativeness
 Multicollinearity
 Factor analysis
 Cluster analysis

Compare to other multivariate analyses
 Cluster analysis (CA) vs. Factor analysis (FA)
 CA: grouping cases based on distance (proximity)
 FA: grouping variables based on patterns of
variation (correlation)
 Cluster analysis vs. Discriminant analysis (DA)
 CA: group is NOT given (exploratory)
 DA: group is given (confirmatory)

Summary
 Research question
 Assumption confirmation
 Multicollinearity
 Cluster analysis
 Selecting clustering variables
 Conducting analysis
 Interpreting clusters
 Validating clusters
 Main analysis
 It is just a beginning…

Practice
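 Empirical evidence indicates that many coaching
behaviors, such as autonomy support, intimidation,
excessive control, conditional regard, and controlling
use of rewards, are important factors affecting athletes.
Graduate student 戴平台 wants to know how coaches
influence athletes' cognitive, affective, and behavioral
outcomes. 戴平台 believes that coaches can be divided
into types according to the combinations of coaching
behaviors that athletes perceive, and that examining
athletes' cognitive, affective, and behavioral responses
under each type of coach should give a better
understanding of the influence coaches have on athletes
in a sports team. Based on the coaching behaviors
perceived by athletes, use cluster analysis to help
戴平台 classify the perceived coaches into types.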
