PROGRAM: CLUSTER ANALYSIS WITH R
MODULE: K-MEANS AND HIERARCHICAL CLUSTERING
BY SAMMYA SENGUPTA
Cluster Analysis Workshop
WHAT IS CLUSTER ANALYSIS
 Cluster analysis is a class of techniques
used to classify objects or cases into
relatively homogeneous groups called
clusters.
 Objects of each cluster tend to be similar to
each other and dissimilar to objects in other
clusters.
 Cluster analysis is also called classification
analysis or numerical taxonomy.
HOW CLUSTERS ARE FORMED IN THE K-MEANS
METHOD
USAGE OF CLUSTER ANALYSIS IN BUSINESS
CONTEXT
 Segmenting the market
 Understanding buyer behavior
 Identifying new product development opportunities
 Selecting test markets
CLUSTER ANALYSIS AS A MULTIVARIATE
TECHNIQUE
 Cluster analysis classifies objects (e.g.,
respondents, products, etc.) so that each
object is similar to the others in its cluster
based on a set of selected characteristics.
 The resulting clusters of objects should
exhibit high internal (within-cluster)
homogeneity and high external (between-
cluster) heterogeneity.
SELECTION OF CLUSTERING VARIABLES
 Conceptual Consideration
 Practical Consideration
NECESSITY OF CONCEPTUAL SUPPORT IN
CLUSTER ANALYSIS (A WORD OF CAUTION)
 Cluster analysis is descriptive, atheoretical
and non-inferential
 Cluster analysis will always create clusters,
regardless of the actual existence of any
structure in the data
 The cluster solution is not generalizable
because it is totally dependent upon the
variables used as the basis for the similarity
measure
RESEARCH DESIGN OF CLUSTER ANALYSIS
 Is the sample size adequate?
 Can outliers be detected and, if so, should they be
deleted?
 How should object similarity be measured?
 Should the data be standardized?
MEASURING SIMILARITY
 Correlation measure
 Distance measure
DISTANCE MEASURE
 Euclidean distance
 Squared Euclidean distance
 City-block (Manhattan) distance
 Chebyshev distance
 Mahalanobis distance
EUCLIDEAN DISTANCE
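For two objects x and y measured on p variables, the Euclidean distance is the square root of the sum of squared differences across the variables: d(x, y) = sqrt((x1 - y1)^2 + ... + (xp - yp)^2). The sketch below is a minimal Python illustration of this and three of the other measures listed above. The function names and the two sample points are made up for the example and are not part of the workshop material. In R, the built-in dist() function covers the same measures.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    """Largest absolute difference on any single variable."""
    return max(abs(a - b) for a, b in zip(x, y))

# Two illustrative respondents measured on two variables (made-up values).
a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean(a, b))          # 5.0
print(squared_euclidean(a, b))  # 25.0
print(city_block(a, b))         # 7.0
print(chebyshev(a, b))          # 4.0
```

Mahalanobis distance is left out of the sketch because it also needs the covariance matrix of the variables.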
WHY DATA STANDARDIZATION IS REQUIRED
 After standardization, each variable has a mean of 0
and a standard deviation (S.D.) of 1
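Standardization matters because distance measures are sensitive to scale: a variable recorded in large units (income, say) would otherwise dominate one recorded on a 1 to 5 rating scale. The sketch below shows the usual z-score transformation in Python, as an illustration only. The function name and the income figures are made up, and the sample standard deviation (n - 1 denominator) is an assumption, chosen because it matches the default behaviour of R's scale().

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample S.D. (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Made-up incomes, for illustration only.
incomes = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
z = standardize(incomes)

# The standardized variable has mean 0 and S.D. 1 (up to rounding error).
print(abs(statistics.mean(z)) < 1e-9)
print(abs(statistics.stdev(z) - 1.0) < 1e-9)
```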
ASSUMPTIONS IN CLUSTER ANALYSIS
 Impact of Multicollinearity
DERIVING CLUSTERS AND ASSESSING OVERALL
MODEL FIT
 Select the partitioning procedure used for
forming clusters
 Make decisions on the number of clusters to
be formed
CLUSTER PARTITIONING PROCEDURE
HIERARCHICAL CLUSTERING
 A hierarchical procedure involves a series of
n - 1 clustering decisions (where n is the number
of observations) that combine the observations into
a hierarchy or tree-like structure.
 The two basic types of hierarchical clustering
procedure are agglomerative and divisive.
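The "n - 1 clustering decisions" can be seen directly in a small agglomerative example: every observation starts as its own cluster, and each step merges the two closest clusters until one remains. The Python sketch below assumes single linkage (distance between the closest pair of members) purely for simplicity. The deck does not say which linkage rule the workshop used, and the function name and data points are made up. In R the equivalent call would be hclust() on a dist() object.

```python
import math

def single_linkage(points):
    """Agglomerative clustering: start with n singletons, merge n - 1 times."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Five made-up observations on two variables.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = single_linkage(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 2))
print(len(merges))  # n - 1 = 4
```

The recorded merge sequence is what a dendrogram draws: cutting it at a chosen distance gives the cluster solution.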
K-MEANS CLUSTERING
 In contrast to hierarchical methods, non-
hierarchical procedures do not involve a tree-
like construction process. Instead, they
assign objects to clusters once the number
of clusters to be formed is specified.
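As a rough illustration of how objects are assigned once the number of clusters is fixed, the Python sketch below implements the basic k-means loop: assign each object to its nearest centroid, recompute the centroids, and repeat until the assignments stop changing. It is a minimal sketch rather than the workshop's own procedure. The function name, the starting seeds and the data are made up, and the seeds are passed in by hand (the "researcher specified" case). In R the built-in kmeans() function does this job.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Basic k-means: assign to nearest centroid, update centroids, repeat."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no object changed cluster, so the solution is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Six made-up observations and two researcher-specified seeds.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Different seeds can lead to different final clusters, which is why the choice of seed points gets its own step.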
SPECIFY CLUSTER SEEDS
 Selecting seed points
- Researcher Specified
- Sample Generated
K-MEANS CLUSTERING ALGORITHM
 Sequential threshold
 Parallel threshold
 Optimization
NUMBER OF CLUSTERS TO BE FORMED
 Measures of Heterogeneity change
 Percentage change in heterogeneity
 Measures of Variance change (RMSSTD)
 Statistical Measure of heterogeneity change
 Direct Measures of Heterogeneity
N.B.: A practical rule of thumb is that a solution with
4 to 6 clusters usually fits well.
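The "percentage change in heterogeneity" rule can be made concrete with a few lines of arithmetic: compute a heterogeneity measure for each candidate number of clusters, then look for the step at which merging down to one fewer cluster causes an unusually large jump. The Python sketch below assumes within-cluster sum of squares as the heterogeneity measure. The figures are invented for illustration and the function name is hypothetical.

```python
def percentage_changes(heterogeneity):
    """Percentage increase in heterogeneity when going from k to k - 1 clusters.

    heterogeneity maps a number of clusters to its heterogeneity measure.
    """
    changes = {}
    for k in sorted(heterogeneity, reverse=True):
        if k - 1 in heterogeneity:
            before, after = heterogeneity[k], heterogeneity[k - 1]
            changes[(k, k - 1)] = 100.0 * (after - before) / before
    return changes

# Made-up within-cluster sums of squares, for illustration only.
wss = {6: 40.0, 5: 44.0, 4: 50.0, 3: 80.0, 2: 120.0}
changes = percentage_changes(wss)
for step, pct in changes.items():
    print(step, round(pct, 1))

# The largest jump marks the merge that joins the most dissimilar clusters.
largest = max(changes, key=changes.get)
print(largest)  # (4, 3)
```

With these numbers the jump from 4 to 3 clusters is the largest, which would argue for stopping at the 4-cluster solution.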
SHOULD A HIERARCHICAL OR NON-HIERARCHICAL
PROCEDURE BE USED
 Hierarchical clustering is poorly suited to
large datasets because it needs the distance
between every pair of observations, so the
number of calculations grows rapidly with
sample size. K-means clustering is therefore
the more commonly used method for large
datasets.
CLUSTER PROFILING (K-MEANS)
 The cluster solution is profiled against the
variables to identify and describe the
character of each cluster.
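In practice, profiling usually means comparing the mean of each variable across clusters, together with each cluster's share of the sample. The Python sketch below shows one way to do that, under assumptions: the variable names, ratings and cluster labels are all made up, and the function name is hypothetical. In R, aggregate() on the data with the cluster labels as the grouping variable gives the same kind of table.

```python
from collections import defaultdict

def profile(rows, labels, variables):
    """Mean of each variable within each cluster, plus the cluster's share."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    result = {}
    for label, members in sorted(groups.items()):
        means = {v: sum(m[i] for m in members) / len(members)
                 for i, v in enumerate(variables)}
        result[label] = {"share": len(members) / len(rows), "means": means}
    return result

# Hypothetical 1-5 ratings for five respondents and their cluster labels.
variables = ["value_seeking", "discount_seeking"]
rows = [(5, 1), (4, 2), (1, 5), (2, 4), (1, 4)]
labels = [0, 0, 1, 1, 1]

profiles = profile(rows, labels, variables)
for label, info in profiles.items():
    print(label, info["share"], info["means"])
```

A cluster with a high mean on one variable and a low mean on the other can then be given a descriptive name, as in the segmentation that follows.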
SEGMENTATION
 Extremely value seeker: 11%
 Extremely discount minded: 12%
 Discount oriented but also value seeker: 11%
 Value seeker but don't let discount: 16%
 Mixed orientation: 35%
 Value seeker: 15%
THANK YOU……
