The document describes a methodology for clustering grocery stores of a retailer in Karnataka and Tamil Nadu based on sales data. Exploratory data analysis was conducted on the sales and store size data, which found that category 1 sales were highest on average. The data was then prepared for clustering by standardizing percentage sales variables, and weighting the average sales per square foot variable through multiple iterations. Preliminary K-means clustering was performed using PROC FASTCLUS to create clusters and identify outliers.
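The scaling and weighting steps described above can be illustrated with a short sketch. The document itself uses SAS; this is an analogous Python version, and the variable names, sample values, and weight are hypothetical:

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score standardization: zero mean, unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical store-level variables: percentage of sales in category 1,
# and average sales per square foot.
pct_cat1_sales = [42.0, 55.0, 38.0, 61.0, 47.0]
sales_per_sqft = [120.0, 310.0, 95.0, 280.0, 150.0]

z_pct = standardize(pct_cat1_sales)

# Weighting: multiply the standardized variable by a chosen weight so it
# contributes more (or less) to the distance calculations in clustering.
# The weight itself would be tuned over multiple iterations, as in the text.
weight = 2.0
z_sqft_weighted = [weight * z for z in standardize(sales_per_sqft)]
```

After standardization each variable has mean 0 and standard deviation 1; the weighted variable then has standard deviation equal to its weight, which is what makes it count proportionally more in Euclidean distances.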
This document presents a chain sampling plan for truncated life tests when product lifetime follows a log-logistic distribution. It provides the minimum sample size needed to ensure a specified acceptance probability while satisfying producer and consumer risks, for various quality levels. Tables 1 and 2 show the minimum sample sizes and operating characteristic functions for the proposed sampling plan for different confidence levels, acceptance numbers, and ratios of test time to scale parameter. For example, a sample size of 10 is required for a confidence level of 0.99, acceptance number of 2, and time-to-scale ratio of 0.942.
Multi Task DPP for Basket Completion, by Romain WARLOP, Fifty Five (recsysfr)
Determinantal point processes (DPPs) have received significant attention in recent years as an elegant model for a variety of machine learning tasks, owing to their ability to capture both set diversity and item quality or popularity. Recent work has shown that DPPs can be effective models for product recommendation and basket completion tasks. We present an enhanced DPP model that is specialized for the task of basket completion, the multi-task DPP. We view the basket completion problem as a multi-class classification problem, and leverage ideas from tensor factorization and multi-class classification to design the multi-task DPP model. We evaluate our model on several real-world datasets, and find that the multi-task DPP provides significantly better predictive quality than a number of state-of-the-art models.
This document discusses different approaches to multivariate data analysis and clustering, including nearest neighbor methods, hierarchical clustering, and k-means clustering. It provides examples of using Ward's method, average linkage, and k-means clustering on poverty data to identify potential clusters of countries based on variables like birth rate, death rate, and infant mortality rate. Key lessons are that different linkage methods, distance measures, and data normalizations should be tested and that higher-dimensional data may require different variable spaces or transformations to identify meaningful clusters.
The document provides an overview of topics to be covered in a data analysis course, including cluster analysis and decision trees. The course will cover descriptive statistics, probability distributions, correlation, regression, hypothesis testing, clustering methods like k-means, and decision tree techniques like CHAID. Clustering involves grouping similar objects together to identify clusters that are internally homogeneous and distinct from one another. Applications of clustering include market segmentation, credit risk analysis, and operations. The document gives an example of clustering students based on their exam scores.
The document discusses RFM customer segmentation, which segments customers based on their Recency (time since last activity), Frequency (number of activities), and Monetary (total monetary value) metrics. These metrics can be calculated from transaction or engagement data and used to group customers into segments. The segments are identified by analyzing the distribution of the RFM metrics and identifying cut-off points. RFM segmentation allows identifying the most valuable customers and those at high risk of churn for targeted marketing campaigns.
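The Recency, Frequency, and Monetary metrics described above can be computed directly from a transaction log. A minimal Python sketch with made-up customers and dates:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, transaction_date, amount)
transactions = [
    ("C1", date(2023, 1, 5), 50.0),
    ("C1", date(2023, 3, 20), 75.0),
    ("C2", date(2022, 11, 2), 20.0),
    ("C2", date(2023, 2, 14), 30.0),
    ("C2", date(2023, 3, 28), 25.0),
]
today = date(2023, 4, 1)

rfm = {}
for cust, when, amount in transactions:
    rec = rfm.setdefault(cust, {"recency": None, "frequency": 0, "monetary": 0.0})
    days_ago = (today - when).days
    # Recency: days since the most recent transaction (smaller = more recent).
    if rec["recency"] is None or days_ago < rec["recency"]:
        rec["recency"] = days_ago
    rec["frequency"] += 1          # Frequency: number of transactions
    rec["monetary"] += amount      # Monetary: total spend
```

Once each customer has an (R, F, M) triple, cut-off points over the three distributions (for example, terciles or quintiles) assign customers to segments such as "recent, frequent, high-spend".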
The document provides an overview of cluster analysis techniques. It discusses the need for segmentation to group large populations into meaningful subsets. Common clustering algorithms like k-means are introduced, which assign data points to clusters based on similarity. The document also covers calculating distances between observations, defining the distance between clusters, and interpreting the results of clustering analysis. Real-world applications of segmentation and clustering are mentioned such as market research, credit risk analysis, and operations management.
This document provides an overview of cluster analysis techniques. It defines cluster analysis as classifying cases into homogeneous groups based on a set of variables. The document then discusses how cluster analysis can be used in marketing research for market segmentation, understanding consumer behaviors, and identifying new product opportunities. It outlines the typical steps to conduct a cluster analysis, including selecting a distance measure and clustering algorithm, determining the number of clusters, and validating the analysis. Specific clustering methods like hierarchical, k-means, and deciding the number of clusters using the elbow rule are explained. The document concludes with an example of conducting a cluster analysis in SPSS.
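The elbow rule mentioned above picks the number of clusters k at the point where the within-cluster sum of squares (WCSS) stops dropping sharply. A minimal Python illustration on one-dimensional data (the data and the choice of initial centers are made up):

```python
def kmeans_1d(xs, centers, iters=20):
    """Basic Lloyd's algorithm on 1-D data; returns final centers and WCSS."""
    centers = list(centers)
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = {i: [] for i in range(len(centers))}
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            groups[nearest].append(x)
        # Recompute each center as the mean of its assigned points.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in sorted(groups.items())]
    wcss = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, wcss

# Two well-separated groups: WCSS falls steeply from k=1 to k=2,
# then levels off -- the "elbow" is at k=2.
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
wcss = {k: kmeans_1d(data, data[:k])[1] for k in (1, 2, 3)}
```

Plotting WCSS against k makes the elbow visible: the large drop between k=1 and k=2, followed by a nearly flat segment, suggests two clusters.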
The document proposes an improved k-means clustering algorithm to address some limitations of the traditional k-means method. The improved algorithm handles mixed categorical and numeric data by converting categorical attributes to numeric values. It determines initial cluster centers using hierarchical clustering and chooses the optimal number of clusters k based on two new coefficients α and β. An analysis of patient record data from a healthcare database demonstrates that the improved k-means algorithm can identify an appropriate number of clusters while dealing with issues like mixed data types.
This document summarizes a research paper that proposes a method for mining association rules from geographical points of interest data. It describes experiments conducted on point of interest data from Luoyang, China. The experiments involved (1) generating transactional data by spatially clustering the points of interest and converting each cluster to a transaction, (2) applying a novel FP-Growth algorithm called FP-GCID to generate frequent itemsets from the transaction data, and (3) ranking the association rules by mean product of probabilities to identify interesting rules. The top rules showed relationships between types of points of interest that should be considered together for deployment, such as banks and entertainment being related to catering establishments.
Practical Data Science: Data Modelling and Presentation, by HariniMS1
This document summarizes a student assignment to predict red wine quality using classification models. It describes using the wine quality dataset from UCI, preprocessing the data, exploring it visually, and training KNN and decision tree classifiers to predict wine quality. Evaluation shows the decision tree model achieved slightly higher accuracy than KNN, particularly when standard scaling was applied during modeling.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
SPSS Step-by-Step Tutorial and Statistical Guides, by Statswork
This document provides an overview of cluster analysis techniques used in marketing research. It defines cluster analysis as classifying cases into homogeneous groups based on a set of variables. Cluster analysis can be used for market segmentation, understanding buyer behaviors, and identifying new product opportunities in marketing research. The document outlines the steps to conduct cluster analysis, including selecting a distance measure and clustering algorithm, determining the number of clusters, and validating the analysis. It provides examples of hierarchical and non-hierarchical clustering methods like k-means and discusses choosing between these approaches. SPSS is used to demonstrate a cluster analysis example analyzing supermarket customer data.
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
The team evaluated various machine learning classifiers on the MNIST handwritten digits dataset. They found that preprocessing like de-skewing improved classifier accuracy. Dimensionality reduction using PCA captured most variance with around 50 components. Linear classifiers achieved around 85% accuracy, while KNN and neural networks performed best at 97% accuracy. Deskewing helped reduce confusion between certain digits for all classifiers.
The document provides an overview of different machine learning algorithms used to predict house sale prices in King County, Washington using a dataset of over 21,000 house sales. Linear regression, neural networks, random forest, support vector machines, and Gaussian mixture models were applied. Neural networks with 100 hidden neurons performed best with an R-squared of 0.9142 and RMSE of 0.0015. Random forest had an R-squared of 0.825. Support vector machines achieved 73% accuracy. Gaussian mixture modeling clustered homes into three groups and achieved 49% accuracy.
Detection of Fraudulent Behavior in Water Consumption Using a Data Mining Base..., by pandavaTirumala
This document discusses detecting fraudulent water consumption behavior using data mining models. It proposes using decision tree and Bayesian classification techniques to analyze customer usage data and identify abnormal patterns indicative of fraud. The existing system causes losses from non-technical water losses. The proposed system focuses on applying decision trees and Bayesian classifiers to historical meter data to create a model for identifying suspicious fraudulent customers based on their water usage patterns.
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ..., by Fabricio de França
The document proposes an artificial immune network called dopt-aiNet for solving multimodal optimization problems in dynamic environments. dopt-aiNet is inspired by the immune system and uses clonal selection, mutation, and suppression techniques to maintain diversity and track moving optima. Numerical experiments show that dopt-aiNet outperforms other algorithms in terms of accuracy, convergence speed, and ability to track changing optima using fewer function evaluations. The paper discusses areas for future work such as improving suppression algorithms and studying the impact of different mutation operators.
Walk through the steps of K-means and hierarchical clustering, and see how scaling affects the clustering under agglomerative and divisive modes.
Do let me know if anything is required. Ping me at google #bobrupakroy
This document describes an analysis of forest cover type data using three decision tree algorithms: Naive Bayes Tree, Reduced Error Pruning Tree, and J48 Tree. The goal is to determine which algorithm yields the highest classification accuracy and the optimal parameter settings and bin sizes for each algorithm. The data is explored and preprocessed, including binning real-valued features. Experiments are conducted to evaluate accuracy on training and test sets for different parameter settings and bin sizes. Results show bin sizes between 20-50 yield higher accuracy, and optimal training set parameters are not necessarily optimal for the test set.
Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. It is widely used in data mining applications. The k-means algorithm is one of the simplest clustering algorithms that partitions data into k predefined clusters, where each data point belongs to the cluster with the nearest mean. It works by assigning data points to their closest cluster centroid and recalculating the centroids until clusters stabilize. The k-medoids algorithm is similar but uses actual data points as centroids instead of means, making it more robust to outliers.
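The contrast drawn above between k-means and k-medoids (a mean versus an actual data point as the cluster center) can be made concrete. A minimal sketch in Python, with made-up one-dimensional data containing an outlier:

```python
def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

group = [10.0, 11.0, 12.0, 100.0]       # 100.0 is an outlier

center_mean = sum(group) / len(group)   # dragged far toward the outlier
center_medoid = medoid(group)           # stays an actual, central data point
```

The mean lands at 33.25, nowhere near the bulk of the data, while the medoid is 11.0, one of the real observations; this is exactly why k-medoids is more robust to outliers than k-means.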
This document is a machine learning class assignment submitted by Trushita Redij to their supervisor Abhishek Kaushik at Dublin Business School. The assignment discusses data preprocessing techniques, decision trees, the Chinese Restaurant algorithm, and building supervised learning models. Specifically, linear regression and KNN classification models are implemented on population data from Ireland to predict total population and classify countries.
The document discusses using OpenCL to accelerate genomic analysis through parallelization. It introduces OpenCL and provides examples of using it to parallelize algorithms for copy number inference in tumors, computing relatedness between individuals, and performing variable selection in regression. Key applications discussed include hidden Markov models for copy number inference, principal component analysis on relatedness matrices, and coordinate descent algorithms for lasso regression. Performance gains of up to 155x are reported for the parallel implementations compared to serial code.
This document discusses and compares five predictive data mining techniques: principal component analysis, correlation coefficient analysis, principal component regression, nonlinear partial least squares, and linear regression. It first provides background on data acquisition, preparation, and preprocessing techniques. It then describes each predictive technique, including how they handle issues like collinearity in datasets. Finally, it discusses how these techniques will be applied to four different datasets and the results compared to determine which technique best predicts the response variable while reducing variables.
This document describes using decision trees and linear regression for a statistical learning project on housing data. It discusses building decision trees and regression trees on latitude, longitude and other variables to predict housing prices. Linear regression performs poorly with an R-squared of 0.24, while regression trees more accurately identify areas with above-median home values. Further optimizing the regression tree with additional variables like income and population improves the model fit and predictions.
The document provides an overview of clustering methods and algorithms. It defines clustering as the process of grouping objects that are similar to each other and dissimilar to objects in other groups. It discusses existing clustering methods like K-means, hierarchical clustering, and density-based clustering. For each method, it outlines the basic steps and provides an example application of K-means clustering to demonstrate how the algorithm works. The document also discusses evaluating clustering results and different measures used to assess cluster validity.
Database Marketing - Dominick's Stores in the Chicago District, by Demin Wang
Determined two courses for the Dominick's transactional database analysis: one performed at the corporate level to facilitate a variety of corporate planning activities, and the other at the category level to improve sales performance and expand product offerings.
• Extracted one year of sales data from 109 Dominick's stores in the Chicago district and merged it with store demographic data.
• Analyzed the data in SAS using segmentation analysis (creating groups of stores similar in performance), response analysis (finding targetable characteristics of the identified store groups), and model validation (evaluating model performance on a 20% hold-out sample).
• Presented the results in a 25-page report, which discussed the evaluation of potential locations for a new store and the choice of stores in which to test-market a new product.
2. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
2
3. 1. OBJECTIVE
A. Creation of 2 sets of clusters: K-Means & Hierarchical
B. The clusters should be based on the mix of sales by:
i. Category and
ii. Avg. sales per sq. foot of space
3
5. 2. METHODOLOGY
a. Exploratory Data Analysis
The MEANS Procedure
Variable    N    N Miss  Minimum   Mean      Maximum   Std Dev  Sum
Cat1        515  0       120.00    231.82    340.00    66.61    119386.00
Cat2        515  0       52.00     150.82    247.00    56.66    77672.00
Cat3        515  0       33.00     81.60     212.00    28.44    42022.00
Cat4        515  0       90.00     134.37    166.00    20.21    69201.00
Sale        515  0       380.00    598.60    838.00    83.49    308281.00
Size        515  0       1200.00   2933.45   3650.00   437.20   1510725.00
Avg_Sales   515  0       0.11      0.21      0.50      0.05     108.34
• Since the given variables Cat1 – Cat4 are in absolute terms, additional variables PCAT1 – PCAT4 were calculated as percentages, to make them easier to interpret as relative variables
• Avg_Sales was also calculated as an additional variable
• Avg_Sales = Sale / Size
5
6. METHODOLOGY
a. Exploratory Data Analysis
Overall analysis
a. Sales from Category 1 are the highest amongst the four categories. Hence, Category 1 is the dominant category.
b. However, the standard deviation of Category 1 sales is also the highest amongst the four categories.
c. The standard deviation of store Size is 437.20, which is relatively high.
d. The mean store size across both states is 2933 sq. feet, the maximum being 3650 sq. feet.
e. Assuming that the Sale figures are in '000, the average sale figure per sq. foot across all categories in all stores is 210.
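The arithmetic behind point (e) can be checked directly; a small illustrative Python sketch using the PROC MEANS figures above:

```python
# Point (e): with Sale assumed to be in '000, the mean per-store ratio
# Avg_Sales = Sale / Size (0.21 from PROC MEANS) scales to ~210 per sq. foot.
mean_avg_sales = 0.21
per_sq_foot = mean_avg_sales * 1000
print(round(per_sq_foot, 1))  # 210.0

# The ratio of the overall means gives a slightly different figure, since a
# mean of per-store ratios is not the same as a ratio of means.
print(round(598.60 / 2933.45 * 1000, 1))  # 204.1
```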
6
7. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Creating additional variable: 'avg. sale per sq. foot' , PCAT1 PCAT2
PCAT3 PCAT4** ;
Data Stores_1 ;
Set Stores ;
Avg_Sales = Sale / Size ;
Run;
Data Stores_1 ;
Set Stores_1 ;
PCAT1 = (Cat1 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT2 = (Cat2 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT3 = (Cat3 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT4 = (Cat4 / (Cat1+Cat2+Cat3+Cat4))*100 ;
Run;
7
8. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Perform EDA** ;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
Run;
Proc Sort Data = Stores_1 ;
By State ;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
By State ;
Run;
8
9. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Perform EDA** ;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
By State ;
Run;
Proc FREQ Data = Stores_1 ;
Table State ;
Run ;
9
10. METHODOLOGY
a. Exploratory Data Analysis
State=KA
Variable    N    N Miss  Minimum   Mean     Maximum   Std Dev  Sum
PCAT1       282  0       20.44     38.19    62.91     8.91     10769.81
PCAT2       282  0       9.05      25.00    44.18     8.17     7049.84
PCAT3       282  0       4.76      13.73    33.07     5.01     3872.52
PCAT4       282  0       12.67     23.08    37.16     4.83     6507.84
Sale        282  0       380.00    594.77   838.00    83.67    167724.00
Size        282  0       1550.00   2935.23  3650.00   424.51   827735.00
Avg_Sales   282  0       0.12      0.21     0.45      0.05     58.83

State=TN
Variable    N    N Miss  Minimum   Mean     Maximum   Std Dev  Sum
PCAT1       233  0       20.58     38.70    59.25     8.67     9016.26
PCAT2       233  0       9.06      24.86    40.86     8.37     5791.24
PCAT3       233  0       4.86      13.81    25.23     4.72     3217.32
PCAT4       233  0       12.05     22.64    38.03     4.56     5275.18
Sale        233  0       395.00    603.25   796.00    83.21    140557.00
Size        233  0       1200.00   2931.29  3650.00   452.99   682990.00
Avg_Sales   233  0       0.11      0.21     0.50      0.05     49.51
A state-wise analysis of the variables reveals broadly the same patterns for both states, KA and TN.
Category 1 remains the dominant category in both states.
Although the average store size in the two states is roughly the same, a comparison of the minimum store sizes shows that state TN has a few smaller stores than state KA.
10
11. METHODOLOGY
a. Exploratory Data Analysis
The ranking of the four product categories at these stores is the same in both states, i.e., sales of Cat1 > Cat2 > Cat4 > Cat3.
The mean sale in state TN is higher than in state KA, though not significantly. This is due to the lower count of stores in TN compared to KA, because of which TN shows a slightly higher mean sale in spite of having lower total sales.
The total count of stores is higher in state KA (55%) than in state TN (45%).
The total volume of sales in state KA is higher than in state TN, which is as expected given KA's higher store count.
It may therefore be inferred that a few stores in state TN are smaller than the mean store size across both states, and that the average sale per sq. foot in these stores is high.
The average sale per sq. foot is roughly the same in both states.
11
13. 2. METHODOLOGY
b. Data Preparation
i. Scaling
The data was scaled, i.e., the following variables were standardised in order to bring them to a comparable level:
a. PCAT1
b. PCAT2
c. PCAT3
d. PCAT4
e. Avg_Sales
SAS Code:
**SCALING in order to standardize the variables** ;
Proc Standard Data = Stores_1 Mean = 0 Std = 1 Out = Store_2;
Var PCAT1-PCAT4 Avg_Sales;
Run;
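For readers outside SAS, the same z-score scaling can be sketched in plain Python (toy values standing in for one variable; the SAS step standardizes PCAT1–PCAT4 and Avg_Sales together):

```python
from statistics import mean, stdev

# Toy values standing in for one of the variables (e.g. PCAT1)
pcat1 = [38.2, 41.0, 33.5, 45.1, 36.0]

# Standardize to mean 0 and standard deviation 1, mirroring
# PROC STANDARD ... Mean = 0 Std = 1 (stdev uses n-1, as SAS does by default)
m, s = mean(pcat1), stdev(pcat1)
scaled = [(x - m) / s for x in pcat1]
```

After this step each variable contributes on a comparable scale to the Euclidean distances used by K-means.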
13
14. 2. METHODOLOGY
b. Data Preparation
ii Weighting
The variable ‘Avg Sales Per Sq. Foot’ was weighted over several iterations, as follows:

Iteration #   Weight Assigned
1             2
2             3
3             4
4             5

Summary of the results of the weighting iterations performed:
(Detailed results for all the iterations are available on the path: ‘Y:\Assignment - Clustering\Weighting’)
Cluster Summary: Iteration 1 (W=2)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        127        0.8789       4.2013                    3                3.3953                      3.86
2        3          1.1296       2.3982                    1                7.4713                      6.61
3        184        0.7815       3.2693                    4                2.5422                      3.25
4        201        0.9511       4.6159                    3                2.5422                      2.67

Cluster Summary: Iteration 2 (W=3)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        218        0.9394       4.5008                    2                3.0989                      3.30
2        212        1.0469       3.97                      1                3.0989                      2.96
3        82         0.9892       4.6592                    1                4.5079                      4.56
4        3          1.2361       2.6172                    3                10.062                      8.14
14
15. 2. METHODOLOGY
b. Data Preparation
Cluster Summary: Iteration 3 (W=4)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        3          1.3713       2.8962                    4                13.3713                     9.75
2        226        1.1425       4.8821                    3                4.0108                      3.51
3        205        0.9991       4.4853                    2                4.0108                      4.01
4        81         1.1858       5.6984                    3                5.8659                      4.95

Cluster Summary: Iteration 4 (W=5)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        202        1.0857       4.4648                    2                4.95                        4.56
2        229        1.241        5.8923                    1                4.95                        3.99
3        3          1.5277       3.2196                    4                16.71                       10.94
4        81         1.4006       6.8317                    1                7.28                        5.20
The Ratio mentioned above has been calculated using the Difference in Centroids (M) method, where:
M = D / d1
D = average distance between cluster centroids
d1 = average distance between cluster members and their centroid
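As a worked check (an observation about the tables, not stated in the slides): the tabulated Ratio values can be reproduced by dividing each cluster's Distance Between Centroids figure by its RMS Std Deviation. A minimal Python sketch:

```python
# Worked check of the Ratio column: the tabulated ratios match
# distance-to-nearest-centroid divided by RMS standard deviation.
def cluster_ratio(dist_to_nearest, rms_std):
    # Larger ratios mean the cluster is tight relative to its separation
    return dist_to_nearest / rms_std

# Cluster 1 of the W=2 iteration: 3.3953 / 0.8789
print(round(cluster_ratio(3.3953, 0.8789), 2))  # 3.86
# Cluster 2 of the W=2 iteration: 7.4713 / 1.1296
print(round(cluster_ratio(7.4713, 1.1296), 2))  # 6.61
```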
15
16. 2. METHODOLOGY
b. Data Preparation
SAS Code:
*1. Iteration 1 : Weight = 2* ;
Data Store_3 ;
Set Store_2 ;
Avg_Sales2 = Avg_Sales*2 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_3 Out = Cluster_1 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales2;
Run;
*2. Iteration 2 : Weight = 3* ;
Data Store_4 ;
Set Store_3 ;
Avg_Sales3 = Avg_Sales*3 ;
Run;
16
17. 2. METHODOLOGY
b. Data Preparation
SAS Code:
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_4 Out = Cluster_2 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales3;
Run;
*3. Iteration 3 : Weight = 4* ;
Data Store_5 ;
Set Store_4 ;
Avg_Sales4 = Avg_Sales*4 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_5 Out = Cluster_3 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales4;
Run;
17
18. 2. METHODOLOGY
b. Data Preparation
SAS Code:
*4. Iteration 4 : Weight = 5* ;
Data Store_6 ;
Set Store_5 ;
Avg_Sales5 = Avg_Sales*5 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_6 Out = Cluster_4 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
18
20. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS) : Creation of Preliminary Clusters
For detailed results of the preliminary cluster analysis and diagnostic plots, please refer to the path: Y:\Assignment - Clustering\Preliminary_Analysis_Outliers.xlsx
Cluster Summary
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids
1        14         0.7555       2.7913                    14               2.1932
2        32         0.6678       3.2557                    5                1.9093
3        1          .            0                         11               4.3786
4        60         0.8155       3.286                     19               2.2772
5        28         0.6731       2.5255                    2                1.9093
6        61         0.7495       2.7836                    19               2.6713
7        1          .            0                         13               4.3876
8        67         0.7811       2.8286                    10               2.2277
9        42         0.7811       2.4529                    4                2.8634
10       46         0.6996       2.5278                    8                2.2277
11       29         0.7186       2.5871                    5                2.2318
12       1          .            0                         13               3.9468
13       1          .            0                         12               3.9468
14       28         0.6919       2.5985                    1                2.1932
15       5          0.6989       1.9953                    18               2.3899
16       21         0.6852       2.6402                    18               2.1146
17       27         0.6957       2.2399                    5                2.198
18       9          0.6001       1.7932                    16               2.1146
19       29         0.7757       2.7927                    4                2.2772
20       13         0.6923       2.3191                    16               2.3297
Hence, Clusters 3, 7, 12 and 13 appear to be outliers, each containing only a single observation.
The remaining clusters appear to be reasonably sized.
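The singleton screen used here can be sketched outside SAS as well (illustrative Python; the labels are made up):

```python
from collections import Counter

# After a preliminary K-means run with a deliberately large number of clusters,
# clusters containing a single observation flag likely outliers.
labels = [1, 1, 2, 3, 1, 2, 4, 2, 2, 1]   # hypothetical cluster assignments
freq = Counter(labels)
singleton_clusters = [c for c, n in freq.items() if n == 1]
print(singleton_clusters)  # [3, 4]
```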
20
21. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
The following are the details of the clusters that have been identified as outliers:
Store_Num CLUSTER
36 3
225 7
360 12
179 13
Cluster=3
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 33.39 38.42 8.80 0.57
PCAT2 12.32 24.93 8.26 1.53
PCAT3 33.07 13.77 4.88 3.96
PCAT4 21.22 22.88 4.71 0.35
Avg_Sales 0.22 0.21 0.05 0.26
Cluster=7
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.14 38.42 8.80 0.20
PCAT2 32.03 24.93 8.26 0.86
PCAT3 4.76 13.77 4.88 1.85
PCAT4 23.08 22.88 4.71 0.04
Avg_Sales 0.45 0.21 0.05 4.50
Cluster=12
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.83 38.42 8.80 0.27
PCAT2 31.60 24.93 8.26 0.81
PCAT3 15.47 13.77 4.88 0.35
PCAT4 12.09 22.88 4.71 2.29
Avg_Sales 0.50 0.21 0.05 5.50
Cluster=13
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 45.55 38.42 8.80 0.81
PCAT2 15.66 24.93 8.26 1.12
PCAT3 20.11 13.77 4.88 1.30
PCAT4 18.68 22.88 4.71 0.89
Avg_Sales 0.47 0.21 0.05 4.91
21
22. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
Store_Num  CLUSTER  Cat1  Cat2  Cat3  Cat4  Size  Sale  State  Avg_Sales  PCAT1  PCAT2  PCAT3  PCAT4
36         3        214   79    212   136   2860  641   KA     224.13     33.39  12.32  33.07  21.22
225        7        287   229   34    165   1600  715   KA     446.88     40.14  32.03  4.76   23.08
360        12       314   243   119   93    1540  769   TN     499.35     40.83  31.60  15.47  12.09
179        13       256   88    113   105   1200  562   TN     468.33     45.55  15.66  20.11  18.68
• The average size of all the stores in the data set is 2933 sq. feet. Stores 225, 360 and 179 are considerably smaller than this.
• The average sale per sq. foot across all stores is 210, whereas for stores 225, 360 and 179 it is more than double the overall mean. This is due to the smaller size of these stores compared to the other stores.
• For store 36, CAT3 accounts for 33% of the store's total sales, whereas the average share of CAT3 across all stores is approx. 14%.
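The z-score screen behind these observations can be sketched as follows (values taken from the tables above; the slide's 4.50 for store 225 presumably comes from unrounded inputs):

```python
# Flag a store when a variable sits several population standard deviations
# away from the population mean.
def z_score(value, pop_mean, pop_std):
    return abs(value - pop_mean) / pop_std

# Store 225: Avg_Sales = 0.45 vs. population mean 0.21, std 0.05
z = z_score(0.45, 0.21, 0.05)
print(round(z, 1))   # 4.8 with these rounded inputs (the slide reports 4.50)
print(z > 3)         # True -- well beyond a 3-sigma threshold
```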
22
25. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
**Performing the Clustering procedure using K-Means with iterations to determine the optimal no. of clusters** ;
*Conducting a preliminary cluster analysis to detect outliers, if any* ;
Proc Fastclus Data = Store_6 Out = Cluster_Prelim Maxclusters = 20 Converge = 0 Outstat=Stat_Prelim_0;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
25
26. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Preparation of data set obtained from merging procedures in order to make a cluster wise analysis of
the outliers, if any* ;
Proc Sort Data = Cluster_Prelim ;
By Cluster;
Run;
Data Cluster_Pre_1 ;
Set Cluster_Prelim ;
Keep Store_Num Cluster ;
Run;
26
27. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Export Data = Cluster_Pre_1 outfile = 'Y:\Assignment - Clustering\Cluster_Pre_1.csv'
DBMS=CSV Replace ;
Run;
*Merging data set named Cluster_Pre_1 with data set Stores_1* ;
Proc Sort Data = Cluster_Pre_1 ;
By Store_Num ;
Run;
Proc Sort Data = Stores_1 ;
By Store_Num;
Run;
Data Store_1_Merged ;
Merge Cluster_Pre_1 (in=a) Stores_1 (in=b) ;
By Store_Num ;
If a and b ;
Run;
Proc Export Data = Store_1_Merged Outfile = 'Y:\Assignment - Clustering\Store_1_Merged.csv'
DBMS = CSV Replace ;
Run;
27
28. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Sort Data = Store_1_Merged ;
By Cluster ;
Run;
Proc Means Data = Store_1_Merged Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
By Cluster ;
Where Cluster IN(3,7,12,13) ;
Run;
Proc Means Data = Stores_1 Mean Std ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
Run;
Proc Means Data = Stores_1 Mean ;
Var Size Avg_Sales ;
Run;
28
29. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
An alternative approach for detection and treatment of outliers was attempted.
The following are the steps that were undertaken for the process of detection and treatment of outliers:
STEP 1: Run PROC FASTCLUS with many clusters and OUTSEED = output data set for the diagnostic plot.
(Detailed results are available on the path: ‘Y:\Assignment - Clustering\Prelim_Analysis_Step1_Mean1.xlsx’)
29
30. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
STEP 2: Remove low-frequency clusters.
The data set MEAN1, generated in the step above, was used to remove low-frequency clusters (frequency < 5); clusters with a frequency of 5 or more were retained for subsequent analysis.
The data set with clusters of frequency 5 or more was named 'Seed1'.
STEP 3: PROC FASTCLUS was run again, selecting seeds from the high-frequency clusters obtained in data set SEED1 in Step 2 above, using the LEAST = 1 clustering criterion.
The value of LEAST should be < 2 in order to reduce the effect of outliers on cluster centers.
(Detailed results are available on the path: ‘Y:\Assignment - Clustering\Prelim_Analysis_Step3_LEAST.xlsx’)
STEP 4: PROC FASTCLUS was run again, selecting seeds from the high-frequency clusters in the previous analysis, with STRICT = 3 to prevent outliers from distorting the results.
The value STRICT = 3 was chosen to be close to the _GAP_ and _RADIUS_ values of the larger clusters in the diagnostic plots.
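The rationale for LEAST < 2 can be illustrated with a toy one-dimensional example (illustrative Python, not from the slides): the least-squares (L2) centre of a set is its mean, which an outlier drags toward itself, while the L1-optimal centre is its median, which barely moves.

```python
import statistics

values = [10, 11, 12, 13, 95]          # four typical stores plus one extreme
print(statistics.mean(values))         # 28.2 -- the L2 centre, pulled by the outlier
print(statistics.median(values))       # 12   -- the L1 centre, essentially unaffected
```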
30
31. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
However, the STRICT option for PROC FASTCLUS is not supported in the present version of WPS.
Consequently, a final PROC FASTCLUS run, assigning the outliers and tails to clusters using the seeds that the STRICT run above would have generated, could not be performed.
SAS Code:
***Another method for identification and treatment of outliers*** ;
*STEP 1 : Run PROC FASTCLUS with many clusters and OUTSEED = output data set for
diagnostic plot*;
Proc Fastclus Data = Store_6 Outseed = Mean1 Maxclusters = 20 Maxiter = 0 Summary ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Axis1 Label = (Angle=90 Rotate=0) Minor=None Order=(0 to 10 by 2) ;
Axis2 minor = None ;
Proc Gplot Data = Mean1 ;
Plot _GAP_*_FREQ_ _RADIUS_*_FREQ_ / Overlay Frame
cframe = ligr vaxis = axis1 haxis=axis2 legend= legend1 ;
Run; 31
32. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
SAS Code:
*Step 2 :Remove Low Frequency clusters* ;
Data Seed1 ;
Set Mean1 ;
If _FREQ_ >=5 ;
Run;
*Step 3 : Run Proc Fastclus again selecting seeds from high frequency clusters in previous analysis using
LEAST = 1 Clustering Criterion since value < 2 reduce the effect of outliers on cluster centers* ;
Proc FASTCLUS Data = Store_6 Seed = Seed1 Maxclusters = 8 Least = 1 Out = Store_6_Least ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Legend1 Frame Cframe = ligr Label = None CBorder = Black
Position=Center Value= (Justify=Center) ;
Axis1 Label =(Angle=90 Rotate=0) Minor=None ;
Axis2 Minor=None ;
Proc Gplot Data = Store_6_Least ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
32
33. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
SAS Code:
Proc Gplot Data = Store_6_Least ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
*Step 4 : Run Proc Fastclus again, selecting seeds from high frequency clusters in previous analysis with STRICT = 3 to prevent the outliers from distorting the results* ;
*Value of STRICT = 3 is chosen to be close to the _GAP_ & _RADIUS_ values of the large clusters in the diagnostic plot* ;
Proc Fastclus Data = Store_6 Seed = Seed1 Maxclusters = 8 Strict=3 out = Store_6_Strict Outseed = Mean2 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
33
34. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
Performing iterations to determine the appropriate number of clusters using K-Means (PROC FASTCLUS)
From the procedure run in Step 1 of the alternative outlier-detection method discussed on the preceding slide, it was found that 8 or more clusters could give meaningful cluster formation.
Hence, the iterations below use Maxclusters = 8 to 10.
Clustering is performed on the data set from which the outliers have been removed.
Iteration 1 : Maxclusters = 8
Pseudo F Statistic: 451.71
Approx. expected overall R-Square: 0.84
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_1_Maxclust_8.xlsx

Iteration 2 : Maxclusters = 9
Pseudo F Statistic: 420.29
Approx. expected overall R-Square: 0.87
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_2_Maxclust_9.xlsx

Iteration 3 : Maxclusters = 10
Pseudo F Statistic: 391.10
Approx. expected overall R-Square: 0.88
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_3_Maxclust_10.xlsx
Points considered in comparing the three iterations above:
1. Relatively large values of the pseudo F statistic indicate a stopping point.
2. Higher values of overall R-Square are desirable.
3. Increasing the number of clusters, although there is not much differentiation amongst the iterations, means devising more marketing strategies unique to each cluster. On a cost vs. benefit basis, it is preferable to have a smaller number of clusters.
Hence, iteration 2, in which 9 clusters are formed, seems most appropriate in the present case.
34
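The pseudo F statistic compared above (also known as the Calinski–Harabasz index) can be computed by hand. A minimal Python sketch on synthetic one-dimensional data, not the store data:

```python
from statistics import mean

def pseudo_f(points, labels):
    # pseudo F = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k));
    # larger values indicate tighter, better-separated clusters.
    clusters = sorted(set(labels))
    grand = mean(points)
    wss = bss = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        cm = mean(members)
        wss += sum((p - cm) ** 2 for p in members)
        bss += len(members) * (cm - grand) ** 2
    k, n = len(clusters), len(points)
    return (bss / (k - 1)) / (wss / (n - k))

# Two well-separated toy clusters yield a very large pseudo F
x = [0, 1, 2, 50, 51, 52]
print(pseudo_f(x, [0, 0, 0, 1, 1, 1]))  # 3750.0
```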
35. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
*Deleting the outliers found (in the procedures above) from the scaled and weighted data set* ;
Data Store_6_Final ;
Set Store_6 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
*Iteration 1 : Maxclusters = 8 * ;
Proc FastClus Data = Store_6_Final Maxclusters = 8 Maxiter= 20 Converge = 0 Out=Clusters_8 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
35
39. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Clusters_10 ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
39
40. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
Proc Gplot Data = Clusters_10 ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Merging the data sets for analysis of the final clusters formed* ;
Proc Sort Data = Stores_1 ;
By Store_Num ;
Run;
Data Stores_1_Final ;
Set Stores_1 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
40
41. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
Data Cluster_9_Final ;
Set Clusters_9 ;
Keep Store_Num Cluster ;
Run;
Proc Sort Data = Cluster_9_Final ;
By Store_Num ;
Run;
Data Stores_1_Final_Merged ;
Merge Stores_1_Final (in=a) Cluster_9_Final (in=b);
By Store_Num ;
If a and b ;
Run;
41
43. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
Cluster Summary
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        46         0.8678       3.424                     3                3.3888                      3.91
2        57         0.9084       3.2855                    6                3.7655                      4.15
3        83         0.819        2.9486                    5                2.8481                      3.48
4        41         0.7917       2.8507                    6                2.8616                      3.61
5        99         0.8254       2.6116                    3                2.8481                      3.45
6        67         0.773        3.0078                    4                2.8616                      3.70
7        81         0.7999       2.9993                    4                2.9345                      3.67
8        7          0.8229       2.68                      9                3.2939                      4.00
9        30         0.7436       2.6002                    8                3.2939                      4.43
• Ratio has been calculated using the ‘Difference in Centroids’ method as D / d1, where:
D = average distance between cluster centroids
d1 = average distance between members and their cluster centroid
• The ratio thus signifies the strength of the clusters formed: it is a measure of the homogeneity within a cluster compared to the heterogeneity outside it.
• Cluster 9 is the strongest of all the clusters, followed by Clusters 2 and 8.
43
44. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
The 9 clusters obtained in the preliminary cluster analysis have been evaluated and profiled as under, in order to gain insight into the variables that are most dominant in the cluster formation. (Detailed output on path ‘Y:\Assignment - Clustering\Iteration_2_Maxclust_9.xlsx’)
Cluster=1
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 46 39.92 38.41 8.82 0.17
PCAT2 46 27.07 24.95 8.25 0.26
PCAT3 46 12.58 13.73 4.80 0.24
PCAT4 46 20.42 22.91 4.70 0.53
Avg_Sales_Final 46 272.14 208.80 48.77 1.30
Cluster=2
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 57 33.41 38.41 8.82 0.57
PCAT2 57 20.16 24.95 8.25 0.58
PCAT3 57 16.95 13.73 4.80 0.67
PCAT4 57 29.49 22.91 4.70 1.40
Avg_Sales_Final 57 142.44 208.80 48.77 1.36
Cluster=3
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 83 39.70 38.41 8.82 0.15
PCAT2 83 26.23 24.95 8.25 0.16
PCAT3 83 13.58 13.73 4.80 0.03
PCAT4 83 20.49 22.91 4.70 0.52
Avg_Sales_Final 83 236.59 208.80 48.77 0.57
Cluster=4
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 41 31.45 38.41 8.82 0.79
PCAT2 41 20.68 24.95 8.25 0.52
PCAT3 41 21.16 13.73 4.80 1.55
PCAT4 41 26.71 22.91 4.70 0.81
Avg_Sales_Final 41 183.11 208.80 48.77 0.53
Cluster=5
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 99 37.08 38.41 8.82 0.15
PCAT2 99 28.40 24.95 8.25 0.42
PCAT3 99 12.62 13.73 4.80 0.23
PCAT4 99 21.90 22.91 4.70 0.22
Avg_Sales_Final 99 207.18 208.80 48.77 0.03
Cluster=6
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
44
45. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
Cluster=7
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Cluster=8
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Cluster=9
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
45
46. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
1. 7% of the total stores (Clusters 8 and 9, with 7 and 30 stores respectively) have average sales per sq. foot significantly higher than the overall average.
2. 11% of the total stores (Cluster 2, with 57 stores) have significantly higher than average sales in the category of Tobacco & Alcohol.
3. 16% of the total stores (Cluster 3, with 83 stores) have lower than average sales in the category of Tobacco & Alcohol, though the difference is not significant.
4. 32% of the total stores (Clusters 5 and 6, with 99 and 67 stores respectively) have higher than average sales in the category of Frozen Foods. Average Frozen Foods sales in Cluster 6 are significantly higher than the overall mean for that category.
46
47. The FREQ Procedure
Table of CLUSTER by State (frequency and percent of total)

CLUSTER  KA            TN            Total
1        30  (5.87)    16  (3.13)    46  (9.00)
2        33  (6.46)    24  (4.70)    57  (11.15)
3        44  (8.61)    39  (7.63)    83  (16.24)
4        25  (4.89)    16  (3.13)    41  (8.02)
5        49  (9.59)    50  (9.78)    99  (19.37)
6        37  (7.24)    30  (5.87)    67  (13.11)
7        43  (8.41)    38  (7.44)    81  (15.85)
8        3   (0.59)    4   (0.78)    7   (1.37)
9        16  (3.13)    14  (2.74)    30  (5.87)
Total    280 (54.79)   231 (45.21)   511 (100.00)
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
The overall bias in the number of stores is towards the state KA with 55% of the total stores being in KA.
No other significant pattern in the distribution of stores has emerged.
47
48. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
Analysis Variable: Size
Cluster   N Obs   N   Mean   Population Mean   Population Std Dev   Z-Score   Minimum   Maximum
1 46 46 2471.52 2942.32 423.52 1.11 1700 2910
2 57 57 3334.74 2942.32 423.52 0.93 2550 3650
3 83 83 2761.39 2942.32 423.52 0.43 1925 3330
4 41 41 3040.98 2942.32 423.52 0.23 2180 3650
5 99 99 2985.56 2942.32 423.52 0.10 2000 3610
6 67 67 3184.63 2942.32 423.52 0.57 2200 3630
7 81 81 3172.10 2942.32 423.52 0.54 2600 3650
8 7 7 1977.14 2942.32 423.52 2.28 1550 2150
9 30 30 2205.33 2942.32 423.52 1.74 1750 2520
Approx. 7% of all the stores have a mean size significantly lower than the overall average size of all the stores.
The split of these stores between the two states is roughly even, and no discernible pattern is observed.
48
49. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9
clusters)
49
50. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9
clusters)
50
51. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
SAS Code:
*Profiling the 9 clusters obtained in the preceding procedures* ;
Proc Sort Data = Stores_1_Final_Merged ;
By Cluster ;
Run;
Data Stores_1_Final_Merged ;
Set Stores_1_Final_Merged;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
51
52. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
SAS Code:
Proc Means Data = Stores_1_Final_Merged N ;
Class State ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_1_Final_Merged ;
Tables Cluster * State / nocol norow ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var Size ;
Class Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged Mean Std ;
Var Size ;
Run;
52
53. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
53
54. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER)
With the 9 outlier-treated clusters obtained from the preliminary cluster analysis using the PROC FASTCLUS procedure (K-Means
clustering), Hierarchical Clustering is performed next using the PROC CLUSTER procedure to obtain the final no. of clusters.
The following methods are used for Hierarchical Clustering:
Note:
K, the smoothing parameter, is the no. of clusters obtained in the preliminary cluster analysis.
Literature suggests using n^0.3 preliminary clusters, where n = no. of observations in the original data set.
Hence, n^0.3 = 511^0.3 ≈ 6.5, so the analysis in the Density method has been done for 7 <= K <= 9.
S.No   Method           # of Clusters obtained   Remarks
1      Ward's Method    3                        The scatter diagram of the clusters obtained revealed cluster formations that were not well demarcated. Also, the profiling of these 3 clusters didn't reveal any variable that was dominant in the formation of the clusters. For detailed results, refer to the tabs named 'Output_Wards' & 'Output_Final_Profiling_W'.
2      Density Method                            Ties were observed while the Density method was used. Based on the position of the Ties in the Cluster History, the clusters obtained when K=7 were finalized.
       K=7              5                        For detailed results, refer to the tabs named 'Output_Density_K7' & 'Output_Final_Profiling_D'.
       K=8              4                        For detailed results, refer to the tab named 'Output_Density_K8'.
       K=9              5                        For detailed results, refer to the tab named 'Output_Density_K9'.
54
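The n^0.3 rule of thumb quoted in the note above is easy to verify; in Python for illustration (the deck's own code is SAS):

```python
# Rule of thumb for the density method's smoothing parameter K:
# use about n**0.3 preliminary clusters, where n is the number of observations.
n = 511  # stores in the data set
k_suggested = n ** 0.3
print(round(k_suggested, 1))  # ~6.5, hence the deck's scan over 7 <= K <= 9
```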
55. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’ (refer to tab named ‘Output_Wards’)
The Cluster History from Ward’s method is as below:
Cluster History
Number of Clusters   Clusters Joined   Freq of New Cluster   Semipartial RSq   RSquared   Pseudo F   Pseudo t-Squared   Approx. Expected RSq   CCC   Tie
8                    8, 9              37                    0.0054            0.99       13,000     .                  .                      .
7                    4, 6              108                   0.0184            0.98       3436       .                  .                      .
6                    1, 3              129                   0.0301            0.95       1772       .                  .                      .
5                    CL7, 7            189                   0.0322            0.91       1343       327                .                      .
4                    CL5, 5            288                   0.0438            0.87       1131       248                .                      .
3                    2, CL4            345                   0.0952            0.77       874        346                .                      .
2                    CL6, CL8          166                   0.1266            0.65       938        585                .                      .
1                    CL2, CL3          511                   0.6483            0.00       .          938                0                      0
# of clusters according to:
Pseudo t-Squared: 3, 2
Semipartial R-Square: 8, 7, 6, 5, 4, 3
Therefore, the final # of clusters considered on the basis of the results of Ward's Method = 3.
55
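The semipartial R-squared and R-squared columns in the cluster history above are internally consistent: with all 9 preliminary clusters kept, the between-cluster R-squared is 1, and each merge lowers it by exactly that merge's semipartial R-squared. A quick check in Python (values transcribed from the table; this is an illustration, not part of the deck's SAS workflow):

```python
# Semipartial R-squared for the Ward's-method merges, 9 clusters down to 1.
sprsq = [0.0054, 0.0184, 0.0301, 0.0322, 0.0438, 0.0952, 0.1266, 0.6483]

r2 = 1.0  # between-cluster R-squared with all 9 preliminary clusters kept
for s in sprsq:
    r2 -= s             # each merge gives up exactly its semipartial R-squared
    print(round(r2, 2)) # reproduces the R-squared column: 0.99, 0.98, ..., 0.65, then ~0
```

Note the semipartial values sum to 1.0, confirming the two columns describe the same decomposition of variance.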
56. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The Tree diagram from Ward’s method is as below:
56
57. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The following are the plots obtained from Ward’s method:
57
58. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The following are the plots obtained from Ward’s method:
58
59. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The 3 clusters obtained from Ward’s method have been profiled as below:
Analysis Variable: Cluster_Final
Cluster_Final N Obs N
1 103 103
2 242 242
3 166 166
Cluster_Final=1
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 103 36.32 38.41 8.82 0.24
PCAT2 103 23.25 24.95 8.25 0.21
PCAT3 103 15.00 13.73 4.80 0.26
PCAT4 103 25.44 22.91 4.70 0.54
Avg_Sales_Final 103 200.36 208.80 48.77 0.17
Cluster_Final=2
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 242 41.74 38.41 8.82 0.38
PCAT2 242 21.87 24.95 8.25 0.37
PCAT3 242 14.46 13.73 4.80 0.15
PCAT4 242 21.94 22.91 4.70 0.21
Avg_Sales_Final 242 222.93 208.80 48.77 0.29
Cluster_Final=3
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 166 34.85 38.41 8.82 0.40
PCAT2 166 30.50 24.95 8.25 0.67
PCAT3 166 11.88 13.73 4.80 0.39
PCAT4 166 22.76 22.91 4.70 0.03
Avg_Sales_Final 166 193.45 208.80 48.77 0.31
Table of Cluster_Final by State
Cluster_Final   KA            TN            Total
1               63 (12.33)    40 (7.83)     103 (20.16)
2               131 (25.64)   111 (21.72)   242 (47.36)
3               86 (16.83)    80 (15.66)    166 (32.49)
Total           280 (54.79)   231 (45.21)   511 (100.00)
(Cell format: Frequency (Percent of total))
Analysis Variable: Size
Cluster_Final   Mean Size   Popltn Mean   Popltn Std. Dev   Z-Score
1               2949.22     2942.32       423.52            0.016
2               2854.61     2942.32       423.52            0.207
3               3065.90     2942.32       423.52            0.292
Conclusion:
• Both the graphical plots and the summary stats of the 3 clusters
obtained using Ward’s method reveal no clear cluster formation.
• No particular variable was found to dominate the formation of
any of the 3 clusters.
59
60. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
**Hierarchical Clustering procedure being performed on the 9 preliminary clusters obtained using K-Means** ;
**The data set using which K-Means clustering was performed to obtain the preliminary 9 clusters has been treated for
outliers and hence doesn't contain any outliers** ;
*Ward's Method* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_W Method = Ward CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_W Horizontal Lines=(color=blue)
out = Tree_Out_9_W nclusters = 3 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_W ;
Run;
60
61. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
*Profiling of the Clusters formed using Ward's Method* ;
*Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary
cluster analysis have been mapped to the
final 3 clusters obtained by the Ward's method* ;
Data Stores_Final_Analysis_W ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 Then Cluster_Final_W = 1 ;
Else If Cluster = 5 OR Cluster = 6 Then Cluster_Final_W = 3 ;
Else If Cluster = 3 OR Cluster= 4 OR Cluster=7 OR Cluster= 8 OR Cluster= 9 Then Cluster_Final_W = 2 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_W ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N;
Var Cluster_Final_W;
Class Cluster_Final_W ;
Run;
61
62. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
Data Stores_Final_Analysis_W ;
Set Stores_Final_Analysis_W;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_W ;
Tables Cluster_Final_W*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var Size ;
By Cluster_Final_W ;
Run;
62
64. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The output for the Density method discussed below and in the following slides is for K=7.
(Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’. Refer to the tab named ‘Output_Density_K7’. For the output when
K=8 & K=9, refer to the tabs named ‘Output_Density_K8’ & ‘Output_Density_K9’.)
Cluster History
Number of Clusters   Clusters Joined   Freq of New Cluster   Semipartial RSq   RSquared   Pseudo F   Pseudo t-Squared   Approx. Expected RSq   CCC   Normalized Fusion Density   Lesser Density   Greater Density   Tie
8                    3, 5              182                   0.0324            0.97       2147       .                  .                      .     61.799                      44.7166          100
7                    CL8, 7            263                   0.0826            0.89       647        665                .                      .     38.79                       24.0617          100
6                    CL7, 1            309                   0.1255            0.76       319        335                .                      .     35.798                      21.8011          100               T
5                    CL6, 4            350                   0.0541            0.71       303        78.3               .                      .     35.798                      21.8011          100
4                    CL5, 6            417                   0.0911            0.61       269        128                .                      .     26                          14.9422          100
3                    CL4, 2            474                   0.1869            0.43       190        229                .                      .     7.2274                      3.7492           100
2                    CL3, 9            504                   0.3124            0.12       66.2       274                .                      .     6.0544                      3.1217           100
1                    CL2, 8            511                   0.1151            0.00       .          66.2               0                      0     2.1174                      1.07             100
# of clusters according to:
Pseudo t-Squared: 5, 4
Semipartial R-Square: 8, 7, 5, 4
Therefore, the final # of clusters considered in this iteration = 5.
Since the Tie occurs early in the cluster history, it should have little effect on the later stages and can hence be overlooked.
64
65. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The Tree diagram from the Density method when K=7 is as below:
65
66. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following are the plots obtained from the Density method when K=7:
66
67. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following are the plots obtained from the Density method when K=7:
67
68. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following elaborates on the profiles of the final 5 clusters obtained from the Density method:
Analysis Variable: Cluster_Final_D
Cluster_Final_D N Obs N
1 326 326
2 67 67
3 81 81
4 7 7
5 30 30
Cluster_Final_D=1
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 326 36.80 38.41 8.82 0.18
PCAT2 326 25.25 24.95 8.25 0.04
PCAT3 326 14.69 13.73 4.80 0.20
PCAT4 326 23.26 22.91 4.70 0.07
Avg_Sales_Final 326 209.48 208.80 48.77 0.01
• No particular variable has emerged as a dominating
variable responsible for the formation of this cluster.
• Mean values of the variables in this cluster are very
near to the overall mean scores of the variables in the
data set.
Legend:
Cat1 Fresh Foods
Cat2 Frozen Foods
Cat3 Health & Beauty
Cat4 Tobacco & Alcohol
Cluster_Final_D=2
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
PCAT2 (Frozen Foods) has emerged as the dominating variable
and is the most determining variable in the formation of this
cluster: these stores, nearly 13% of the total, have a mean Frozen
Foods sales share well above the overall mean of about 25%.
68
69. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
Cluster_Final_D=3
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Both PCAT1 and PCAT2 have emerged as the dominating
variables in this cluster, which covers nearly 16% of the total
no. of stores: the mean Fresh Foods share is well above the
overall mean, while the mean Frozen Foods share is well below it.
Cluster_Final_D=4
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Avg Sales per Sq. Foot has emerged as the dominating variable
in Cluster 4, with mean avg sales per sq. foot significantly higher
than the overall average; these stores account for nearly 1.4% of
the total no. of stores.
Cluster_Final_D=5
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
Avg Sales per Sq. Foot has emerged as the dominating variable
in Cluster 5, with mean avg sales per sq. foot significantly higher
than the overall average; these stores account for nearly 6% of
the total no. of stores.
69
70. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The FREQ Procedure
Table of Cluster_Final_D by State
Cluster_Final_D   KA            TN            Total
1                 181 (35.42)   145 (28.38)   326 (63.80)
2                 37 (7.24)     30 (5.87)     67 (13.11)
3                 43 (8.41)     38 (7.44)     81 (15.85)
4                 3 (0.59)      4 (0.78)      7 (1.37)
5                 16 (3.13)     14 (2.74)     30 (5.87)
Total             280 (54.79)   231 (45.21)   511 (100.00)
(Cell format: Frequency (Percent of total))
No specific pattern has emerged in the state-wise
analysis of the clusters formed.
Analysis Variable: Size
Cluster_Final_D   Mean Size   Popltn Mean   Popltn Std. Dev   Z-Score
1 2923.97 2942.32 423.52 0.04
2 3184.63 2942.32 423.52 0.57
3 3172.1 2942.32 423.52 0.54
4 1977.14 2942.32 423.52 2.28
5 2205.33 2942.32 423.52 1.74
• The average size of the stores in Cluster 4 is much smaller
than the overall average size of the stores in the given data
set.
• Hence, the avg sales per sq. foot for stores in this cluster is
significantly higher than the overall average sales per
sq. foot.
70
71. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*Density Method* ;
*K, the smoothing parameter is the no. of clusters obtained in the preliminary cluster analysis* ;
*Literature suggests using n^0.3 preliminary clusters where n=no. of observations in the original data set*;
*Hence, n^0.3 = 511^0.3 = 6.5 . So analysis in the Density method has been done for 7<=K<=9* ;
*K = 7* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D7 Method = Density K=7 CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_D7 Horizontal Lines=(color=blue)
out = Tree_Out_9_D7 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_D7 ;
Run;
71
72. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*Profiling of the Clusters formed using Density Method for K=7* ;
*Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary cluster
analysis have been mapped to the final 5 clusters obtained by the Density method* ;
Data Stores_Final_Analysis_D ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 OR Cluster = 3 OR Cluster = 4 OR Cluster = 5 Then Cluster_Final_D = 1 ;
Else If Cluster = 6 Then Cluster_Final_D = 2 ;
Else If Cluster = 7 Then Cluster_Final_D = 3 ;
Else If Cluster = 8 Then Cluster_Final_D = 4 ;
Else If Cluster = 9 Then Cluster_Final_D = 5 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_D ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N;
Var Cluster_Final_D;
Class Cluster_Final_D ;
Run;
72
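The DATA step above collapses the 9 preliminary clusters into the 5 final density-method clusters. The resulting cluster sizes can be sanity-checked against the profiling tables (326, 67, 81, 7, 30) using the store counts reported earlier in the deck; a small Python sketch for illustration:

```python
# Store counts of the 9 preliminary K-means clusters (from the FREQ table).
prelim_sizes = {1: 46, 2: 57, 3: 83, 4: 41, 5: 99, 6: 67, 7: 81, 8: 7, 9: 30}

# Mapping used in the DATA step: prelim clusters 1-5 -> final 1, 6 -> 2, 7 -> 3, 8 -> 4, 9 -> 5.
mapping = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5}

final_sizes = {}
for cluster, n in prelim_sizes.items():
    final = mapping[cluster]
    final_sizes[final] = final_sizes.get(final, 0) + n

print(final_sizes)  # {1: 326, 2: 67, 3: 81, 4: 7, 5: 30}, totalling all 511 stores
```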
73. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
Data Stores_Final_Analysis_D ;
Set Stores_Final_Analysis_D;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_D ;
Tables Cluster_Final_D*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var Size ;
By Cluster_Final_D ;
Run;
73
76. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*K = 9* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D9 Method = Density K=9 CCC
Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
Run;
Proc Tree Data = Tree_9_D9 Horizontal Lines=(color=blue)
out = Tree_Out_9_D9 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Proc Print Data = Tree_Out_9_D9 ;
Run;
76
77. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
77
78. 3. SUMMARY OF INSIGHTS
1. 16% of the stores have their mean sales from the Fresh Food category higher than the overall average in this category.
2. 13% of the stores have their mean sales from the Frozen Food category higher than the overall average in this category.
3. 16% of the stores have their mean sales from the Frozen Food category lower than the overall average in this category.
4. The % sales from the category of Health & Beauty in all the clusters formed above is nearly around the overall mean sales of this category.
5. Only 5% of the total no. of stores have their mean sales from the category Tobacco & Alcohol lower than the overall mean sales of this category.
6. 7% of the total stores have their average sales per sq. foot significantly higher than the overall average. The difference is particularly pronounced
for stores in Cluster 4, in which the average size of the stores is also much smaller than the overall average size.
7. 29% of the total stores have their average sales per sq. foot significantly lower than the overall average.
Cluster # No. of stores
3 81
Cluster # No. of stores
2 67
Cluster # No. of stores
3 81
Cluster # No. of stores
4 7
5 30
Cluster # No. of stores
2 67
3 81
78
79. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
79
80. 4. RECOMMENDATIONS
1. Cluster 3
a The size of stores in this cluster is higher than the average size of all stores, though the difference is not significant.
b When compared with the overall mean sales of all stores from the Fresh Foods category, the contribution to revenue from Fresh Foods
is highest for stores in this cluster.
c However, when compared with the overall mean sales of all stores from the Frozen Foods category, the contribution to revenue from Frozen Foods is
lowest for stores in this cluster.
d The average sales per square foot for stores in this cluster is also lower than the overall average sales per sq. foot of all stores.
e The above observations imply that although the Fresh Foods category contributes the most to sales, this contribution is not enough to
lift the overall sales of these stores, which remain below the average of all other stores despite the larger store size.
There is therefore a need to adopt techniques such as better placement of these products or a promotional campaign targeted specifically at products in this
category.
Strategies may also be devised for promoting sales from the Frozen Foods category, which are significantly lower than the overall average sales of this
category in other stores.
One possibility is that sales from the Fresh Foods category are cannibalizing sales from the Frozen Foods category, in which case an alternative shelf placement is
required.
80
81. 4. RECOMMENDATIONS
2. Cluster 2
a As compared with stores in Cluster 3, a contrasting situation is seen for stores in this cluster.
b Sales from the Frozen Foods category contribute the most to the overall revenue of stores in this cluster and are greater than the overall mean sales from this
category in all other stores, whereas sales from the Fresh Foods category are lower than the overall mean sales from this category in other stores.
c The average size of stores in this cluster is roughly the same as that of stores in Cluster 3 and is higher than the overall mean size of other stores.
d However, the average sales per sq. foot is lower than the overall average sales per sq. foot of other stores, and also lower than that of stores
in Cluster 3.
e Hence, strategies similar to those proposed for stores in Cluster 3 may be replicated for stores in Cluster 2 to promote sales from both the Fresh Foods and
Frozen Foods categories.
This may be done after gaining insight into the factors that are driving Frozen Foods sales in stores of Cluster 2 and Fresh Foods sales in stores of Cluster 3.
81
82. 4. RECOMMENDATIONS
3. Cluster 1
a Cluster 1 contains the largest no. of stores, roughly 64% of the total.
b Sales from all 4 product categories for stores in this cluster are very close to the overall mean sales of each of the four categories across all stores.
c The average size of the stores in this cluster is also very close to the overall average size of all stores.
d Since this cluster contains the highest and a significant % of the stores, promotional activities adopted across these stores can also help
significantly increase the overall sales volume of Retailer X.
4. Cluster 4
a This cluster houses only 1% of the total stores, with the only differentiating factor being the average sales per sq. foot, which is significantly higher than the
overall average for other stores.
b The mean sales of products in each of the 4 categories are very similar to the overall mean sales of those categories.
c Hence, the most likely reason for the significantly higher average sales per sq. foot is the smaller-than-average size of the stores.
No specific state-wise pattern has emerged for these stores, the distribution being fairly consistent across both states, KA & TN.
82
83. 4. RECOMMENDATIONS
5. Cluster 5
a This cluster houses nearly 6% of the total stores.
b The distribution of variables for stores in this cluster is very similar to that for stores in Cluster 4.
c However, it may be noted that sales from the Tobacco & Alcohol category are lower than the overall mean sales of other stores from this category.
6. Having identified the drivers of sales in stores of each of the 5 clusters, it is next important to understand the other factors that influence each of these
drivers.
Inclusion of demographic factors such as age, income, location, gender, etc. as additional variables may give better insight into the promotional strategies, unique to
each cluster, that may be adopted for increasing sales.
83