CLUSTERING – GROCERY STORES OF RETAILER X IN KARNATAKA & TAMIL NADU
CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
1. OBJECTIVE
A. Creation of 2 sets of clusters: K-Means & Hierarchical
B. The clusters should be based on mix of sales by:
i. Category and
ii. Avg. sales per sq. foot of space
2. METHODOLOGY
a. Exploratory Data Analysis
The MEANS Procedure
Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
Cat1       Cat1   515  0      120.00   231.82   340.00   66.61    119386.00
Cat2       Cat2   515  0      52.00    150.82   247.00   56.66    77672.00
Cat3       Cat3   515  0      33.00    81.60    212.00   28.44    42022.00
Cat4       Cat4   515  0      90.00    134.37   166.00   20.21    69201.00
Sale       Sale   515  0      380.00   598.60   838.00   83.49    308281.00
Size       Size   515  0      1200.00  2933.45  3650.00  437.20   1510725.00
Avg_Sales         515  0      0.11     0.21     0.50     0.05     108.34
• The given variables Cat1 – Cat4 are in absolute terms, so additional variables PCAT1 – PCAT4 were calculated as percentages of each store's total category sales, which makes the category mix comparable across stores
• Avg_Sales was also calculated as an additional variable
• Avg_Sales = Sale / Size
Overall analysis
a. Sales from Category 1 are the highest amongst all the four categories of sales. Hence,
Category 1 is the dominating category.
b. However, the standard deviation in the amount of sales from Category 1 is also the
highest amongst all four categories of Sales.
c. The standard deviation in Size of the stores is 437.20 sq feet, roughly 15% of the mean size.
d. The mean size of the stores across both states is 2933 sq feet, the maximum being 3650 sq feet.
e. Assuming that the Sale figures are in '000, the average sale per sq foot across all categories in all stores is 210 (the mean Avg_Sales of 0.21 x 1000).
SAS Code:
**Creating additional variable: 'avg. sale per sq. foot' , PCAT1 PCAT2
PCAT3 PCAT4** ;
Data Stores_1 ;
Set Stores ;
Avg_Sales = Sale / Size ;
Run;
Data Stores_1 ;
Set Stores_1 ;
PCAT1 = (Cat1 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT2 = (Cat2 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT3 = (Cat3 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT4 = (Cat4 / (Cat1+Cat2+Cat3+Cat4))*100 ;
Run;
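The derived variables can be cross-checked outside SAS. The following is a minimal Python sketch, not part of the original analysis; it uses the figures for store 36 as they appear in the outlier table of this deck, and the function name `derive` is hypothetical.

```python
# Store 36 figures as listed in the outlier table of this deck.
store = {"Cat1": 214, "Cat2": 79, "Cat3": 212, "Cat4": 136, "Sale": 641, "Size": 2860}


def derive(row):
    """Return Avg_Sales and PCAT1..PCAT4 for one store record."""
    cats = [row[f"Cat{i}"] for i in range(1, 5)]
    total = sum(cats)
    out = {"Avg_Sales": row["Sale"] / row["Size"]}
    for i, value in enumerate(cats, start=1):
        # Share of category i in the store's total category sales, in percent.
        out[f"PCAT{i}"] = value / total * 100
    return out


derived = derive(store)
print({name: round(value, 2) for name, value in derived.items()})
```

The four PCAT values of a store always sum to 100, so only three of them carry independent information.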
**Perform EDA** ;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
Run;
Proc Sort Data = Stores_1 ;
By State ;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
By State ;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
By State ;
Run;
Proc FREQ Data = Stores_1 ;
Table State ;
Run ;
State=KA
Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
PCAT1             282  0      20.44    38.19    62.91    8.91     10769.81
PCAT2             282  0      9.05     25.00    44.18    8.17     7049.84
PCAT3             282  0      4.76     13.73    33.07    5.01     3872.52
PCAT4             282  0      12.67    23.08    37.16    4.83     6507.84
Sale       Sale   282  0      380.00   594.77   838.00   83.67    167724.00
Size       Size   282  0      1550.00  2935.23  3650.00  424.51   827735.00
Avg_Sales         282  0      0.12     0.21     0.45     0.05     58.83
State=TN
Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
PCAT1             233  0      20.58    38.70    59.25    8.67     9016.26
PCAT2             233  0      9.06     24.86    40.86    8.37     5791.24
PCAT3             233  0      4.86     13.81    25.23    4.72     3217.32
PCAT4             233  0      12.05    22.64    38.03    4.56     5275.18
Sale       Sale   233  0      395.00   603.25   796.00   83.21    140557.00
Size       Size   233  0      1200.00  2931.29  3650.00  452.99   682990.00
Avg_Sales         233  0      0.11     0.21     0.50     0.05     49.51
A state-wise analysis of the variables reveals broadly the same patterns in both states, KA & TN.
Category 1 remains the dominating category in both states.
Although the average size of stores in both states is roughly the same, the minimum store size is lower in TN (1200 sq feet) than in KA (1550 sq feet), which suggests that TN has a few smaller stores.
The ranking of the four categories of products remains the same in both states, i.e., share of sales of Cat1 > Cat2 > Cat4 > Cat3.
The mean sale per store in TN (603.25) is slightly higher than that in KA (594.77), though not significantly: the gap of about 1.4% is small compared with the standard deviation of about 83 in either state.
KA has more stores (282, or 55% of the total) than TN (233, or 45%).
The total volume of sales in KA is therefore higher than that in TN, which is on expected lines given the higher count of stores in KA.
The lower minimum store size and the higher maximum Avg_Sales in TN suggest that TN possibly has a few stores that are smaller than the mean size and in which the average sale per sq. foot is high.
The average sale per sq. foot is roughly the same in both states.
2. METHODOLOGY
b. Data Preparation
i. Scaling
The data was scaled i.e., the following variables were normalised in order to bring them to a comparable level:
a. PCAT1
b. PCAT2
c. PCAT3
d. PCAT4
e. Avg_Sales
SAS Code:
**SCALING in order to standardize the variables** ;
Proc Standard Data = Stores_1 Mean = 0 Std = 1 Out = Store_2;
Var PCAT1-PCAT4 Avg_Sales;
Run;
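For readers without SAS, the scaling step can be sketched in Python. This is an illustrative sketch only: the input values are hypothetical, the function name `standardize` is not from the deck, and it assumes the n - 1 divisor for the standard deviation, which is believed to be the PROC STANDARD default.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / std for v in values]


# Hypothetical Avg_Sales values, for illustration only (not taken from the data set).
avg_sales = [0.11, 0.18, 0.21, 0.24, 0.50]
scaled = standardize(avg_sales)
print([round(v, 3) for v in scaled])
```

After this step every clustering variable has the same spread, so no variable dominates the Euclidean distances merely because of its units.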
ii. Weighting
The variable ‘Avg Sales Per Sq. Foot’ was weighted over several iterations as follows:
Iteration #  Weight Assigned
1            2
2            3
3            4
4            5
Summary of the results of the weighting iterations performed above:
(Detailed results for all the iterations performed above are available on the path: ‘Y:Assignment - ClusteringWeighting’)
Cluster Summary: Iteration 1 W=2
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids         Ratio
1        127        0.8789     4.2013                      3        3.3953            3.86
2        3          1.1296     2.3982                      1        7.4713            6.61
3        184        0.7815     3.2693                      4        2.5422            3.25
4        201        0.9511     4.6159                      3        2.5422            2.67
Cluster Summary: Iteration 2 W=3
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids         Ratio
1        218        0.9394     4.5008                      2        3.0989            3.30
2        212        1.0469     3.97                        1        3.0989            2.96
3        82         0.9892     4.6592                      1        4.5079            4.56
4        3          1.2361     2.6172                      3        10.062            8.14
Cluster Summary: Iteration 3 W=4
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids         Ratio
1        3          1.3713     2.8962                      4        13.3713           9.75
2        226        1.1425     4.8821                      3        4.0108            3.51
3        205        0.9991     4.4853                      2        4.0108            4.01
4        81         1.1858     5.6984                      3        5.8659            4.95
Cluster Summary: Iteration 4 W=5
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids         Ratio
1        202        1.0857     4.4648                      2        4.95              4.56
2        229        1.241      5.8923                      1        4.95              3.99
3        3          1.5277     3.2196                      4        16.71             10.94
4        81         1.4006     6.8317                      1        7.28              5.20
The Ratio mentioned above has been calculated using the Difference in Centroids (M) method where:
M = D / d1
D = Average distance between cluster centroids
d1 = Average distance between cluster members and centroid
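The tabulated Ratio values appear consistent with dividing each cluster's ‘Distance Between Cluster Centroids’ by its ‘RMS Std Deviation’. That reading is inferred from the numbers and is not stated in the slides. A small Python sketch, using the Iteration 1 (W=2) figures and a hypothetical helper name `separation_ratio`, reproduces the reported values:

```python
# (cluster, RMS Std Deviation, Distance Between Cluster Centroids, reported Ratio)
# Figures copied from the Iteration 1 (W=2) cluster summary.
iteration_1 = [
    (1, 0.8789, 3.3953, 3.86),
    (2, 1.1296, 7.4713, 6.61),
    (3, 0.7815, 2.5422, 3.25),
    (4, 0.9511, 2.5422, 2.67),
]


def separation_ratio(rms_std, centroid_distance):
    """Distance to the nearest centroid relative to the within-cluster spread."""
    return centroid_distance / rms_std


recomputed = {
    cluster: round(separation_ratio(rms, dist), 2)
    for cluster, rms, dist, _ in iteration_1
}
reported = {cluster: ratio for cluster, _, _, ratio in iteration_1}
print(recomputed == reported)
```

A larger ratio means the cluster sits further from its nearest neighbour relative to its own spread.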
SAS Code:
*1. Iteration 1 : Weight = 2* ;
Data Store_3 ;
Set Store_2 ;
Avg_Sales2 = Avg_Sales*2 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_3 Out = Cluster_1 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales2;
Run;
*2. Iteration 2 : Weight = 3* ;
Data Store_4 ;
Set Store_3 ;
Avg_Sales3 = Avg_Sales*3 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_4 Out = Cluster_2 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales3;
Run;
*3. Iteration 3 : Weight = 4* ;
Data Store_5 ;
Set Store_4 ;
Avg_Sales4 = Avg_Sales*4 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_5 Out = Cluster_3 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales4;
Run;
*4. Iteration 4 : Weight = 5* ;
Data Store_6 ;
Set Store_5 ;
Avg_Sales5 = Avg_Sales*5 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_6 Out = Cluster_4 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
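The effect of the weights can be illustrated with a short Python sketch. Multiplying a standardized variable by a weight w multiplies its differences by w, so its contribution to the squared Euclidean distance used by K-Means grows by w squared. The two stores below are hypothetical, and the helper `distance` is not part of the original code.

```python
import math


def distance(a, b, weights):
    """Euclidean distance after multiplying each coordinate by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))


# Two hypothetical standardized stores: (PCAT1, PCAT2, PCAT3, PCAT4, Avg_Sales).
# They differ only in Avg_Sales, by one standard deviation.
store_a = (0.5, -0.2, 0.1, 0.0, 1.0)
store_b = (0.5, -0.2, 0.1, 0.0, 0.0)

for w in (1, 2, 3, 4, 5):
    weights = (1, 1, 1, 1, w)
    print(w, distance(store_a, store_b, weights))
```

With a weight of 5, a gap of one standard deviation in Avg_Sales counts as much as a gap of five standard deviations in any single category share.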
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS) : Creation of Preliminary Clusters
For detailed results of the preliminary cluster analysis and diagnostic plots, please refer to the path: Y:Assignment - ClusteringPreliminary_Analysis_Outliers.xlsx
Cluster Summary
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids
1        14         0.7555     2.7913                      14       2.1932
2        32         0.6678     3.2557                      5        1.9093
3        1          .          0                           11       4.3786
4        60         0.8155     3.286                       19       2.2772
5        28         0.6731     2.5255                      2        1.9093
6        61         0.7495     2.7836                      19       2.6713
7        1          .          0                           13       4.3876
8        67         0.7811     2.8286                      10       2.2277
9        42         0.7811     2.4529                      4        2.8634
10       46         0.6996     2.5278                      8        2.2277
11       29         0.7186     2.5871                      5        2.2318
12       1          .          0                           13       3.9468
13       1          .          0                           12       3.9468
14       28         0.6919     2.5985                      1        2.1932
15       5          0.6989     1.9953                      18       2.3899
16       21         0.6852     2.6402                      18       2.1146
17       27         0.6957     2.2399                      5        2.198
18       9          0.6001     1.7932                      16       2.1146
19       29         0.7757     2.7927                      4        2.2772
20       13         0.6923     2.3191                      16       2.3297
Clusters # 3, 7, 12 and 13 each contain only a single observation and hence appear as outliers.
The remaining clusters appear to be reasonably sized.
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
The following are the details of the clusters that have been identified as outliers:
Store_Num CLUSTER
36 3
225 7
360 12
179 13
Cluster=3
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 33.39 38.42 8.80 0.57
PCAT2 12.32 24.93 8.26 1.53
PCAT3 33.07 13.77 4.88 3.96
PCAT4 21.22 22.88 4.71 0.35
Avg_Sales 0.22 0.21 0.05 0.26
Cluster=7
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.14 38.42 8.80 0.20
PCAT2 32.03 24.93 8.26 0.86
PCAT3 4.76 13.77 4.88 1.85
PCAT4 23.08 22.88 4.71 0.04
Avg_Sales 0.45 0.21 0.05 4.50
Cluster=12
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.83 38.42 8.80 0.27
PCAT2 31.60 24.93 8.26 0.81
PCAT3 15.47 13.77 4.88 0.35
PCAT4 12.09 22.88 4.71 2.29
Avg_Sales 0.50 0.21 0.05 5.50
Cluster=13
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 45.55 38.42 8.80 0.81
PCAT2 15.66 24.93 8.26 1.12
PCAT3 20.11 13.77 4.88 1.30
PCAT4 18.68 22.88 4.71 0.89
Avg_Sales 0.47 0.21 0.05 4.91
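The Z-Score columns above are shown as absolute values, i.e., the distance of the cluster mean from the population mean in units of the population standard deviation. A minimal Python sketch (illustrative only; the helper name `z_score` is hypothetical) recomputes two of the Cluster 3 figures from the rounded values printed in the tables. Small differences in the second decimal are expected because the slides were presumably computed from unrounded inputs.

```python
def z_score(value, pop_mean, pop_std):
    """Signed number of population standard deviations between value and mean."""
    return (value - pop_mean) / pop_std


# Rounded population figures from the tables above: name -> (mean, std dev).
population = {"PCAT2": (24.93, 8.26), "PCAT3": (13.77, 4.88)}
# Cluster 3 consists of store 36 only.
store_36 = {"PCAT2": 12.32, "PCAT3": 33.07}

z = {name: z_score(store_36[name], *population[name]) for name in store_36}
for name, value in z.items():
    print(name, round(value, 2), "absolute:", round(abs(value), 2))
```

The sign, which the tables drop, shows the direction: store 36 is well above the population on PCAT3 and below it on PCAT2.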
Store_Num  CLUSTER  Cat1  Cat2  Cat3  Cat4  Size  Sale  State  Avg_Sales  PCAT1  PCAT2  PCAT3  PCAT4
36         3        214   79    212   136   2860  641   KA     224.13     33.39  12.32  33.07  21.22
225        7        287   229   34    165   1600  715   KA     446.88     40.14  32.03  4.76   23.08
360        12       314   243   119   93    1540  769   TN     499.35     40.83  31.60  15.47  12.09
179        13       256   88    113   105   1200  562   TN     468.33     45.55  15.66  20.11  18.68
• The average size of all the stores in the data set is 2933 sq. feet. Stores # 225, 360 & 179 (1600, 1540 and 1200 sq. feet) are considerably smaller.
• Avg Sales Per Sq. Foot across all the stores is 210, whereas for stores # 225, 360 & 179 it is more than double the overall mean. This is due to the smaller size of these stores compared with the other stores.
• For store # 36, CAT3 accounts for 33% of the store's total sales, whereas the average share of CAT3 across all the stores is approx. 14%.
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers, Diagnostic Plots
[Figures: diagnostic plots of the preliminary clusters]
SAS Code:
**Performing the Clustering procedure using K-Means with iterations to determine the optimal no. of clusters** ;
*Conducting a preliminary cluster analysis to detect outliers, if any* ;
Proc Fastclus Data = Store_6 Out = Cluster_Prelim Maxclusters = 20 Converge = 0 Outstat=Stat_Prelim_0;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Preparation of data set obtained from merging procedures in order to make a cluster wise analysis of
the outliers, if any* ;
Proc Sort Data = Cluster_Prelim ;
By Cluster;
Run;
Data Cluster_Pre_1 ;
Set Cluster_Prelim ;
Keep Store_Num Cluster ;
Run;
Proc Export Data = Cluster_Pre_1 outfile = 'Y:Assignment - ClusteringCluster_Pre_1.csv'
DBMS=CSV Replace ;
Run;
*Merging data set named Cluster_Pre_1 with data set Stores_1* ;
Proc Sort Data = Cluster_Pre_1 ;
By Store_Num ;
Run;
Proc Sort Data = Stores_1 ;
By Store_Num;
Run;
Data Store_1_Merged ;
Merge Cluster_Pre_1 (in=a) Stores_1 (in=b) ;
By Store_Num ;
If a and b ;
Run;
Proc Export Data = Store_1_Merged Outfile = 'Y:Assignment - ClusteringStore_1_Merged.csv'
DBMS = CSV Replace ;
Run;
Proc Sort Data = Store_1_Merged ;
By Cluster ;
Run;
Proc Means Data = Store_1_Merged Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
By Cluster ;
Where Cluster IN(3,7,12,13) ;
Run;
Proc Means Data = Stores_1 Mean Std ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
Run;
Proc Means Data = Stores_1 Mean ;
Var Size Avg_Sales ;
Run;
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers (an alternative approach)
An alternative approach for detection and treatment of outliers was attempted.
The following are the steps that were undertaken for the process of detection and treatment of outliers:
STEP 1: Run Proc FASTCLUS with many clusters and OUTSEED = output data set for diagnostic plot
(Detailed results are available on the following path: ‘Y:Assignment - ClusteringPrelim_Analysis_Step1_Mean1.xlsx’)
STEP 2: Remove low frequency clusters
The data set, MEAN1, generated in the above step was used to remove low frequency clusters ( < 5) and clusters with a
frequency of 5 or more were retained for subsequent analysis.
The data set with clusters having 5 or more frequency was named as 'Seed1'.
STEP 3: Proc FASTCLUS was run again selecting seeds from high frequency clusters obtained in data set SEED1 in Step 2
above using LEAST = 1 Clustering Criterion
Value for LEAST should be < 2 in order to reduce the effect of outliers on cluster centers
(Detailed results are available on the following path: ‘Y:Assignment - ClusteringPrelim_Analysis_Step3_LEAST.xlsx’)
STEP 4: Proc FASTCLUS was run again selecting seeds from high frequency clusters in previous analysis with STRICT=3
preventing outliers from distorting the results
Value of STRICT = 3 was chosen to be close to _GAP_ & _RADIUS_ values of the larger clusters in the diagnostic plots.
However, the STRICT option is not supported for Proc FASTCLUS in the present version of WPS.
Consequently, the final Proc FASTCLUS run, which would have assigned outliers and tails to clusters using the seeds generated with the STRICT option, could not be performed.
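Step 2 above is a simple frequency filter. The Python sketch below illustrates it with the cluster frequencies from the 20-cluster preliminary run reported earlier in this deck. This is an assumption made for illustration: the MEAN1 data set from the MAXITER=0 run in Step 1 may contain different frequencies. The names `MIN_FREQ`, `seeds` and `dropped` are hypothetical.

```python
# Cluster number -> frequency, copied from the 20-cluster preliminary summary.
# Assumption: used here only to illustrate the filter; MEAN1 may differ.
frequencies = {
    1: 14, 2: 32, 3: 1, 4: 60, 5: 28, 6: 61, 7: 1, 8: 67, 9: 42, 10: 46,
    11: 29, 12: 1, 13: 1, 14: 28, 15: 5, 16: 21, 17: 27, 18: 9, 19: 29, 20: 13,
}

MIN_FREQ = 5  # clusters with fewer observations are not used as seeds

seeds = sorted(c for c, n in frequencies.items() if n >= MIN_FREQ)
dropped = sorted(c for c, n in frequencies.items() if n < MIN_FREQ)

print("retained as seeds:", seeds)
print("dropped:", dropped)
```

On these figures the filter drops exactly the four single-observation clusters and keeps 16 candidate seeds.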
SAS Code:
***Another method for identification and treatment of outliers*** ;
*STEP 1 : Run PROC FASTCLUS with many clusters and OUTSEED = output data set for
diagnostic plot*;
Proc Fastclus Data = Store_6 Outseed = Mean1 Maxclusters = 20 Maxiter = 0 Summary ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Axis1 Label = (Angle=90 Rotate=0) Minor=None Order=(0 to 10 by 2) ;
Axis2 minor = None ;
Proc Gplot Data = Mean1 ;
Plot _GAP_*_FREQ_ _RADIUS_*_FREQ_ / Overlay Frame
cframe = ligr vaxis = axis1 haxis=axis2 legend= legend1 ;
Run;
*Step 2 :Remove Low Frequency clusters* ;
Data Seed1 ;
Set Mean1 ;
If _FREQ_ >=5 ;
Run;
*Step 3 : Run Proc Fastclus again selecting seeds from high frequency clusters in previous analysis using
LEAST = 1 Clustering Criterion since value < 2 reduce the effect of outliers on cluster centers* ;
Proc FASTCLUS Data = Store_6 Seed = Seed1 Maxclusters = 8 Least = 1 Out = Store_6_Least ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Legend1 Frame Cframe = ligr Label = None CBorder = Black
Position=Center Value= (Justify=Center) ;
Axis1 Label =(Angle=90 Rotate=0) Minor=None ;
Axis2 Minor=None ;
Proc Gplot Data = Store_6_Least ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
*Step 4 : Run Proc Fastclus again, selecting seeds from high frequency clusters in previous analysis with STRICT = 3 to prevent the outliers from distorting the results* ;
*Value of STRICT = 3 is chosen to be close to the _GAP_ & _RADIUS_ values of the large clusters in the diagnostic plot* ;
Proc Fastclus Data = Store_6 Seed = Seed1 Maxclusters = 8 Strict=3 out = Store_6_Strict Outseed = Mean2 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
Performing iterations to determine the appropriate no. of clusters using K-Means (PROC FASTCLUS)
The procedure run in Step 1 of the alternative outlier-detection method discussed in the preceding slides suggested that 8 or more clusters could give meaningful cluster formation.
Hence, the iterations below use Maxclusters = 8, 9 and 10.
Clustering is performed on the data set from which the outliers have been removed.
Iteration 1 : Maxclusters = 8
Statistic used for comparison  Name (as it appears in the output)  Value_Iteration1
Pseudo F Stat                  Pseudo F Statistic                  451.71
Approx. Expected overall R^2   Overall R-Square                    0.84
Detailed output and plots on path Y:Assignment - ClusteringIteration_1_Maxclust_8.xlsx
Iteration 2 : Maxclusters = 9
Statistic used for comparison  Name (as it appears in the output)  Value_Iteration2
Pseudo F Stat                  Pseudo F Statistic                  420.29
Approx. Expected overall R^2   Overall R-Square                    0.87
Detailed output and plots on path Y:Assignment - ClusteringIteration_2_Maxclust_9.xlsx
Iteration 3 : Maxclusters = 10
Statistic used for comparison  Name (as it appears in the output)  Value_Iteration3
Pseudo F Stat                  Pseudo F Statistic                  391.10
Approx. Expected overall R^2   Overall R-Square                    0.88
Detailed output and plots on path Y:Assignment - ClusteringIteration_3_Maxclust_10.xlsx
Points considered for a comparison of the above 3 iterations:
1 Relatively large values of the Pseudo F Stat indicate a stopping point
2 Higher values of overall R-Square are desirable
3 Each additional cluster requires a marketing strategy unique to it, so adding clusters that are not well differentiated from the existing ones raises cost without much benefit.
Going from 8 to 9 clusters raises overall R-Square from 0.84 to 0.87, while going from 9 to 10 adds only 0.01 and lowers the Pseudo F Stat further.
Given a cost vs. benefit analysis, it is preferable to have a smaller no. of clusters.
Hence, iteration 2, wherein 9 clusters are formed, seems most appropriate in the present case.
SAS Code:
*Deleting the outliers found (in the procedures above) from the scaled and weighted data set* ;
Data Store_6_Final ;
Set Store_6 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
*Iteration 1 : Maxclusters = 8 * ;
Proc FastClus Data = Store_6_Final Maxclusters = 8 Maxiter= 20 Converge = 0 Out=Clusters_8 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Clusters_8 ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_8 ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_8 ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_8 ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Iteration 2 : Maxclusters = 9 * ;
Proc FastClus Data = Store_6_Final Maxclusters = 9 Maxiter= 20 Converge = 0 Mean= Mean_Clusters_9 Out=Clusters_9 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Clusters_9 ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_9 ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_9 ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_9 ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Iteration 3 : Maxclusters = 10 * ;
Proc FastClus Data = Store_6_Final Maxclusters = 10 Maxiter= 20 Converge = 0 Out=Clusters_10 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Clusters_10 ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Merging the data sets for analysis of the final clusters formed* ;
Proc Sort Data = Stores_1 ;
By Store_Num ;
Run;
Data Stores_1_Final ;
Set Stores_1 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
Data Cluster_9_Final ;
Set Clusters_9 ;
Keep Store_Num Cluster ;
Run;
Proc Sort Data = Cluster_9_Final ;
By Store_Num ;
Run;
Data Stores_1_Final_Merged ;
Merge Stores_1_Final (in=a) Cluster_9_Final (in=b);
By Store_Num ;
If a and b ;
Run;
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
Cluster Summary
                               Maximum Distance                     Distance Between
                    RMS Std    from Seed to      Radius    Nearest  Cluster
Cluster  Frequency  Deviation  Observation       Exceeded  Cluster  Centroids         Ratio
1        46         0.8678     3.424                       3        3.3888            3.91
2        57         0.9084     3.2855                      6        3.7655            4.15
3        83         0.819      2.9486                      5        2.8481            3.48
4        41         0.7917     2.8507                      6        2.8616            3.61
5        99         0.8254     2.6116                      3        2.8481            3.45
6        67         0.773      3.0078                      4        2.8616            3.70
7        81         0.7999     2.9993                      4        2.9345            3.67
8        7          0.8229     2.68                        9        3.2939            4.00
9        30         0.7436     2.6002                      8        3.2939            4.43
• Ratio has been calculated using the ‘Difference in Centroids’ method as D / d1 where:
D = Average distance between cluster centroids
d1 = Average distance between members and cluster centroid
• Thus, the ratio signifies the strength of the clusters formed: it measures the homogeneity within a cluster relative to its heterogeneity from the other clusters
• Cluster 9 is the strongest cluster (ratio 4.43), followed by Cluster 2 (4.15) and Cluster 8 (4.00)
The 9 clusters obtained in the preliminary cluster analysis have been evaluated and profiled below in order to gain insights into the variables that dominate the cluster formation:
(Detailed output on path ‘Y:Assignment – ClusteringIteration_2_Maxclust_9.xlsx’)
Cluster=1
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 46 39.92 38.41 8.82 0.17
PCAT2 46 27.07 24.95 8.25 0.26
PCAT3 46 12.58 13.73 4.80 0.24
PCAT4 46 20.42 22.91 4.70 0.53
Avg_Sales_Final 46 272.14 208.80 48.77 1.30
Cluster=2
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 57 33.41 38.41 8.82 0.57
PCAT2 57 20.16 24.95 8.25 0.58
PCAT3 57 16.95 13.73 4.80 0.67
PCAT4 57 29.49 22.91 4.70 1.40
Avg_Sales_Final 57 142.44 208.80 48.77 1.36
Cluster=3
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 83 39.70 38.41 8.82 0.15
PCAT2 83 26.23 24.95 8.25 0.16
PCAT3 83 13.58 13.73 4.80 0.03
PCAT4 83 20.49 22.91 4.70 0.52
Avg_Sales_Final 83 236.59 208.80 48.77 0.57
Cluster=4
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 41 31.45 38.41 8.82 0.79
PCAT2 41 20.68 24.95 8.25 0.52
PCAT3 41 21.16 13.73 4.80 1.55
PCAT4 41 26.71 22.91 4.70 0.81
Avg_Sales_Final 41 183.11 208.80 48.77 0.53
Cluster=5
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 99 37.08 38.41 8.82 0.15
PCAT2 99 28.40 24.95 8.25 0.42
PCAT3 99 12.62 13.73 4.80 0.23
PCAT4 99 21.90 22.91 4.70 0.22
Avg_Sales_Final 99 207.18 208.80 48.77 0.03
Cluster=6
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
Cluster=7
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Cluster=8
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Cluster=9
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
1. 7% of the total stores have their avg sales per sq. foot significantly higher than the overall average (Cluster 8: 7 stores; Cluster 9: 30 stores).
2. 11% of the total stores have significantly higher than overall average sales in the category of Tobacco & Alcohol (Cluster 2: 57 stores).
3. 16% of the total stores have lower than the overall average sales in the category of Tobacco & Alcohol, though the difference is not significant (Cluster 3: 83 stores).
4. 32% of the total stores have higher than overall average sales in the category of Frozen Foods (Cluster 5: 99 stores; Cluster 6: 67 stores). Average sales in Cluster 6 for the category of Frozen Foods is significantly higher than the overall mean sales for the same category.
The FREQ Procedure
Table of CLUSTER by State
Each cell shows Frequency, with Percent of all 511 stores in brackets.
Cluster  KA            TN            Total
1        30 (5.87)     16 (3.13)     46 (9.00)
2        33 (6.46)     24 (4.70)     57 (11.15)
3        44 (8.61)     39 (7.63)     83 (16.24)
4        25 (4.89)     16 (3.13)     41 (8.02)
5        49 (9.59)     50 (9.78)     99 (19.37)
6        37 (7.24)     30 (5.87)     67 (13.11)
7        43 (8.41)     38 (7.44)     81 (15.85)
8        3 (0.59)      4 (0.78)      7 (1.37)
9        16 (3.13)     14 (2.74)     30 (5.87)
Total    280 (54.79)   231 (45.21)   511 (100.00)
The overall bias in the number of stores is towards the state KA with 55% of the total stores being in KA.
No other significant pattern in the distribution of stores has emerged.
Analysis Variable: Size
Cluster N Obs N Mean Population Mean Population Std Dev Z-Score Minimum Maximum
1 46 46 2471.52 2942.32 423.52 1.11 1700 2910
2 57 57 3334.74 2942.32 423.52 0.93 2550 3650
3 83 83 2761.39 2942.32 423.52 0.43 1925 3330
4 41 41 3040.98 2942.32 423.52 0.23 2180 3650
5 99 99 2985.56 2942.32 423.52 0.10 2000 3610
6 67 67 3184.63 2942.32 423.52 0.57 2200 3630
7 81 81 3172.10 2942.32 423.52 0.54 2600 3650
8 7 7 1977.14 2942.32 423.52 2.28 1550 2150
9 30 30 2205.33 2942.32 423.52 1.74 1750 2520
Approx. 7% of all the stores (Clusters 8 and 9) have a mean size significantly lower than the overall mean size of all the stores.
The split of these stores between the two states is roughly the same and no discernible pattern is observed.
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9
clusters)
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
SAS Code:
*Profiling the 9 clusters obtained in the preceding procedures* ;
Proc Sort Data = Stores_1_Final_Merged ;
By Cluster ;
Run;
Data Stores_1_Final_Merged ;
Set Stores_1_Final_Merged;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Means Data = Stores_1_Final_Merged N ;
Class State ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_1_Final_Merged ;
Tables Cluster * State / nocol norow ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var Size ;
Class Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged Mean Std ;
Var Size ;
Run;
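The profiling steps above boil down to three quantities per variable: the mean within each cluster, the mean and standard deviation over all stores, and the gap between the two in standard deviations. A minimal Python sketch of that logic follows. Python is used only for illustration, the records are hypothetical toy values rather than store data, and the function name `profile` is an assumption of this sketch.

```python
from statistics import mean, stdev

# Hypothetical toy records (cluster id, value); not taken from the store data.
records = [(1, 10.0), (1, 14.0), (2, 20.0), (2, 24.0), (2, 22.0)]

def profile(records):
    """Per-cluster N, mean and absolute z-score against the whole population."""
    values = [v for _, v in records]
    pop_mean = mean(values)
    pop_std = stdev(values)  # sample std dev (n - 1), the PROC MEANS default for Std
    rows = {}
    for cluster in sorted({c for c, _ in records}):
        members = [v for c, v in records if c == cluster]
        m = mean(members)
        rows[cluster] = (len(members), m, abs(m - pop_mean) / pop_std)
    return pop_mean, pop_std, rows

pop_mean, pop_std, rows = profile(records)
for cluster, (n, m, z) in rows.items():
    print(cluster, n, m, round(z, 2))
```

In the deck the same three pieces come from separate PROC MEANS calls (one with By Cluster, one without), and the Z-Score column is then derived from their output.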
2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER)
With the 9 clusters, treated for outliers, obtained from the preliminary cluster analysis using the PROC FASTCLUS procedure (K-Means method for clustering), Hierarchical Clustering is performed next using the PROC CLUSTER procedure to obtain the final no. of clusters.
The following methods are used for Hierarchical Clustering:
S.No Method # of Clusters obtained Remarks
1 Ward's Method 3
• The scatter diagram of the clusters obtained revealed cluster formations that were not well demarcated.
• Also, the profiling of these 3 clusters didn't reveal any variable that was dominant in the formation of the clusters.
• For detailed results, refer to the tabs named 'Output_Wards' & 'Output_Final_Profiling_W'.
2 Density Method
• Ties were observed while the Density method was used. Based on the position of the Ties in the Cluster History, the clusters obtained when K=7 were finalized.
K=7 5 For detailed results, refer to the tabs named 'Output_Density_K7' & 'Output_Final_Profiling_D'.
K=8 4 For detailed results, refer to the tab named 'Output_Density_K8'.
K=9 5 For detailed results, refer to the tab named 'Output_Density_K9'.
Note:
K, the smoothing parameter of the Density method (the no. of neighbours PROC CLUSTER uses for kth-nearest-neighbour density estimation), has been chosen here with reference to the no. of clusters obtained in the preliminary cluster analysis.
Literature suggests using n^0.3 preliminary clusters where n=no. of observations in the original data set.
Hence, n^0.3 = 511^0.3 = 6.5. So analysis in the Density method has been done for 7<=K<=9.
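The range 7<=K<=9 in the note follows from simple arithmetic on the n^0.3 rule of thumb. A small Python sketch of that calculation (Python is used only as a cross-check outside the SAS workflow; the variable names are illustrative):

```python
import math

n_stores = 511                   # stores remaining after outlier treatment
k_guideline = n_stores ** 0.3    # rule of thumb quoted in the note
k_min = math.ceil(k_guideline)   # smallest whole number at or above the guideline
k_max = 9                        # no. of preliminary clusters actually formed
k_values = list(range(k_min, k_max + 1))

print(round(k_guideline, 2), k_values)
```

The guideline lands between 6 and 7, so 7 is the lowest value tried and the 9 preliminary clusters set the upper end.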
2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’ (refer to tab named ‘Output_Wards’)
The Cluster History, from the Ward’s method, is as below:
Cluster History
Number Of Clusters | First Cluster Joined | Second Cluster Joined | Frequency Of New Cluster | Semipartial RSq | RSquared | Pseudo F Statistic | Pseudo t-squared | Approximate Expected RSq | Cubic Clustering Criterion | Tie
8 | 8 | 9 | 37 | 0.0054 | 0.99 | 13,000 | . | . | . |
7 | 4 | 6 | 108 | 0.0184 | 0.98 | 3436 | . | . | . |
6 | 1 | 3 | 129 | 0.0301 | 0.95 | 1772 | . | . | . |
5 | CL7 | 7 | 189 | 0.0322 | 0.91 | 1343 | 327 | . | . |
4 | CL5 | 5 | 288 | 0.0438 | 0.87 | 1131 | 248 | . | . |
3 | 2 | CL4 | 345 | 0.0952 | 0.77 | 874 | 346 | . | . |
2 | CL6 | CL8 | 166 | 0.1266 | 0.65 | 938 | 585 | . | . |
1 | CL2 | CL3 | 511 | 0.6483 | 0 | . | 938 | 0 | 0 |
# of clusters according to:
Pseudo T-Square: 3, 2
Semipartial R-Square: 8, 7, 6, 5, 4, 3
Therefore, final # of clusters considered on the basis of the results of Ward's Method = 3
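In PROC CLUSTER output the semipartial R-square of a fusion is the drop in R-square caused by that merge, so the RSquared column of the Ward's cluster history can be rebuilt by subtracting the Semipartial RSq values cumulatively from 1. A small Python cross-check using the values reported in the history (Python is used only for illustration; the variable names are assumptions of this sketch):

```python
# Semipartial R-square of each fusion, keyed by the no. of clusters left after the merge
sprsq = {8: 0.0054, 7: 0.0184, 6: 0.0301, 5: 0.0322,
         4: 0.0438, 3: 0.0952, 2: 0.1266, 1: 0.6483}

r_squared = {}
running = 1.0  # with all 9 preliminary clusters kept separate, R-square is 1
for n_clusters in sorted(sprsq, reverse=True):
    running -= sprsq[n_clusters]
    # adding 0.0 turns a possible -0.0 from rounding into a plain 0.0
    r_squared[n_clusters] = round(running, 2) + 0.0

for n_clusters, value in r_squared.items():
    print(n_clusters, value)
```

The same subtraction makes the cost of each merge visible: going from 3 clusters to 2 gives up 0.1266 of R-square, and the final merge gives up the remaining 0.6483.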
[Figure: Tree diagram from the Ward’s method]
[Figures: Plots obtained from the Ward’s method]
The 3 clusters obtained from the Ward’s method have been profiled as below:
Analysis Variable: Cluster_Final
Cluster_Final N Obs N
1 103 103
2 242 242
3 166 166
Cluster_Final=1
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 103 36.32 38.41 8.82 0.24
PCAT2 103 23.25 24.95 8.25 0.21
PCAT3 103 15.00 13.73 4.80 0.26
PCAT4 103 25.44 22.91 4.70 0.54
Avg_Sales_Final 103 200.36 208.80 48.77 0.17
Cluster_Final=2
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 242 41.74 38.41 8.82 0.38
PCAT2 242 21.87 24.95 8.25 0.37
PCAT3 242 14.46 13.73 4.80 0.15
PCAT4 242 21.94 22.91 4.70 0.21
Avg_Sales_Final 242 222.93 208.80 48.77 0.29
Cluster_Final=3
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 166 34.85 38.41 8.82 0.40
PCAT2 166 30.50 24.95 8.25 0.67
PCAT3 166 11.88 13.73 4.80 0.39
PCAT4 166 22.76 22.91 4.70 0.03
Avg_Sales_Final 166 193.45 208.80 48.77 0.31
Table of Cluster_Final by State
Each cell shows Frequency, with Percent of all 511 stores in brackets.
Cluster_Final  KA            TN            Total
1              63 (12.33)    40 (7.83)     103 (20.16)
2              131 (25.64)   111 (21.72)   242 (47.36)
3              86 (16.83)    80 (15.66)    166 (32.49)
Total          280 (54.79)   231 (45.21)   511 (100.00)
Analysis Variable: Size
Cluster_Final Mean Size Popltn Mean Popltn Std. Dev Z-Score
1 2949.22 2942.32 423.52 0.016
2 2854.61 2942.32 423.52 0.207
3 3065.90 2942.32 423.52 0.292
Conclusion:
• Thus, both the graphical plots as well as the summary stats of the 3 clusters obtained using the Ward’s method reveal no clear cluster formation.
• As such, no particular variable has been found dominating in any of the 3 cluster formations.
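The sizes of the 3 final clusters profiled above follow from how the 9 preliminary clusters are mapped to them in the deck's SAS code (the IF/ELSE step that creates Cluster_Final_W). A short Python sketch that applies the same mapping to the preliminary cluster counts (Python is used only as a cross-check; the dictionary names are illustrative):

```python
# Stores per preliminary cluster, from the FREQ table of CLUSTER by State
preliminary_counts = {1: 46, 2: 57, 3: 83, 4: 41, 5: 99, 6: 67, 7: 81, 8: 7, 9: 30}

# Preliminary cluster -> final cluster, mirroring the IF/ELSE mapping in the SAS code
ward_mapping = {1: 1, 2: 1, 5: 3, 6: 3, 3: 2, 4: 2, 7: 2, 8: 2, 9: 2}

final_counts = {}
for prelim, n in preliminary_counts.items():
    final = ward_mapping[prelim]
    final_counts[final] = final_counts.get(final, 0) + n

total = sum(final_counts.values())
shares = {c: round(100 * n / total, 2) for c, n in final_counts.items()}

for cluster in sorted(final_counts):
    print(cluster, final_counts[cluster], shares[cluster])
```

The counts and shares agree with the N Obs and Percent figures in the profiling tables.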
SAS Code:
/* Hierarchical Clustering procedure being performed on the 9 preliminary clusters obtained using K-Means */
/* The data set on which K-Means clustering was performed to obtain the 9 preliminary clusters has been treated for outliers and hence doesn't contain any outliers */
/* Ward's Method (block comments are used here because an unmatched apostrophe is not allowed in a comment statement that begins with an asterisk) */
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_W Method = Ward CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_W Horizontal Lines=(color=blue)
out = Tree_Out_9_W nclusters = 3 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_W ;
Run;
/* Profiling of the Clusters formed using Ward's Method */
/* Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary cluster analysis have been mapped to the final 3 clusters obtained by the Ward's method */
Data Stores_Final_Analysis_W ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 Then Cluster_Final_W = 1 ;
Else If Cluster = 5 OR Cluster = 6 Then Cluster_Final_W = 3 ;
Else If Cluster = 3 OR Cluster= 4 OR Cluster=7 OR Cluster= 8 OR Cluster= 9 Then Cluster_Final_W = 2 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_W ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N;
Var Cluster_Final_W;
Class Cluster_Final_W ;
Run;
Data Stores_Final_Analysis_W ;
Set Stores_Final_Analysis_W;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_W ;
Tables Cluster_Final_W*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var Size ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W Mean Std ;
Var Size ;
Run;
Legend1 Frame Cframe = ligr cborder=black
position=center value=(justify=center) ;
Axis1 label=(angle=90 rotate=0) minor=none ;
Axis2 minor=none ;
Proc Gplot ;
Plot PCAT1 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT2 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT3 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT4 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The output for the Density method discussed below and in the following slides is for K=7.
(Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’. Refer to the tab named ‘Output_Density_K7’. For the output when K=8 & K=9, refer to the tabs named ‘Output_Density_K8’ & ‘Output_Density_K9’.)
Cluster History
Number Of Clusters | First Cluster Joined | Second Cluster Joined | Frequency Of New Cluster | Semipartial RSq | RSquared | Pseudo F Statistic | Pseudo t-squared | Approximate Expected RSq | Cubic Clustering Criterion | Normalized Fusion Density | Lesser Density | Greater Density | Tie
8 | 3 | 5 | 182 | 0.0324 | 0.97 | 2147 | . | . | . | 61.799 | 44.7166 | 100 |
7 | CL8 | 7 | 263 | 0.0826 | 0.89 | 647 | 665 | . | . | 38.79 | 24.0617 | 100 |
6 | CL7 | 1 | 309 | 0.1255 | 0.76 | 319 | 335 | . | . | 35.798 | 21.8011 | 100 | T
5 | CL6 | 4 | 350 | 0.0541 | 0.71 | 303 | 78.3 | . | . | 35.798 | 21.8011 | 100 |
4 | CL5 | 6 | 417 | 0.0911 | 0.61 | 269 | 128 | . | . | 26 | 14.9422 | 100 |
3 | CL4 | 2 | 474 | 0.1869 | 0.43 | 190 | 229 | . | . | 7.2274 | 3.7492 | 100 |
2 | CL3 | 9 | 504 | 0.3124 | 0.12 | 66.2 | 274 | . | . | 6.0544 | 3.1217 | 100 |
1 | CL2 | 8 | 511 | 0.1151 | 0 | . | 66.2 | 0 | 0 | 2.1174 | 1.07 | 100 |
# of clusters according to:
Pseudo T-Square: 5, 4
Semipartial R-Square: 8, 7, 5, 4
Therefore, final # of clusters considered in this iteration = 5
Since the Tie occurs in the early history of the cluster formation, it should have only a little effect on the later stages and hence can be overlooked.
[Figure: Tree diagram from the Density method when K=7]
[Figures: Plots obtained from the Density method when K=7]
The following elaborates on the profiles of the final 5 clusters obtained from the Density method:
Legend:
Cat1 Fresh Foods
Cat2 Frozen Foods
Cat3 Health & Beauty
Cat4 Tobacco & Alcohol
Analysis Variable: Cluster_Final_D
Cluster_Final_D N Obs N
1 326 326
2 67 67
3 81 81
4 7 7
5 30 30
Cluster_Final_D=1
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 326 36.80 38.41 8.82 0.18
PCAT2 326 25.25 24.95 8.25 0.04
PCAT3 326 14.69 13.73 4.80 0.20
PCAT4 326 23.26 22.91 4.70 0.07
Avg_Sales_Final 326 209.48 208.80 48.77 0.01
• No particular variable has emerged as a dominating variable responsible for the formation of this cluster.
• Mean values of the variables in this cluster are very near to the overall mean scores of the variables in the data set.
Cluster_Final_D=2
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
PCAT2 has emerged as the dominating and most determining variable in the formation of this cluster, which holds nearly 13% of the total no. of stores: its mean PCAT2 is 33.61% against the overall mean of about 25%.
Cluster_Final_D=3
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Both PCAT1 and PCAT2 have emerged as the dominating variables in Cluster 3, which holds nearly 16% of the total no. of stores: the mean of PCAT1 is well above, and the mean of PCAT2 well below, the overall mean of the respective category.
Cluster_Final_D=4
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Avg Sales per Sq. Foot has emerged as the dominating variable in Cluster 4, which holds nearly 1.4% of the total no. of stores: its mean avg sales per sq. foot is significantly higher than the overall mean avg sales per sq. foot.
Cluster_Final_D=5
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
Avg Sales per Sq. Foot has emerged as the dominating variable in Cluster 5, which holds nearly 6% of the total no. of stores: its mean avg sales per sq. foot is significantly higher than the overall mean avg sales per sq. foot.
The FREQ Procedure
Table of Cluster_Final_D by State
Each cell shows Frequency, with Percent of all 511 stores in brackets.
Cluster_Final_D  KA            TN            Total
1                181 (35.42)   145 (28.38)   326 (63.80)
2                37 (7.24)     30 (5.87)     67 (13.11)
3                43 (8.41)     38 (7.44)     81 (15.85)
4                3 (0.59)      4 (0.78)      7 (1.37)
5                16 (3.13)     14 (2.74)     30 (5.87)
Total            280 (54.79)   231 (45.21)   511 (100.00)
No specific pattern has emerged in the state-wise analysis of the clusters formed.
Analysis Variable: Size
Cluster_Final Mean Size Popltn Mean Popltn Std. Dev Z-Score
1 2923.97 2942.32 423.52 0.04
2 3184.63 2942.32 423.52 0.57
3 3172.10 2942.32 423.52 0.54
4 1977.14 2942.32 423.52 2.28
5 2205.33 2942.32 423.52 1.74
• The average size of the stores in Cluster 4 is much lower than the overall average size of the stores in the given data set.
• Hence, the avg sales per sq. foot for stores in this cluster is also significantly higher than the overall average sales per sq. foot.
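Because the Z-Score columns report magnitudes only, the opposite directions of store size and sales per sq. foot in Clusters 4 and 5 are easier to see with signed z-scores. A minimal Python sketch using the reported means (Python is used only as a cross-check; `signed_z` and the constant names are illustrative, not part of the SAS workflow):

```python
def signed_z(cluster_mean, population_mean, population_std):
    """Signed z-score: negative when the cluster mean is below the population mean."""
    return (cluster_mean - population_mean) / population_std

# cluster -> (mean Size, mean Avg_Sales_Final), as reported for the Density clusters
clusters = {4: (1977.14, 351.02), 5: (2205.33, 316.83)}
SIZE_POP = (2942.32, 423.52)   # population mean and std dev of Size
SALES_POP = (208.80, 48.77)    # population mean and std dev of Avg_Sales_Final

signed = {
    c: (round(signed_z(size, *SIZE_POP), 2), round(signed_z(sales, *SALES_POP), 2))
    for c, (size, sales) in clusters.items()
}

for c, (z_size, z_sales) in signed.items():
    print(f"Cluster {c}: size z = {z_size}, sales per sq. foot z = {z_sales}")
```

Both clusters sit well below the mean on size and well above it on sales per sq. foot. Since the inputs here are already rounded to two decimals, the last digit can differ slightly from the deck's tables.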
SAS Code:
*Density Method* ;
*K, the smoothing parameter, is the no. of neighbours used for kth-nearest-neighbour density estimation and has been chosen here with reference to the no. of preliminary clusters* ;
*Literature suggests using n^0.3 preliminary clusters where n=no. of observations in the original data set*;
*Hence, n^0.3 = 511^0.3 = 6.5 . So analysis in the Density method has been done for 7<=K<=9* ;
*K = 7* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D7 Method = Density K=7 CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_D7 Horizontal Lines=(color=blue)
out = Tree_Out_9_D7 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_D7 ;
Run;
*Profiling of the Clusters formed using Density Method for K=7* ;
*Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary cluster
analysis have been mapped to the final 5 clusters obtained by the Density method* ;
Data Stores_Final_Analysis_D ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 OR Cluster = 3 OR Cluster = 4 OR Cluster = 5 Then Cluster_Final_D = 1 ;
Else If Cluster = 6 Then Cluster_Final_D = 2 ;
Else If Cluster = 7 Then Cluster_Final_D = 3 ;
Else If Cluster = 8 Then Cluster_Final_D = 4 ;
Else If Cluster = 9 Then Cluster_Final_D = 5 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_D ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N;
Var Cluster_Final_D;
Class Cluster_Final_D ;
Run;
Data Stores_Final_Analysis_D ;
Set Stores_Final_Analysis_D ;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_D ;
Tables Cluster_Final_D*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var Size ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D Mean Std ;
Var Size ;
Run;
Proc Export Data = Stores_Final_Analysis_D Outfile = 'Y:Assignment - ClusteringStores_Final_Analysis_D.csv'
DBMS= CSV Replace ;
Run;
Legend1 Frame Cframe = ligr cborder=black
position=center value=(justify=center) ;
Axis1 label=(angle=90 rotate=0) minor=none ;
Axis2 minor=none ;
Proc Gplot ;
Plot PCAT1 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT2 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT3 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot ;
Plot PCAT4 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
*K = 8* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D8 Method = Density K=8 CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
Run;
Proc Tree Data = Tree_9_D8 Horizontal Lines=(color=blue)
out = Tree_Out_9_D8 nclusters=4 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Proc Print Data = Tree_Out_9_D8 ;
Run;
*K = 9* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D9 Method = Density K=9 CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
Run;
Proc Tree Data = Tree_9_D9 Horizontal Lines=(color=blue)
out = Tree_Out_9_D9 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Proc Print Data = Tree_Out_9_D9 ;
Run;
3. SUMMARY OF INSIGHTS
1. 16% of the stores (Cluster 3: 81 stores) have their mean sales from the Fresh Food category higher than the overall average in this category.
2. 13% of the stores (Cluster 2: 67 stores) have their mean sales from the Frozen Food category higher than the overall average in this category.
3. 16% of the stores (Cluster 3: 81 stores) have their mean sales from the Frozen Food category lower than the overall average in this category.
4. The % sales from the category of Health & Beauty in all the clusters formed above is close to the overall mean sales of this category.
5. Only about 6% of the total no. of stores (Cluster 5: 30 stores) have their mean sales from the category Tobacco & Alcohol lower than the overall mean sales of this category.
6. 7% of the total stores (Clusters 4 and 5: 7 + 30 = 37 stores) have their average sales per sq. foot significantly higher than the overall average. The difference is particularly pronounced for stores in Cluster 4, in which the average size of the stores is also much lower than the overall average size.
7. 29% of the total stores (Clusters 2 and 3: 67 + 81 = 148 stores) have their average sales per sq. foot significantly lower than the overall average.
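The percentages quoted in the insights above are shares of the 511 stores held by the relevant Density-method clusters. A short Python sketch of that arithmetic (Python is used only as a cross-check; the helper `share` and the dictionary names are illustrative):

```python
# Stores in each final Density-method cluster, as profiled earlier
density_counts = {1: 326, 2: 67, 3: 81, 4: 7, 5: 30}
total_stores = sum(density_counts.values())

def share(*clusters):
    """Percentage of all stores that fall in the given clusters."""
    return 100 * sum(density_counts[c] for c in clusters) / total_stores

insight_shares = {
    "Cluster 3": round(share(3), 1),
    "Cluster 2": round(share(2), 1),
    "Cluster 5": round(share(5), 1),
    "Clusters 4 and 5": round(share(4, 5), 1),
    "Clusters 2 and 3": round(share(2, 3), 1),
}

for label, pct in insight_shares.items():
    print(f"{label}: {pct}%")
```

Rounded to whole numbers, these give the 16%, 13%, 6%, 7% and 29% figures used in the summary.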
4. RECOMMENDATIONS
1. Cluster 3
a The average size of stores in this cluster is higher than the average size of all stores, though the difference is not significant.
b Compared with the overall mean for all stores, the contribution to revenue from the Fresh Foods category is highest for stores in this cluster.
c However, compared with the overall mean for all stores, the contribution to revenue from the Frozen Foods category is lowest for stores in this cluster.
d The average sales per sq. foot of stores in this cluster is also lower than the overall average sales per sq. foot of all stores.
e The above observations imply that although the Fresh Foods category contributes the most to sales, this contribution is perhaps not enough to lift the overall sales of these stores, which remain below the average of all stores despite their greater size.
There is therefore a need to adopt techniques such as better placement of such products or a promotional campaign targeted specifically at products in this category.
Strategies may also be devised for promoting sales from the Frozen Foods category, as they are significantly lower than the overall average sales of this category in other stores.
One possibility is that sales from the Fresh Foods category are cannibalizing sales from the Frozen Foods category, in which case an alternative shelf placement is required.
2. Cluster 2
a Stores in this cluster show a situation that contrasts with that of stores in Cluster 3.
b Sales from the Frozen Foods category contribute the most to the overall revenue of stores in this cluster and are greater than the overall mean sales from this category in all other stores, whereas sales from the Fresh Foods category are lower than the overall mean sales from this category in other stores.
c The average size of stores in this cluster is roughly the same as the size of stores in Cluster 3 and is higher than the overall mean size of other stores.
d Also, the average sales per sq. foot is lower than the overall average sales per sq. foot of other stores. It is also lower than the average sales per sq. foot of stores in Cluster 3.
e Hence, strategies similar to those to be adopted for stores in Cluster 3 may also be replicated for stores in Cluster 2 for promoting sales from both the Fresh Foods category and the Frozen Foods category.
This may be done after gaining insights into the factors that are driving the Frozen Foods sales in stores of Cluster 2 and the Fresh Foods sales in stores of Cluster 3.
3. Cluster 1
a Cluster 1 holds the largest no. of stores, roughly 64% of the total.
b Sales from all 4 categories of products in stores of this cluster are very close to the overall mean sales of each of the four categories across all the stores.
c The average size of the stores in this cluster is also very close to the overall average size of all stores.
d Since this cluster has the highest and a significant % of the stores, promotional activities adopted for all these stores can perhaps also help in significantly increasing the overall sales volume of Retailer X.
4. Cluster 4
a This cluster houses only about 1% of the total stores, with the only differentiating factor being the average sales per sq. foot, which is significantly higher than the overall average for other stores.
b The mean sales of products in each of the 4 categories are very similar to the overall mean sales of those categories.
c Hence, the only likely reason for a significantly higher average sales per sq. foot is the lower than overall average size of the stores.
No specific state-wise pattern has emerged for these stores, with the distribution being fairly consistent in both the states, KA & TN.
5. Cluster 5
a This cluster houses nearly 6% of the total stores.
b The distribution of variables for stores in this cluster is similar to that of stores in Cluster 4.
c However, it may be noted that sales from the Tobacco & Alcohol category are lower than the overall mean sales of other stores from this category.
6. Having identified the drivers of sales in stores of each of the 5 clusters, it is next important to understand the other factors that influence each of these drivers.
Inclusion of demographic factors such as age, income, location, gender etc. as additional variables may give better insights into the promotional strategies, unique to each cluster, that may be adopted for increasing the sales.
END
  • 2. CONTENTS
      1. Objective
      2. Methodology
         a. Exploratory Data Analysis
         b. Data Preparation
            i. Scaling
            ii. Weighting
         c. Preliminary Cluster Analysis (PROC FASTCLUS)
            i. Creation of K preliminary clusters
            ii. Detection of Outliers
            iii. Treatment of Outliers
                 • Alternative treatment of outliers also attempted
            iv. Clustering of the data set treated for outliers
            v. Evaluating & Profiling of clusters formed in preliminary analysis
         d. Hierarchical Clustering (PROC CLUSTER)
            i. Ward’s Method
            ii. Density Method
      3. Summary of Insights
      4. Recommendations
  • 3. 1. OBJECTIVE
      A. Creation of 2 sets of clusters: K-Means & Hierarchical
      B. The clusters should be based on the mix of sales by:
         i. Category and
         ii. Avg. sales per sq. foot of space
  • 5. 2. METHODOLOGY a. Exploratory Data Analysis
      The MEANS Procedure
      Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
      Cat1       Cat1   515  0      120.00   231.82   340.00   66.61    119386.00
      Cat2       Cat2   515  0      52.00    150.82   247.00   56.66    77672.00
      Cat3       Cat3   515  0      33.00    81.60    212.00   28.44    42022.00
      Cat4       Cat4   515  0      90.00    134.37   166.00   20.21    69201.00
      Sale       Sale   515  0      380.00   598.60   838.00   83.49    308281.00
      Size       Size   515  0      1200.00  2933.45  3650.00  437.20   1510725.00
      Avg_Sales         515  0      0.11     0.21     0.50     0.05     108.34
      • Since the given variables Cat1 – Cat4 are in absolute terms, additional variables PCAT1 – PCAT4 were calculated as percentages of each store's total category sales, so that the category mix can be compared across stores.
      • Avg_Sales was also calculated as an additional variable:
        Avg_Sales = Sale / Size
  • 6. METHODOLOGY a. Exploratory Data Analysis: Overall analysis
      a. Mean sales from Category 1 (231.82) are the highest of the four categories, so Category 1 is the dominant category.
      b. However, the standard deviation of Category 1 sales (66.61) is also the highest of the four categories.
      c. The standard deviation of store size is 437.20 sq. feet (about 15% of the mean store size), which is on the higher side.
      d. The mean store size across both states is 2933 sq. feet; the maximum is 3650 sq. feet.
      e. Assuming that the Sale figures are in '000, the average sale per sq. foot across all categories in all stores is 210.
  • 7. 2. METHODOLOGY a. Exploratory Data Analysis
      SAS Code:
      **Creating additional variable: 'avg. sale per sq. foot' , PCAT1 PCAT2 PCAT3 PCAT4** ;
      Data Stores_1 ;
        Set Stores ;
        Avg_Sales = Sale / Size ;
      Run;
      Data Stores_1 ;
        Set Stores_1 ;
        PCAT1 = (Cat1 / (Cat1+Cat2+Cat3+Cat4))*100 ;
        PCAT2 = (Cat2 / (Cat1+Cat2+Cat3+Cat4))*100 ;
        PCAT3 = (Cat3 / (Cat1+Cat2+Cat3+Cat4))*100 ;
        PCAT4 = (Cat4 / (Cat1+Cat2+Cat3+Cat4))*100 ;
      Run;
  • 8. 2. METHODOLOGY a. Exploratory Data Analysis
      SAS Code:
      **Perform EDA** ;
      Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
        Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
      Run;
      Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
      Run;
      Proc Sort Data = Stores_1 ;
        By State ;
      Run;
      Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
        Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
        By State ;
      Run;
  • 9. 2. METHODOLOGY a. Exploratory Data Analysis
      SAS Code:
      **Perform EDA** ;
      Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
        By State ;
      Run;
      Proc FREQ Data = Stores_1 ;
        Table State ;
      Run ;
  • 10. METHODOLOGY a. Exploratory Data Analysis
      State=KA
      Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
      PCAT1             282  0      20.44    38.19    62.91    8.91     10769.81
      PCAT2             282  0      9.05     25.00    44.18    8.17     7049.84
      PCAT3             282  0      4.76     13.73    33.07    5.01     3872.52
      PCAT4             282  0      12.67    23.08    37.16    4.83     6507.84
      Sale       Sale   282  0      380.00   594.77   838.00   83.67    167724.00
      Size       Size   282  0      1550.00  2935.23  3650.00  424.51   827735.00
      Avg_Sales         282  0      0.12     0.21     0.45     0.05     58.83
      State=TN
      Variable   Label  N    NMiss  Minimum  Mean     Maximum  Std Dev  Sum
      PCAT1             233  0      20.58    38.70    59.25    8.67     9016.26
      PCAT2             233  0      9.06     24.86    40.86    8.37     5791.24
      PCAT3             233  0      4.86     13.81    25.23    4.72     3217.32
      PCAT4             233  0      12.05    22.64    38.03    4.56     5275.18
      Sale       Sale   233  0      395.00   603.25   796.00   83.21    140557.00
      Size       Size   233  0      1200.00  2931.29  3650.00  452.99   682990.00
      Avg_Sales         233  0      0.11     0.21     0.50     0.05     49.51
      • A state-wise analysis of the variables reveals more or less the same patterns for both states, KA & TN.
      • Category 1 remains the dominant category in both states.
      • Although the average store size is roughly the same in both states, the minimum store size (1200 sq. feet in TN against 1550 sq. feet in KA) shows that TN has a few smaller stores than KA.
  • 11. METHODOLOGY a. Exploratory Data Analysis
      (State-wise MEANS output for KA and TN as shown on slide 10.)
      • The ranking of the four product categories is the same in both states, i.e., sales of Cat1 > Cat2 > Cat4 > Cat3.
      • The mean sale per store in TN (603.25) is higher than in KA (594.77), though not significantly.
      • KA has more stores (282, or 55%) than TN (233, or 45%). The total volume of sales in KA (167724) is therefore higher than in TN (140557), as expected, even though the mean sale per store is slightly higher in TN.
      • It may therefore be inferred that there are possibly a few stores in TN that are smaller than the mean store size across both states and have a high average sale per sq. foot.
      • The average sale per sq. foot is roughly the same in both states (0.21).
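The state-level figures quoted on this slide can be checked from the N and Sum columns of the state-wise MEANS output. A minimal Python sketch (Python is used here purely for illustration, the deck's own code is SAS; the function and dictionary names are hypothetical):

```python
# N and total Sale per state, copied from the state-wise MEANS output.
state_totals = {
    "KA": {"n": 282, "sale_sum": 167724.0},
    "TN": {"n": 233, "sale_sum": 140557.0},
}


def mean_sale(stats):
    """Mean sale per store = Sum / N."""
    return stats["sale_sum"] / stats["n"]


def store_share(state, totals):
    """Percentage of all stores located in the given state."""
    total_stores = sum(s["n"] for s in totals.values())
    return 100.0 * totals[state]["n"] / total_stores


for state, stats in state_totals.items():
    print(state, round(mean_sale(stats), 2), round(store_share(state, state_totals), 1))
```

The two means differ by under 10 units against a standard deviation of roughly 83, which is consistent with the slide's reading that the difference is not significant.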
  • 13. 2. METHODOLOGY b. Data Preparation i. Scaling
      The data was scaled, i.e., the following variables were standardised to mean 0 and standard deviation 1 in order to bring them to a comparable level:
      a. PCAT1
      b. PCAT2
      c. PCAT3
      d. PCAT4
      e. Avg_Sales
      SAS Code:
      **SCALING in order to standardize the variables** ;
      Proc Standard Data = Stores_1 Mean = 0 Std = 1 Out = Store_2;
        Var PCAT1-PCAT4 Avg_Sales;
      Run;
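As a rough illustration of what standardising to mean 0 and standard deviation 1 does to a variable, here is a small Python sketch. It assumes the sample standard deviation (n - 1 denominator); the four input values are PCAT3 figures quoted elsewhere in the deck and serve only as sample data, and the function name is hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator), an assumption
    return [(v - m) / s for v in values]


# Sample data only: PCAT3 percentages of four stores.
pcat3_sample = [33.07, 4.76, 15.47, 20.11]
pcat3_scaled = standardize(pcat3_sample)
print([round(z, 3) for z in pcat3_scaled])
```

After this step every clustering variable is on the same scale, so none dominates the Euclidean distance merely because of its units.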
  • 14. 2. METHODOLOGY b. Data Preparation ii. Weighting
      The variable ‘Avg Sales Per Sq. Foot’ was weighted over several iterations as follows:
      Iteration #  Weight Assigned
      1            2
      2            3
      3            4
      4            5
      Summary of the results of the weighting iterations performed above:
      (Detailed results for all the iterations performed above are available on the path: ‘Y:Assignment - ClusteringWeighting’)
      (Max Distance = Maximum Distance from Seed to Observation; Centroid Distance = Distance Between Cluster Centroids; the Radius Exceeded column is blank in the output.)
      Cluster Summary: Iteration 1, W=2
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance  Ratio
      1        127        0.8789             4.2013                         3                3.3953             3.86
      2        3          1.1296             2.3982                         1                7.4713             6.61
      3        184        0.7815             3.2693                         4                2.5422             3.25
      4        201        0.9511             4.6159                         3                2.5422             2.67
      Cluster Summary: Iteration 2, W=3
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance  Ratio
      1        218        0.9394             4.5008                         2                3.0989             3.30
      2        212        1.0469             3.97                           1                3.0989             2.96
      3        82         0.9892             4.6592                         1                4.5079             4.56
      4        3          1.2361             2.6172                         3                10.062             8.14
  • 15. 2. METHODOLOGY b. Data Preparation
      (Max Distance = Maximum Distance from Seed to Observation; Centroid Distance = Distance Between Cluster Centroids; the Radius Exceeded column is blank in the output.)
      Cluster Summary: Iteration 3, W=4
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance  Ratio
      1        3          1.3713             2.8962                         4                13.3713            9.75
      2        226        1.1425             4.8821                         3                4.0108             3.51
      3        205        0.9991             4.4853                         2                4.0108             4.01
      4        81         1.1858             5.6984                         3                5.8659             4.95
      Cluster Summary: Iteration 4, W=5
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance  Ratio
      1        202        1.0857             4.4648                         2                4.95               4.56
      2        229        1.241              5.8923                         1                4.95               3.99
      3        3          1.5277             3.2196                         4                16.71              10.94
      4        81         1.4006             6.8317                         1                7.28               5.20
      The Ratio above has been calculated using the Difference in Centroids (M) method, where:
        M = D / d1
        D = Average distance between cluster centroids
        d1 = Average distance between cluster members and centroid
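One observation from the numbers, offered as a reading of the table rather than as the author's stated procedure: the Ratio column is reproduced to two decimals when the centroid-distance column is divided by the RMS Std Deviation column. A short Python sketch for Iteration 4 (W=5), with hypothetical names:

```python
# Iteration 4 (W = 5): cluster -> (RMS Std Deviation, centroid distance, reported Ratio)
iteration_4 = {
    1: (1.0857, 4.95, 4.56),
    2: (1.241, 4.95, 3.99),
    3: (1.5277, 16.71, 10.94),
    4: (1.4006, 7.28, 5.20),
}


def separation_ratio(rms_std, centroid_distance):
    """Between-cluster separation relative to within-cluster spread (M = D / d1)."""
    return centroid_distance / rms_std


for cluster, (rms_std, distance, reported) in iteration_4.items():
    computed = round(separation_ratio(rms_std, distance), 2)
    print(cluster, computed, reported)
```

A larger ratio means the cluster is far from its nearest neighbour relative to its own spread; the three-store cluster scores highest in every iteration because it sits far from the rest of the data.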
  • 16. 2. METHODOLOGY b. Data Preparation
      SAS Code:
      *1. Iteration 1 : Weight = 2* ;
      Data Store_3 ;
        Set Store_2 ;
        Avg_Sales2 = Avg_Sales*2 ;
      Run;
      **Running the clustering procedure based on K-Means** ;
      Proc FastClus Data = Store_3 Out = Cluster_1 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
        Var PCAT1-PCAT4 Avg_Sales2;
      Run;
      *2. Iteration 2 : Weight = 3* ;
      Data Store_4 ;
        Set Store_3 ;
        Avg_Sales3 = Avg_Sales*3 ;
      Run;
  • 17. 2. METHODOLOGY b. Data Preparation
      SAS Code:
      **Running the clustering procedure based on K-Means** ;
      Proc FastClus Data = Store_4 Out = Cluster_2 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
        Var PCAT1-PCAT4 Avg_Sales3;
      Run;
      *3. Iteration 3 : Weight = 4* ;
      Data Store_5 ;
        Set Store_4 ;
        Avg_Sales4 = Avg_Sales*4 ;
      Run;
      **Running the clustering procedure based on K-Means** ;
      Proc FastClus Data = Store_5 Out = Cluster_3 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
        Var PCAT1-PCAT4 Avg_Sales4;
      Run;
  • 18. 2. METHODOLOGY b. Data Preparation
      SAS Code:
      *4. Iteration 4 : Weight = 5* ;
      Data Store_6 ;
        Set Store_5 ;
        Avg_Sales5 = Avg_Sales*5 ;
      Run;
      **Running the clustering procedure based on K-Means** ;
      Proc FastClus Data = Store_6 Out = Cluster_4 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
        Var PCAT1-PCAT4 Avg_Sales5;
      Run;
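Why multiplying the standardised Avg_Sales by a weight changes the clustering can be seen from the Euclidean distance: a weight w stretches differences in that variable by a factor of w. A minimal Python sketch with two made-up standardised stores that differ only in Avg_Sales (all values and names are hypothetical):

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def apply_weight(row, weight, index=-1):
    """Return a copy of row with the variable at `index` multiplied by weight."""
    out = list(row)
    out[index] *= weight
    return out


# Hypothetical standardised stores: PCAT1..PCAT4 identical, Avg_Sales (last) differs.
store_a = [0.5, -0.2, 0.1, 0.0, 1.0]
store_b = [0.5, -0.2, 0.1, 0.0, -1.0]

for w in (1, 2, 3, 4, 5):
    d = euclidean(apply_weight(store_a, w), apply_weight(store_b, w))
    print(w, d)
```

With the weight at 5, a gap in sales per sq. foot counts five times as much as the same gap in any category share, which is how the weighting steers the clusters towards the second objective of the study.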
  • 20. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Creation of Preliminary Clusters
      For detailed results of the preliminary cluster analysis and diagnostic plots, please refer to the path: Y:Assignment - ClusteringPreliminary_Analysis_Outliers.xlsx
      (Max Distance = Maximum Distance from Seed to Observation; Centroid Distance = Distance Between Cluster Centroids; the Radius Exceeded column is blank in the output; "." marks a value that is undefined for a single-observation cluster.)
      Cluster Summary
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance
      1        14         0.7555             2.7913                         14               2.1932
      2        32         0.6678             3.2557                         5                1.9093
      3        1          .                  0                              11               4.3786
      4        60         0.8155             3.286                          19               2.2772
      5        28         0.6731             2.5255                         2                1.9093
      6        61         0.7495             2.7836                         19               2.6713
      7        1          .                  0                              13               4.3876
      8        67         0.7811             2.8286                         10               2.2277
      9        42         0.7811             2.4529                         4                2.8634
      10       46         0.6996             2.5278                         8                2.2277
      11       29         0.7186             2.5871                         5                2.2318
      12       1          .                  0                              13               3.9468
      13       1          .                  0                              12               3.9468
      14       28         0.6919             2.5985                         1                2.1932
      15       5          0.6989             1.9953                         18               2.3899
      16       21         0.6852             2.6402                         18               2.1146
      17       27         0.6957             2.2399                         5                2.198
      18       9          0.6001             1.7932                         16               2.1146
      19       29         0.7757             2.7927                         4                2.2772
      20       13         0.6923             2.3191                         16               2.3297
      Hence, clusters # 3, 7, 12 and 13 appear to be outliers, with only a single observation in each. The remaining clusters appear to be reasonably sized.
  • 21. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      The following are the details of the clusters that have been identified as outliers:
      Store_Num  CLUSTER
      36         3
      225        7
      360        12
      179        13
      Cluster=3
      Variable   Mean   Population Mean  Population Std Dev  Z-Score
      PCAT1      33.39  38.42            8.80                0.57
      PCAT2      12.32  24.93            8.26                1.53
      PCAT3      33.07  13.77            4.88                3.96
      PCAT4      21.22  22.88            4.71                0.35
      Avg_Sales  0.22   0.21             0.05                0.26
      Cluster=7
      Variable   Mean   Population Mean  Population Std Dev  Z-Score
      PCAT1      40.14  38.42            8.80                0.20
      PCAT2      32.03  24.93            8.26                0.86
      PCAT3      4.76   13.77            4.88                1.85
      PCAT4      23.08  22.88            4.71                0.04
      Avg_Sales  0.45   0.21             0.05                4.50
      Cluster=12
      Variable   Mean   Population Mean  Population Std Dev  Z-Score
      PCAT1      40.83  38.42            8.80                0.27
      PCAT2      31.60  24.93            8.26                0.81
      PCAT3      15.47  13.77            4.88                0.35
      PCAT4      12.09  22.88            4.71                2.29
      Avg_Sales  0.50   0.21             0.05                5.50
      Cluster=13
      Variable   Mean   Population Mean  Population Std Dev  Z-Score
      PCAT1      45.55  38.42            8.80                0.81
      PCAT2      15.66  24.93            8.26                1.12
      PCAT3      20.11  13.77            4.88                1.30
      PCAT4      18.68  22.88            4.71                0.89
      Avg_Sales  0.47   0.21             0.05                4.91
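The Z-Score column reads as the absolute distance of the cluster mean from the population mean, in population standard deviations. A small Python sketch for Cluster 3 (store 36) follows; it works from the rounded figures printed on the slide, so results can differ from the slide in the last decimal, and Avg_Sales is left out because its printed inputs (0.22, 0.21, 0.05) are too coarsely rounded to reproduce the reported value. Names are hypothetical.

```python
def abs_z_score(value, pop_mean, pop_std):
    """Absolute number of population standard deviations from the population mean."""
    return abs(value - pop_mean) / pop_std


# Cluster 3 (store 36): variable -> (cluster mean, population mean, population std dev)
cluster_3 = {
    "PCAT1": (33.39, 38.42, 8.80),
    "PCAT2": (12.32, 24.93, 8.26),
    "PCAT3": (33.07, 13.77, 4.88),
    "PCAT4": (21.22, 22.88, 4.71),
}

z_scores = {name: abs_z_score(*row) for name, row in cluster_3.items()}
most_extreme = max(z_scores, key=z_scores.get)
print(most_extreme, round(z_scores[most_extreme], 2))
```

For this store the Category 3 share is the variable furthest from the population mean, at roughly four standard deviations.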
  • 22. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      Store_Num  CLUSTER  Cat1  Cat2  Cat3  Cat4  Size  Sale  State  Avg_Sales  PCAT1  PCAT2  PCAT3  PCAT4
      36         3        214   79    212   136   2860  641   KA     224.13     33.39  12.32  33.07  21.22
      225        7        287   229   34    165   1600  715   KA     446.88     40.14  32.03  4.76   23.08
      360        12       314   243   119   93    1540  769   TN     499.35     40.83  31.60  15.47  12.09
      179        13       256   88    113   105   1200  562   TN     468.33     45.55  15.66  20.11  18.68
      • The average size of all the stores in the data set is 2933 sq. feet; stores # 225, 360 and 179 are considerably smaller.
      • Avg sales per sq. foot across all stores is 210, whereas for stores # 225, 360 and 179 it is more than double that overall mean. This is due to the smaller size of these stores compared with all other stores.
      • For store # 36, CAT3 accounts for 33% of the store's total sales, whereas the average share of CAT3 across all stores is approx. 14%.
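The derived columns for the four outlier stores can be recomputed from the raw figures in the table. A brief Python sketch, assuming (as the deck does) that Sale is in '000 so that sales per sq. foot is scaled by 1000; the function names and the `scale` parameter are hypothetical:

```python
# Raw figures for the four outlier stores, copied from the table above.
stores = {
    36: {"cats": (214, 79, 212, 136), "size": 2860, "sale": 641},
    225: {"cats": (287, 229, 34, 165), "size": 1600, "sale": 715},
    360: {"cats": (314, 243, 119, 93), "size": 1540, "sale": 769},
    179: {"cats": (256, 88, 113, 105), "size": 1200, "sale": 562},
}


def category_shares(cats):
    """PCAT1..PCAT4: each category as a percentage of the store's category total."""
    total = sum(cats)
    return [round(100.0 * c / total, 2) for c in cats]


def sales_per_sq_foot(sale, size, scale=1000):
    """Avg_Sales, scaled on the assumption that Sale is recorded in '000."""
    return round(scale * sale / size, 2)


for num, s in stores.items():
    print(num, category_shares(s["cats"]), sales_per_sq_foot(s["sale"], s["size"]))
```

For every one of these stores the four category figures add up to the Sale column, so the PCAT values are shares of total store sales.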
  • 23. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – Diagnostic Plots (plot images are not included in this transcript)
  • 24. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – Diagnostic Plots (plot images are not included in this transcript)
  • 25. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      SAS Code:
      **Performing the Clustering procedure using K-Means with iterations to determine the optimal no. of clusters** ;
      *Conducting a preliminary cluster analysis to detect outliers, if any* ;
      Proc Fastclus Data = Store_6 Out = Cluster_Prelim Maxclusters = 20 Converge = 0 Outstat=Stat_Prelim_0;
        Var PCAT1-PCAT4 Avg_Sales5;
      Run;
      Legend1 Frame Cframe = Ligr Label=none Cborder=Black Position = Center Value = (Justify=Center) ;
      Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
      Axis2 Minor = None ;
      Proc Gplot Data = Cluster_Prelim ;
        Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Cluster_Prelim ;
        Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
  • 26. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      SAS Code:
      Proc Gplot Data = Cluster_Prelim ;
        Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Cluster_Prelim ;
        Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      *Preparation of data set obtained from merging procedures in order to make a cluster wise analysis of the outliers, if any* ;
      Proc Sort Data = Cluster_Prelim ;
        By Cluster;
      Run;
      Data Cluster_Pre_1 ;
        Set Cluster_Prelim ;
        Keep Store_Num Cluster ;
      Run;
  • 27. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      SAS Code:
      Proc Export Data = Cluster_Pre_1 outfile = 'Y:Assignment - ClusteringCluster_Pre_1.csv' DBMS=CSV Replace ;
      Run;
      *Merging data set named Cluster_Pre_1 with data set Stores_1* ;
      Proc Sort Data = Cluster_Pre_1 ;
        By Store_Num ;
      Run;
      Proc Sort Data = Stores_1 ;
        By Store_Num;
      Run;
      Data Store_1_Merged ;
        Merge Cluster_Pre_1 (in=a) Stores_1 (in=b) ;
        By Store_Num ;
        If a and b ;
      Run;
      Proc Export Data = Store_1_Merged Outfile = 'Y:Assignment - ClusteringStore_1_Merged.csv' DBMS = CSV Replace ;
      Run;
  • 28. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
      SAS Code:
      Proc Sort Data = Store_1_Merged ;
        By Cluster ;
      Run;
      Proc Means Data = Store_1_Merged Mean ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
        By Cluster ;
        Where Cluster IN(3,7,12,13) ;
      Run;
      Proc Means Data = Stores_1 Mean Std ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
      Run;
      Proc Means Data = Stores_1 Mean ;
        Var Size Avg_Sales ;
      Run;
  • 29. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
      An alternative approach to the detection and treatment of outliers was also attempted. The steps undertaken were as follows:
      STEP 1: Run Proc FASTCLUS with many clusters and OUTSEED = output data set for the diagnostic plot
      (Detailed results are available on the following path: ‘Y:Assignment - ClusteringPrelim_Analysis_Step1_Mean1.xlsx’)
  • 30. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
      STEP 2: Remove low-frequency clusters. The data set MEAN1 generated in Step 1 was filtered to drop clusters with a frequency below 5; clusters with a frequency of 5 or more were retained for subsequent analysis in a data set named 'Seed1'.
      STEP 3: Proc FASTCLUS was run again, selecting seeds from the high-frequency clusters in the data set SEED1 obtained in Step 2, with the clustering criterion LEAST = 1. The value of LEAST should be < 2 in order to reduce the effect of outliers on the cluster centers.
      (Detailed results are available on the following path: ‘Y:Assignment - ClusteringPrelim_Analysis_Step3_LEAST.xlsx’)
      STEP 4: Proc FASTCLUS was run again, selecting seeds from the high-frequency clusters of the previous analysis, with STRICT = 3 to prevent outliers from distorting the results. The value STRICT = 3 was chosen to be close to the _GAP_ & _RADIUS_ values of the larger clusters in the diagnostic plots.
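The reason a LEAST value below 2 dampens outliers can be illustrated in one dimension: the centre that minimises squared deviations is the mean, while the centre that minimises absolute deviations (the LEAST = 1 criterion) is the median, which moves far less when an extreme store is added. A Python sketch under that simplification; the input figures are sales-per-sq.-foot values quoted elsewhere in the deck and are used only as sample data:

```python
from statistics import mean, median

# Sample data only: four typical sales-per-sq.-foot values plus one extreme store.
typical = [183.11, 207.18, 236.59, 173.18]
with_outlier = typical + [499.35]

mean_shift = mean(with_outlier) - mean(typical)        # centre under a squared-error criterion
median_shift = median(with_outlier) - median(typical)  # centre under an absolute-error criterion

print(round(mean_shift, 2), round(median_shift, 2))
```

In this sample the single extreme store drags the mean several times further than the median, which is the effect Step 3 relies on.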
  • 31. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
      However, the STRICT option is not supported for Proc FASTCLUS in the present version of WPS. Consequently, the final Proc FASTCLUS run, which would have assigned outliers and tails to clusters using the seeds generated with the STRICT option above, could not be performed.
      SAS Code:
      ***Another method for identification and treatment of outliers*** ;
      *STEP 1 : Run PROC FASTCLUS with many clusters and OUTSEED = output data set for diagnostic plot*;
      Proc Fastclus Data = Store_6 Outseed = Mean1 Maxclusters = 20 Maxiter = 0 Summary ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
      Axis1 Label = (Angle=90 Rotate=0) Minor=None Order=(0 to 10 by 2) ;
      Axis2 minor = None ;
      Proc Gplot Data = Mean1 ;
        Plot _GAP_*_FREQ_ _RADIUS_*_FREQ_ / Overlay Frame cframe = ligr vaxis = axis1 haxis=axis2 legend= legend1 ;
      Run;
  • 32. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
      SAS Code:
      *Step 2 :Remove Low Frequency clusters* ;
      Data Seed1 ;
        Set Mean1 ;
        If _FREQ_ >=5 ;
      Run;
      *Step 3 : Run Proc Fastclus again selecting seeds from high frequency clusters in previous analysis using LEAST = 1 Clustering Criterion since value < 2 reduce the effect of outliers on cluster centers* ;
      Proc FASTCLUS Data = Store_6 Seed = Seed1 Maxclusters = 8 Least = 1 Out = Store_6_Least ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
      Legend1 Frame Cframe = ligr Label = None CBorder = Black Position=Center Value= (Justify=Center) ;
      Axis1 Label =(Angle=90 Rotate=0) Minor=None ;
      Axis2 Minor=None ;
      Proc Gplot Data = Store_6_Least ;
        Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
      Run;
  • 33. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
      SAS Code:
      Proc Gplot Data = Store_6_Least ;
        Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Store_6_Least ;
        Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Store_6_Least ;
        Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
      Run;
      *Step 4 : Run Proc Fastclus again, selecting seeds from high frequency clusters in previous analysis with STRICT = to prevent the outliers from distorting the results* ;
      *Value of STRICT = is chosen to be close to the _GAP_ & _RADIUS_ values of the large clusters in the diagnostic plot* ;
      Proc Fastclus Data = Store_6 Seed = Seed1 Maxclusters = 8 Strict=3 out = Store_6_Strict Outseed = Mean2 ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
  • 34. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      Performing iterations to determine the appropriate no. of clusters using K-Means (PROC FASTCLUS)
      From Step 1 of the alternative outlier-detection method discussed on the preceding slides, it was found that 8 or more could be a good no. for meaningful cluster formation. Hence, the iterations below use Maxclusters = 8, 9 and 10.
      Clustering is performed on the data set from which the outliers have been removed.
      Iteration 1 : Maxclusters = 8
      Statistic used for comparison  Name (as it appears in the output)  Value_Iteration1
      Pseudo F Stat                  Pseudo F Statistics                 451.71
      Approx. Expected overall R^2   Overall R-Square                    0.84
      Detailed output and plots on path Y:Assignment - ClusteringIteration_1_Maxclust_8.xlsx
      Iteration 2 : Maxclusters = 9
      Statistic used for comparison  Name (as it appears in the output)  Value_Iteration2
      Pseudo F Stat                  Pseudo F Statistics                 420.29
      Approx. Expected overall R^2   Overall R-Square                    0.87
      Detailed output and plots on path Y:Assignment - ClusteringIteration_2_Maxclust_9.xlsx
      Iteration 3 : Maxclusters = 10
      Statistic used for comparison  Name (as it appears in the output)  Value_Iteration3
      Pseudo F Stat                  Pseudo F Statistics                 391.10
      Approx. Expected overall R^2   Overall R-Square                    0.88
      Detailed output and plots on path Y:Assignment - ClusteringIteration_3_Maxclust_10.xlsx
      Points considered for a comparison of the above 3 iterations:
      1. Relatively large values of Pseudo F Stat indicate a stopping point.
      2. Higher values of overall R-Square are desirable.
      3. Increasing the no. of clusters when there is little differentiation between the iterations means devising more marketing strategies, one unique to each cluster. On a cost vs. benefit basis, a smaller no. of clusters is preferable.
      Hence, iteration 2, in which 9 clusters are formed, seems most appropriate in the present case.
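One way to make the trade-off between the three iterations concrete is to look at how much R-Square is gained by each additional cluster. The Python sketch below tabulates that gain from the figures reported on the slide; it is an illustrative reading of the numbers, not the author's stated selection rule, and the names are hypothetical.

```python
# (Maxclusters, Pseudo F, overall R-Square) as reported for the three iterations.
iterations = [(8, 451.71, 0.84), (9, 420.29, 0.87), (10, 391.10, 0.88)]


def r_square_gains(rows):
    """R-Square gained by each step up in the number of clusters."""
    return [
        (rows[i][0], round(rows[i][2] - rows[i - 1][2], 2))
        for i in range(1, len(rows))
    ]


gains = r_square_gains(iterations)
print(gains)
```

Moving from 8 to 9 clusters adds 0.03 to R-Square, while moving from 9 to 10 adds only 0.01, which fits the slide's preference for 9 clusters once the cost of an extra cluster-specific strategy is weighed in.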
  • 35. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      *Deleting the outliers found (in the procedures above) from the scaled and weighted data set* ;
      Data Store_6_Final ;
        Set Store_6 ;
        If Store_Num IN(36 179 225 360) Then Delete ;
      Run;
      *Iteration 1 : Maxclusters = 8 * ;
      Proc FastClus Data = Store_6_Final Maxclusters = 8 Maxiter= 20 Converge = 0 Out=Clusters_8 ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
      *Generating Plots of clusters*;
      Legend1 Frame Cframe = Ligr Label=none Cborder=Black Position = Center Value = (Justify=Center) ;
      Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
      Axis2 Minor = None ;
  • 36. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      Proc Gplot Data = Clusters_8 ;
        Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_8 ;
        Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_8 ;
        Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_8 ;
        Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
  • 37. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      *Iteration 2 : Maxclusters = 9 * ;
      Proc FastClus Data = Store_6_Final Maxclusters = 9 Maxiter= 20 Converge = 0 Mean= Mean_Clusters_9 Out=Clusters_9 ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
      *Generating Plots of clusters*;
      Legend1 Frame Cframe = Ligr Label=none Cborder=Black Position = Center Value = (Justify=Center) ;
      Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
      Axis2 Minor = None ;
      Proc Gplot Data = Clusters_9 ;
        Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
  • 38. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      Proc Gplot Data = Clusters_9 ;
        Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_9 ;
        Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_9 ;
        Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      *Iteration 3 : Maxclusters = 10 * ;
      Proc FastClus Data = Store_6_Final Maxclusters = 10 Maxiter= 20 Converge = 0 Out=Clusters_10 ;
        Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Run;
  • 39. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      *Generating Plots of clusters*;
      Legend1 Frame Cframe = Ligr Label=none Cborder=Black Position = Center Value = (Justify=Center) ;
      Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
      Axis2 Minor = None ;
      Proc Gplot Data = Clusters_10 ;
        Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_10 ;
        Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
  • 40. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      Proc Gplot Data = Clusters_10 ;
        Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      Proc Gplot Data = Clusters_10 ;
        Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr Legend = Legend1 vaxis = axis1 haxis=axis2 ;
      Run;
      *Merging the data sets for analysis of the final clusters formed* ;
      Proc Sort Data = Stores_1 ;
        By Store_Num ;
      Run;
      Data Stores_1_Final ;
        Set Stores_1 ;
        If Store_Num IN(36 179 225 360) Then Delete ;
      Run;
  • 41. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
      SAS Code:
      Data Cluster_9_Final ;
        Set Clusters_9 ;
        Keep Store_Num Cluster ;
      Run;
      Proc Sort Data = Cluster_9_Final ;
        By Store_Num ;
      Run;
      Data Stores_1_Final_Merged ;
        Merge Stores_1_Final (in=a) Cluster_9_Final (in=b);
        By Store_Num ;
        If a and b ;
      Run;
  • 43. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
      (Max Distance = Maximum Distance from Seed to Observation; Centroid Distance = Distance Between Cluster Centroids; the Radius Exceeded column is blank in the output.)
      Cluster Summary
      Cluster  Frequency  RMS Std Deviation  Max Distance  Radius Exceeded  Nearest Cluster  Centroid Distance  Ratio
      1        46         0.8678             3.424                          3                3.3888             3.91
      2        57         0.9084             3.2855                         6                3.7655             4.15
      3        83         0.819              2.9486                         5                2.8481             3.48
      4        41         0.7917             2.8507                         6                2.8616             3.61
      5        99         0.8254             2.6116                         3                2.8481             3.45
      6        67         0.773              3.0078                         4                2.8616             3.70
      7        81         0.7999             2.9993                         4                2.9345             3.67
      8        7          0.8229             2.68                           9                3.2939             4.00
      9        30         0.7436             2.6002                         8                3.2939             4.43
      • The Ratio has been calculated using the ‘Difference in Centroids’ method as D / d1, where:
        D = Average distance between cluster centroids
        d1 = Average distance between members and cluster centroid
      • Thus, the ratio signifies the strength of the clusters formed: it measures the homogeneity within a cluster against the heterogeneity between clusters.
      • Cluster 9 is the strongest of all the clusters formed, followed by Cluster 2 and Cluster 8.
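The ordering stated in the last bullet can be read straight off the Ratio column. A minimal Python sketch that ranks the nine clusters by their reported ratio (names are hypothetical):

```python
# Reported Ratio per cluster, copied from the Cluster Summary above.
reported_ratio = {
    1: 3.91, 2: 4.15, 3: 3.48, 4: 3.61, 5: 3.45,
    6: 3.70, 7: 3.67, 8: 4.00, 9: 4.43,
}


def rank_by_ratio(ratios):
    """Cluster numbers ordered from strongest (highest ratio) to weakest."""
    return sorted(ratios, key=ratios.get, reverse=True)


ranking = rank_by_ratio(reported_ratio)
print(ranking)
```

The two largest clusters (5 and 3) come out weakest on this measure, meaning they sit closest to their neighbours relative to their own spread.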
  • 44. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
      The 9 clusters obtained in the preliminary cluster analysis have been evaluated and profiled below in order to gain insight into the variables that dominate the cluster formation:
      (Detailed output on path ‘Y:Assignment – ClusteringIteration_2_Maxclust_9.xlsx’)
      Cluster=1
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            46  39.92   38.41            8.82                0.17
      PCAT2            46  27.07   24.95            8.25                0.26
      PCAT3            46  12.58   13.73            4.80                0.24
      PCAT4            46  20.42   22.91            4.70                0.53
      Avg_Sales_Final  46  272.14  208.80           48.77               1.30
      Cluster=2
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            57  33.41   38.41            8.82                0.57
      PCAT2            57  20.16   24.95            8.25                0.58
      PCAT3            57  16.95   13.73            4.80                0.67
      PCAT4            57  29.49   22.91            4.70                1.40
      Avg_Sales_Final  57  142.44  208.80           48.77               1.36
      Cluster=3
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            83  39.70   38.41            8.82                0.15
      PCAT2            83  26.23   24.95            8.25                0.16
      PCAT3            83  13.58   13.73            4.80                0.03
      PCAT4            83  20.49   22.91            4.70                0.52
      Avg_Sales_Final  83  236.59  208.80           48.77               0.57
      Cluster=4
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            41  31.45   38.41            8.82                0.79
      PCAT2            41  20.68   24.95            8.25                0.52
      PCAT3            41  21.16   13.73            4.80                1.55
      PCAT4            41  26.71   22.91            4.70                0.81
      Avg_Sales_Final  41  183.11  208.80           48.77               0.53
      Cluster=5
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            99  37.08   38.41            8.82                0.15
      PCAT2            99  28.40   24.95            8.25                0.42
      PCAT3            99  12.62   13.73            4.80                0.23
      PCAT4            99  21.90   22.91            4.70                0.22
      Avg_Sales_Final  99  207.18  208.80           48.77               0.03
      Cluster=6
      Variable         N   Mean    Population Mean  Population Std Dev  Z-Score
      PCAT1            67  31.57   38.41            8.82                0.78
      PCAT2            67  33.61   24.95            8.25                1.05
      PCAT3            67  10.78   13.73            4.80                0.61
      PCAT4            67  24.04   22.91            4.70                0.24
      Avg_Sales_Final  67  173.18  208.80           48.77               0.73
  • 45. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
    Cluster=7
    Variable  N  Mean  Population Mean  Population Std Dev  Z-Score
    PCAT1  81  49.76  38.41  8.82  1.29
    PCAT2  81  15.03  24.95  8.25  1.20
    PCAT3  81  12.69  13.73  4.80  0.22
    PCAT4  81  22.52  22.91  4.70  0.08
    Avg_Sales_Final  81  183.23  208.80  48.77  0.52
    Cluster=8
    Variable  N  Mean  Population Mean  Population Std Dev  Z-Score
    PCAT1  7  39.64  38.41  8.82  0.14
    PCAT2  7  26.63  24.95  8.25  0.20
    PCAT3  7  13.52  13.73  4.80  0.04
    PCAT4  7  20.21  22.91  4.70  0.58
    Avg_Sales_Final  7  351.02  208.80  48.77  2.92
    Cluster=9
    Variable  N  Mean  Population Mean  Population Std Dev  Z-Score
    PCAT1  30  40.27  38.41  8.82  0.21
    PCAT2  30  28.77  24.95  8.25  0.46
    PCAT3  30  12.71  13.73  4.80  0.21
    PCAT4  30  18.25  22.91  4.70  0.99
    Avg_Sales_Final  30  316.83  208.80  48.77  2.21
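The Z-Score column in the profiling tables can be reproduced from the other columns. The Python sketch below is not part of the original SAS workflow and its names (`PROFILE_ROWS`, `z_score`) are illustrative. It assumes the usual definition, cluster mean minus population mean divided by population standard deviation, and keeps the sign, which the slides drop. Small differences in the last digit are expected because the slide values are rounded.

```python
# A few rows copied from the profiling tables:
# (cluster, variable, cluster mean, population mean, population std dev, tabulated |z|)
PROFILE_ROWS = [
    (2, "PCAT4", 29.49, 22.91, 4.70, 1.40),
    (4, "PCAT3", 21.16, 13.73, 4.80, 1.55),
    (7, "PCAT2", 15.03, 24.95, 8.25, 1.20),
    (8, "Avg_Sales_Final", 351.02, 208.80, 48.77, 2.92),
    (9, "Avg_Sales_Final", 316.83, 208.80, 48.77, 2.21),
]


def z_score(cluster_mean, population_mean, population_std):
    """Signed distance of a cluster mean from the population mean, in std devs."""
    return (cluster_mean - population_mean) / population_std


# Signed z-scores keyed by (cluster, variable)
signed_z = {
    (cluster, var): z_score(mean, pop_mean, pop_std)
    for cluster, var, mean, pop_mean, pop_std, _ in PROFILE_ROWS
}

if __name__ == "__main__":
    for (cluster, var), z in signed_z.items():
        direction = "above" if z > 0 else "below"
        print(f"Cluster {cluster} {var}: z = {z:+.2f} ({direction} the population mean)")
```

Keeping the sign makes it explicit that, for example, Cluster 7 sits below the population mean on PCAT2 while Cluster 2 sits above it on PCAT4.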
  • 46. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
    1. 7% of the total stores have avg sales per sq. foot significantly higher than the overall average (Cluster 8: 7 stores; Cluster 9: 30 stores).
    2. 11% of the total stores have significantly higher than overall average sales in the category of Tobacco & Alcohol (Cluster 2: 57 stores).
    3. 16% of the total stores have lower than overall average sales in the category of Tobacco & Alcohol, though the difference is not significant (Cluster 3: 83 stores).
    4. 32% of the total stores have higher than overall average sales in the category of Frozen Foods (Cluster 5: 99 stores; Cluster 6: 67 stores). In Cluster 6, average sales in the Frozen Foods category are significantly higher than the overall mean for that category.
  • 47. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
    The FREQ Procedure: Table of CLUSTER by State (Frequency and Percent of all stores)
    Cluster  KA Frequency  TN Frequency  Total Frequency  KA Percent  TN Percent  Total Percent
    1  30  16  46  5.87  3.13  9.00
    2  33  24  57  6.46  4.70  11.15
    3  44  39  83  8.61  7.63  16.24
    4  25  16  41  4.89  3.13  8.02
    5  49  50  99  9.59  9.78  19.37
    6  37  30  67  7.24  5.87  13.11
    7  43  38  81  8.41  7.44  15.85
    8  3  4  7  0.59  0.78  1.37
    9  16  14  30  3.13  2.74  5.87
    Total  280  231  511  54.79  45.21  100
    The number of stores is skewed towards KA, which has 55% of the total stores. No other significant pattern in the distribution of stores has emerged.
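The percentages in the cross-tabulation, and the store shares quoted in the preceding insights, are all counts divided by the 511 stores. The following Python sketch is an illustration outside the original SAS workflow (`CROSSTAB` and `percent_of_total` are assumed names) and recomputes a few of them from the frequencies.

```python
# Frequencies copied from the CLUSTER by State table: cluster -> (KA, TN)
CROSSTAB = {
    1: (30, 16), 2: (33, 24), 3: (44, 39),
    4: (25, 16), 5: (49, 50), 6: (37, 30),
    7: (43, 38), 8: (3, 4), 9: (16, 14),
}

# Total number of stores across both states
TOTAL_STORES = sum(ka + tn for ka, tn in CROSSTAB.values())


def percent_of_total(count, total=TOTAL_STORES):
    """Share of all stores, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


def cluster_size(cluster):
    """Number of stores in a cluster across both states."""
    return sum(CROSSTAB[cluster])


# Shares behind the four insights on the preceding slide
share_high_sales_per_sqft = percent_of_total(cluster_size(8) + cluster_size(9))
share_high_tobacco = percent_of_total(cluster_size(2))
share_low_tobacco = percent_of_total(cluster_size(3))
share_high_frozen = percent_of_total(cluster_size(5) + cluster_size(6))

if __name__ == "__main__":
    print(TOTAL_STORES, share_high_sales_per_sqft, share_high_tobacco,
          share_low_tobacco, share_high_frozen)
```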
  • 48. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
    Analysis Variable: Size
    Cluster  N Obs  N  Mean  Population Mean  Population Std Dev  Z-Score  Minimum  Maximum
    1  46  46  2471.52  2942.32  423.52  1.11  1700  2910
    2  57  57  3334.74  2942.32  423.52  0.93  2550  3650
    3  83  83  2761.39  2942.32  423.52  0.43  1925  3330
    4  41  41  3040.98  2942.32  423.52  0.23  2180  3650
    5  99  99  2985.56  2942.32  423.52  0.10  2000  3610
    6  67  67  3184.63  2942.32  423.52  0.57  2200  3630
    7  81  81  3172.10  2942.32  423.52  0.54  2600  3650
    8  7  7  1977.14  2942.32  423.52  2.28  1550  2150
    9  30  30  2205.33  2942.32  423.52  1.74  1750  2520
    Approx. 7% of all the stores (Clusters 8 and 9) have a mean size significantly lower than the overall mean store size. These stores are split roughly evenly between the two states, and no discernible pattern is observed.
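The population mean size of 2942.32 is consistent with the cluster-level rows: weighting each cluster mean by its store count recovers it. The Python sketch below is illustrative only and not part of the SAS code in these slides; `SIZE_BY_CLUSTER` and `pooled_mean` are assumed names.

```python
# Rows copied from the Size table: cluster -> (number of stores, mean size)
SIZE_BY_CLUSTER = {
    1: (46, 2471.52), 2: (57, 3334.74), 3: (83, 2761.39),
    4: (41, 3040.98), 5: (99, 2985.56), 6: (67, 3184.63),
    7: (81, 3172.10), 8: (7, 1977.14), 9: (30, 2205.33),
}


def pooled_mean(groups):
    """Mean of the combined groups, given (count, group mean) pairs."""
    groups = list(groups)
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n


# Overall mean store size rebuilt from the nine cluster means
overall_mean_size = pooled_mean(SIZE_BY_CLUSTER.values())

if __name__ == "__main__":
    print(round(overall_mean_size, 2))
```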
  • 49. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9 clusters)
  • 51. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
    SAS Code:
    /* Profiling the 9 clusters obtained in the preceding procedures */
    Proc Sort Data = Stores_1_Final_Merged ;
      By Cluster ;
    Run;

    /* Rescale Avg_Sales (x 1000) for readability */
    Data Stores_1_Final_Merged ;
      Set Stores_1_Final_Merged ;
      Avg_Sales_Final = Avg_Sales * 1000 ;
    Run;

    /* Cluster-wise N and mean */
    Proc Means Data = Stores_1_Final_Merged N Mean ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
      By Cluster ;
    Run;

    /* Population mean and std dev used for the Z-scores */
    Proc Means Data = Stores_1_Final_Merged N Mean Std ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
    Run;
  • 52. 2. METHODOLOGY c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
    SAS Code:
    /* Store counts by state */
    Proc Means Data = Stores_1_Final_Merged N ;
      Class State ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
    Run;

    /* Cluster by State cross-tabulation */
    Proc Freq Data = Stores_1_Final_Merged ;
      Tables Cluster * State / nocol norow ;
    Run;

    /* Cluster-wise mean store size */
    Proc Means Data = Stores_1_Final_Merged N Mean ;
      Var Size ;
      Class Cluster ;
    Run;

    /* Population mean and std dev of store size */
    Proc Means Data = Stores_1_Final_Merged Mean Std ;
      Var Size ;
    Run;
  • 54. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER)
    Starting from the 9 clusters obtained in the preliminary cluster analysis (PROC FASTCLUS, K-Means, on the data set treated for outliers), Hierarchical Clustering is performed next using PROC CLUSTER to obtain the final no. of clusters. The following methods are used for Hierarchical Clustering:
    S.No  Method  # of Clusters obtained  Remarks
    1  Ward's Method  3  The scatter diagrams of the clusters obtained revealed cluster formations that were not well demarcated. Also, the profiling of these 3 clusters didn't reveal any variable that was dominant in the formation of the clusters. For detailed results, refer to the tabs named 'Output_Wards' & 'Output_Final_Profiling_W'.
    2  Density Method  (by K, below)  Ties were observed when the Density method was used. Based on the position of the ties in the Cluster History, the clusters obtained when K=7 were finalized.
       K=7  5  For detailed results, refer to the tabs named 'Output_Density_K7' & 'Output_Final_Profiling_D'.
       K=8  4  For detailed results, refer to the tab named 'Output_Density_K8'.
       K=9  5  For detailed results, refer to the tab named 'Output_Density_K9'.
    Note: K is the smoothing parameter of the Density method (in PROC CLUSTER, K= sets the number of nearest neighbours used for density estimation). Its range here was chosen from the number of preliminary clusters: literature suggests using n^0.3 preliminary clusters, where n = no. of observations in the original data set, and n^0.3 = 511^0.3 = 6.5. With 9 preliminary clusters available, the Density method has therefore been run for 7<=K<=9.
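The n^0.3 rule of thumb quoted in the note is a one-line calculation. The Python sketch below is illustrative and not part of the original SAS code (`preliminary_cluster_count` is an assumed name). It shows where 6.5 comes from and why 7 is the smallest whole-number K considered.

```python
import math


def preliminary_cluster_count(n_obs, exponent=0.3):
    """Rule-of-thumb number of preliminary clusters, n ** 0.3."""
    return n_obs ** exponent


# 511 stores remain after outlier treatment
raw_count = preliminary_cluster_count(511)

# Smallest whole number at or above the rule-of-thumb value
lower_k = math.ceil(raw_count)

# Range of K explored in the Density method (upper end = 9 preliminary clusters)
k_values = list(range(lower_k, 9 + 1))

if __name__ == "__main__":
    print(round(raw_count, 1), k_values)
```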
  • 55. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’ (refer to tab named ‘Output_Wards’)
    The Cluster History from Ward’s method is as below (“.” marks a missing value; the Tie column is blank in every row):
    Number Of Clusters  First Cluster Joined  Second Cluster Joined  Frequency Of New Cluster  Semipartial RSq  RSquared  Pseudo F Statistic  Pseudo t-squared  Approximate Expected RSq  Cubic Clustering Criteria
    8  8  9  37  0.0054  0.99  13,000  .  .  .
    7  4  6  108  0.0184  0.98  3436  .  .  .
    6  1  3  129  0.0301  0.95  1772  .  .  .
    5  CL7  7  189  0.0322  0.91  1343  327  .  .
    4  CL5  5  288  0.0438  0.87  1131  248  .  .
    3  2  CL4  345  0.0952  0.77  874  346  .  .
    2  CL6  CL8  166  0.1266  0.65  938  585  .  .
    1  CL2  CL3  511  0.6483  0  .  938  0  0
    # of clusters according to:
    Pseudo T-Square: 3, 2
    Semipartial R-Square: 8, 7, 6, 5, 4, 3
    Therefore, final # of clusters considered on the basis of the results of Ward's Method = 3
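One way to read a cluster history is to replay the merges and track the size of each new cluster. The Python sketch below is an illustration outside the original SAS workflow (`replay_frequencies` and the constant names are assumptions). It assumes the labels 1 to 9 in the history are the preliminary clusters, that "CLn" names the cluster formed when n clusters remain, and it recomputes the Frequency Of New Cluster column from the preliminary cluster sizes.

```python
# Sizes of the 9 preliminary clusters (from the PROC FASTCLUS cluster summary)
PRELIM_SIZES = {"1": 46, "2": 57, "3": 83, "4": 41, "5": 99,
                "6": 67, "7": 81, "8": 7, "9": 30}

# Ward cluster history: (number of clusters, first joined, second joined, reported frequency)
WARD_HISTORY = [
    (8, "8", "9", 37),
    (7, "4", "6", 108),
    (6, "1", "3", 129),
    (5, "CL7", "7", 189),
    (4, "CL5", "5", 288),
    (3, "2", "CL4", 345),
    (2, "CL6", "CL8", 166),
    (1, "CL2", "CL3", 511),
]


def replay_frequencies(history, sizes):
    """Replay the merges and return {number of clusters: size of the new cluster}."""
    live = dict(sizes)          # clusters that still exist, by label
    computed = {}
    for n_clusters, first, second, _reported in history:
        merged = live.pop(first) + live.pop(second)
        live[f"CL{n_clusters}"] = merged   # the new cluster is named after the step
        computed[n_clusters] = merged
    return computed


computed_frequencies = replay_frequencies(WARD_HISTORY, PRELIM_SIZES)

if __name__ == "__main__":
    for n_clusters, first, second, reported in WARD_HISTORY:
        print(n_clusters, first, second, reported, computed_frequencies[n_clusters])
```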
  • 56. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method The tree diagram from Ward’s method is as below:
  • 57. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method The following are the plots obtained from Ward’s method:
  • 59. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    The 3 clusters obtained from Ward’s method have been profiled as below:
    Analysis Variable: Cluster_Final
    Cluster_Final  N Obs  N
    1  103  103
    2  242  242
    3  166  166
    Cluster_Final=1
    Variable  N  Mean  Popltn Mean  Popltn Std. Dev  Z-Score
    PCAT1  103  36.32  38.41  8.82  0.24
    PCAT2  103  23.25  24.95  8.25  0.21
    PCAT3  103  15.00  13.73  4.80  0.26
    PCAT4  103  25.44  22.91  4.70  0.54
    Avg_Sales_Final  103  200.36  208.80  48.77  0.17
    Cluster_Final=2
    Variable  N  Mean  Popltn Mean  Popltn Std. Dev  Z-Score
    PCAT1  242  41.74  38.41  8.82  0.38
    PCAT2  242  21.87  24.95  8.25  0.37
    PCAT3  242  14.46  13.73  4.80  0.15
    PCAT4  242  21.94  22.91  4.70  0.21
    Avg_Sales_Final  242  222.93  208.80  48.77  0.29
    Cluster_Final=3
    Variable  N  Mean  Popltn Mean  Popltn Std. Dev  Z-Score
    PCAT1  166  34.85  38.41  8.82  0.40
    PCAT2  166  30.50  24.95  8.25  0.67
    PCAT3  166  11.88  13.73  4.80  0.39
    PCAT4  166  22.76  22.91  4.70  0.03
    Avg_Sales_Final  166  193.45  208.80  48.77  0.31
    Table of Cluster_Final by State (Frequency and Percent of all stores)
    Cluster_Final  KA Frequency  TN Frequency  Total Frequency  KA Percent  TN Percent  Total Percent
    1  63  40  103  12.33  7.83  20.16
    2  131  111  242  25.64  21.72  47.36
    3  86  80  166  16.83  15.66  32.49
    Total  280  231  511  54.79  45.21  100
    Analysis Variable: Size
    Cluster_Final  Mean Size  Popltn Mean  Popltn Std. Dev  Z-Score
    1  2949.22  2942.32  423.52  0.016
    2  2854.61  2942.32  423.52  0.207
    3  3065.9  2942.32  423.52  0.292
    Conclusion:
    • Both the graphical plots and the summary stats of the 3 clusters obtained using Ward’s method reveal no clear cluster formation.
    • No particular variable has been found to dominate any of the 3 cluster formations.
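Because each final cluster is a union of preliminary clusters, its profile means are count-weighted averages of the preliminary means. The Python sketch below is illustrative and outside the original SAS workflow (`pooled_means` and the constant names are assumptions). It uses the mapping in the profiling SAS code for Ward's method, where preliminary clusters 1 and 2 form Cluster_Final 1, and rebuilds that cluster's profile row. Agreement is to within rounding of the slide values.

```python
# Preliminary clusters mapped to Cluster_Final = 1 in the profiling code:
# cluster -> (stores, [PCAT1, PCAT2, PCAT3, PCAT4, Avg_Sales_Final] means)
PRELIM_PROFILES = {
    1: (46, [39.92, 27.07, 12.58, 20.42, 272.14]),
    2: (57, [33.41, 20.16, 16.95, 29.49, 142.44]),
}

# Means reported for Cluster_Final = 1 in the Ward profiling table
REPORTED_FINAL_1 = [36.32, 23.25, 15.00, 25.44, 200.36]


def pooled_means(groups):
    """Count-weighted mean of each variable across the given groups."""
    groups = list(groups)
    total_n = sum(n for n, _ in groups)
    n_vars = len(groups[0][1])
    return [
        sum(n * means[i] for n, means in groups) / total_n
        for i in range(n_vars)
    ]


pooled_final_1 = pooled_means(PRELIM_PROFILES.values())

if __name__ == "__main__":
    print([round(value, 2) for value in pooled_final_1])
```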
  • 60. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    SAS Code:
    /* Hierarchical clustering performed on the 9 preliminary clusters obtained using K-Means. */
    /* The data set used for K-Means clustering was treated for outliers and hence doesn't contain any outliers. */
    /* Block comments are used because a statement-style comment cannot contain an unmatched quotation mark. */

    /* Ward's Method */
    Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_W Method = Ward CCC Pseudo ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Copy Cluster ;
      ID Cluster ;
    Run;

    /* Cut the tree at 3 clusters */
    Proc Tree Data = Tree_9_W Horizontal Lines=(color=blue) out = Tree_Out_9_W nclusters = 3 ;
      Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      ID Cluster ;
    Run;

    Proc Print Data = Tree_Out_9_W ;
    Run;
  • 61. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    SAS Code:
    /* Profiling of the clusters formed using Ward's Method */
    /* Based on the data set generated by the TREE procedure, the 9 preliminary clusters
       are mapped to the final 3 clusters obtained by Ward's method */
    Data Stores_Final_Analysis_W ;
      Set Stores_1_Final_Merged ;
      If Cluster in (1, 2) Then Cluster_Final_W = 1 ;
      Else If Cluster in (5, 6) Then Cluster_Final_W = 3 ;
      Else If Cluster in (3, 4, 7, 8, 9) Then Cluster_Final_W = 2 ;
    Run;

    Proc Sort Data = Stores_Final_Analysis_W ;
      By Cluster_Final_W ;
    Run;

    /* Number of stores in each final cluster */
    Proc Means Data = Stores_Final_Analysis_W N ;
      Var Cluster_Final_W ;
      Class Cluster_Final_W ;
    Run;
  • 62. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    SAS Code:
    Data Stores_Final_Analysis_W ;
      Set Stores_Final_Analysis_W ;
      Avg_Sales_Final = Avg_Sales * 1000 ;
    Run;

    /* Cluster-wise N and mean */
    Proc Means Data = Stores_Final_Analysis_W N Mean ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
      By Cluster_Final_W ;
    Run;

    /* Population mean and std dev */
    Proc Means Data = Stores_Final_Analysis_W N Mean Std ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
    Run;

    Proc Freq Data = Stores_Final_Analysis_W ;
      Tables Cluster_Final_W * State / nocol norow nocum ;
    Run;

    Proc Means Data = Stores_Final_Analysis_W N Mean ;
      Var Size ;
      By Cluster_Final_W ;
    Run;
  • 63. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
    SAS Code:
    Proc Means Data = Stores_Final_Analysis_W Mean Std ;
      Var Size ;
    Run;

    /* Scatter plots of each category share against avg sales, coloured by final cluster */
    Legend1 Frame Cframe = ligr cborder=black position=center value=(justify=center) ;
    Axis1 label=(angle=90 rotate=0) minor=none ;
    Axis2 minor=none ;

    Proc Gplot ;
      Plot PCAT1 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
    Proc Gplot ;
      Plot PCAT2 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
    Proc Gplot ;
      Plot PCAT3 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
    Proc Gplot ;
      Plot PCAT4 * Avg_Sales_Final = Cluster_Final_W / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
  • 64. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    The output for the Density method discussed below and in the following slides is for K=7. (Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’. Refer to the tab named ‘Output_Density_K7’. For output when K=8 & K=9, refer to the tabs named ‘Output_Density_K8’ & ‘Output_Density_K9’.)
    Cluster History (“.” marks a missing value; T marks a tie)
    Number Of Clusters  First Cluster Joined  Second Cluster Joined  Frequency Of New Cluster  Semipartial RSq  RSquared  Pseudo F Statistic  Pseudo t-squared  Approximate Expected RSq  Cubic Clustering Criteria  Normalized Fusion Density  Lesser Density  Greater Density  Tie
    8  3  5  182  0.0324  0.97  2147  .  .  .  61.799  44.7166  100
    7  CL8  7  263  0.0826  0.89  647  665  .  .  38.79  24.0617  100
    6  CL7  1  309  0.1255  0.76  319  335  .  .  35.798  21.8011  100  T
    5  CL6  4  350  0.0541  0.71  303  78.3  .  .  35.798  21.8011  100
    4  CL5  6  417  0.0911  0.61  269  128  .  .  26  14.9422  100
    3  CL4  2  474  0.1869  0.43  190  229  .  .  7.2274  3.7492  100
    2  CL3  9  504  0.3124  0.12  66.2  274  .  .  6.0544  3.1217  100
    1  CL2  8  511  0.1151  0  .  66.2  0  0  2.1174  1.07  100
    # of clusters according to:
    Pseudo T-Square: 5, 4
    Semipartial R-Square: 8, 7, 5, 4
    Therefore, final # of clusters considered in this iteration = 5
    Since the tie occurs in the early history of the cluster formation, it should have only a little effect on the later stages and hence can be overlooked.
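The RSquared column of a cluster history is tied to the Semipartial RSq column: each merge lowers R-squared by that step's semipartial value, starting from 1. The Python sketch below is an illustration outside the original SAS workflow (`r_squared_path` and the constant names are assumptions) and rebuilds the RSquared column of the Density history for K=7. Agreement is only to about two decimals because the slide values are rounded.

```python
# Density method (K=7) cluster history:
# (number of clusters, semipartial R-squared, tabulated R-squared)
DENSITY_HISTORY = [
    (8, 0.0324, 0.97),
    (7, 0.0826, 0.89),
    (6, 0.1255, 0.76),
    (5, 0.0541, 0.71),
    (4, 0.0911, 0.61),
    (3, 0.1869, 0.43),
    (2, 0.3124, 0.12),
    (1, 0.1151, 0.00),
]


def r_squared_path(history):
    """R-squared after each merge: 1 minus the running sum of semipartial R-squared."""
    remaining = 1.0
    path = {}
    for n_clusters, semipartial, _tabulated in history:
        remaining -= semipartial
        path[n_clusters] = remaining
    return path


computed_r_squared = r_squared_path(DENSITY_HISTORY)

if __name__ == "__main__":
    for n_clusters, _semipartial, tabulated in DENSITY_HISTORY:
        print(n_clusters, round(computed_r_squared[n_clusters], 4), tabulated)
```

A large semipartial value at a step, such as 0.1869 when going from 4 to 3 clusters, means that merge gives up a lot of explained variance, which is the reasoning behind reading the candidate cluster counts off this column.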
  • 65. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method The tree diagram from the Density method when K=7 is as below:
  • 66. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method The following are the plots obtained from the Density method when K=7:
  • 68. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    The profiles of the final 5 clusters obtained from the Density method are as follows:
    Legend: Cat1 = Fresh Foods; Cat2 = Frozen Foods; Cat3 = Health & Beauty; Cat4 = Tobacco & Alcohol
    Analysis Variable: Cluster_Final_D
    Cluster_Final_D  N Obs  N
    1  326  326
    2  67  67
    3  81  81
    4  7  7
    5  30  30
    Cluster_Final_D=1
    Variable  N  Mean  Popltn Mean  Popltn Std Dev  Z-Score
    PCAT1  326  36.80  38.41  8.82  0.18
    PCAT2  326  25.25  24.95  8.25  0.04
    PCAT3  326  14.69  13.73  4.80  0.20
    PCAT4  326  23.26  22.91  4.70  0.07
    Avg_Sales_Final  326  209.48  208.80  48.77  0.01
    • No particular variable has emerged as a dominating variable responsible for the formation of this cluster.
    • Mean values of the variables in this cluster are very near to the overall means of the variables in the data set.
    Cluster_Final_D=2
    Variable  N  Mean  Popltn Mean  Popltn Std Dev  Z-Score
    PCAT1  67  31.57  38.41  8.82  0.78
    PCAT2  67  33.61  24.95  8.25  1.05
    PCAT3  67  10.78  13.73  4.80  0.61
    PCAT4  67  24.04  22.91  4.70  0.24
    Avg_Sales_Final  67  173.18  208.80  48.77  0.73
    PCAT2 has emerged as the dominating and most determining variable in the formation of this cluster: it holds nearly 13% of the total no. of stores, and its mean PCAT2 (33.61%) is well above the overall mean of about 25%.
  • 69. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    Cluster_Final_D=3
    Variable  N  Mean  Popltn Mean  Popltn Std Dev  Z-Score
    PCAT1  81  49.76  38.41  8.82  1.29
    PCAT2  81  15.03  24.95  8.25  1.20
    PCAT3  81  12.69  13.73  4.80  0.22
    PCAT4  81  22.52  22.91  4.70  0.08
    Avg_Sales_Final  81  183.23  208.80  48.77  0.52
    Both PCAT1 and PCAT2 have emerged as the dominating variables in Cluster 3, which holds nearly 16% of the total no. of stores: its mean PCAT1 is well above the overall mean, while its mean PCAT2 is well below the overall mean.
    Cluster_Final_D=4
    Variable  N  Mean  Popltn Mean  Popltn Std Dev  Z-Score
    PCAT1  7  39.64  38.41  8.82  0.14
    PCAT2  7  26.63  24.95  8.25  0.20
    PCAT3  7  13.52  13.73  4.80  0.04
    PCAT4  7  20.21  22.91  4.70  0.58
    Avg_Sales_Final  7  351.02  208.80  48.77  2.92
    Avg Sales per Sq. Foot has emerged as the dominating variable in Cluster 4, which holds nearly 1.4% of the total no. of stores: its mean avg sales per sq. foot is significantly higher than the overall mean.
    Cluster_Final_D=5
    Variable  N  Mean  Popltn Mean  Popltn Std Dev  Z-Score
    PCAT1  30  40.27  38.41  8.82  0.21
    PCAT2  30  28.77  24.95  8.25  0.46
    PCAT3  30  12.71  13.73  4.80  0.21
    PCAT4  30  18.25  22.91  4.70  0.99
    Avg_Sales_Final  30  316.83  208.80  48.77  2.21
    Avg Sales per Sq. Foot has emerged as the dominating variable in Cluster 5, which holds nearly 6% of the total no. of stores: its mean avg sales per sq. foot is significantly higher than the overall mean.
  • 70. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    The FREQ Procedure: Table of Cluster_Final_D by State (Frequency and Percent of all stores)
    Cluster_Final_D  KA Frequency  TN Frequency  Total Frequency  KA Percent  TN Percent  Total Percent
    1  181  145  326  35.42  28.38  63.80
    2  37  30  67  7.24  5.87  13.11
    3  43  38  81  8.41  7.44  15.85
    4  3  4  7  0.59  0.78  1.37
    5  16  14  30  3.13  2.74  5.87
    Total  280  231  511  54.79  45.21  100
    No specific pattern has emerged in the state-wise analysis of the clusters formed.
    Analysis Variable: Size
    Cluster_Final  Mean Size  Popltn Mean  Popltn Std. Dev  Z-Score
    1  2923.97  2942.32  423.52  0.04
    2  3184.63  2942.32  423.52  0.57
    3  3172.1  2942.32  423.52  0.54
    4  1977.14  2942.32  423.52  2.28
    5  2205.33  2942.32  423.52  1.74
    • The average size of the stores in Cluster 4 is much lower than the overall average size of the stores in the given data set.
    • Hence, the avg sales per sq. foot for stores in this cluster is also significantly higher than the overall average sales per sq. foot.
  • 71. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    /* Density Method */
    /* K is the smoothing parameter; its range is chosen from the no. of preliminary clusters. */
    /* Literature suggests using n^0.3 preliminary clusters, where n = no. of observations in the original data set. */
    /* Hence, n^0.3 = 511^0.3 = 6.5, so the Density method has been run for 7<=K<=9. */

    /* K = 7 */
    Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D7 Method = Density K=7 CCC Pseudo ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Copy Cluster ;
      ID Cluster ;
    Run;

    /* Cut the tree at 5 clusters */
    Proc Tree Data = Tree_9_D7 Horizontal Lines=(color=blue) out = Tree_Out_9_D7 nclusters=5 ;
      Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      ID Cluster ;
    Run;

    Proc Print Data = Tree_Out_9_D7 ;
    Run;
  • 72. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    /* Profiling of the clusters formed using the Density Method for K=7 */
    /* Based on the data set generated by the TREE procedure, the 9 preliminary clusters
       are mapped to the final 5 clusters obtained by the Density method */
    Data Stores_Final_Analysis_D ;
      Set Stores_1_Final_Merged ;
      If Cluster in (1, 2, 3, 4, 5) Then Cluster_Final_D = 1 ;
      Else If Cluster = 6 Then Cluster_Final_D = 2 ;
      Else If Cluster = 7 Then Cluster_Final_D = 3 ;
      Else If Cluster = 8 Then Cluster_Final_D = 4 ;
      Else If Cluster = 9 Then Cluster_Final_D = 5 ;
    Run;

    Proc Sort Data = Stores_Final_Analysis_D ;
      By Cluster_Final_D ;
    Run;

    /* Number of stores in each final cluster */
    Proc Means Data = Stores_Final_Analysis_D N ;
      Var Cluster_Final_D ;
      Class Cluster_Final_D ;
    Run;
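The effect of the IF/ELSE mapping in the SAS step can be checked without the store-level data, since only the preliminary cluster sizes are needed. The Python sketch below is illustrative and not part of the original workings (`final_cluster_sizes`, `DENSITY_MAPPING` and the other names are assumptions). It applies the same 9-to-5 mapping and derives the size and share of each final cluster.

```python
from collections import Counter

# Sizes of the 9 preliminary clusters (PROC FASTCLUS cluster summary)
PRELIM_SIZES = {1: 46, 2: 57, 3: 83, 4: 41, 5: 99, 6: 67, 7: 81, 8: 7, 9: 30}

# Same mapping as the SAS data step: preliminary cluster -> Cluster_Final_D
DENSITY_MAPPING = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5}


def final_cluster_sizes(prelim_sizes, mapping):
    """Add up preliminary cluster sizes under the given cluster mapping."""
    sizes = Counter()
    for prelim_cluster, n_stores in prelim_sizes.items():
        sizes[mapping[prelim_cluster]] += n_stores
    return dict(sizes)


final_sizes = final_cluster_sizes(PRELIM_SIZES, DENSITY_MAPPING)
total_stores = sum(final_sizes.values())

# Share of all stores in each final cluster, in percent
final_shares = {
    cluster: round(100 * n / total_stores, 2)
    for cluster, n in final_sizes.items()
}

if __name__ == "__main__":
    for cluster in sorted(final_sizes):
        print(cluster, final_sizes[cluster], final_shares[cluster])
```

The resulting sizes (326, 67, 81, 7 and 30 stores) are the N values shown in the Density profiling tables.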
  • 73. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    /* Read the data set created in the mapping step (Stores_Final_Analysis_D) */
    Data Stores_Final_Analysis_D ;
      Set Stores_Final_Analysis_D ;
      Avg_Sales_Final = Avg_Sales * 1000 ;
    Run;

    /* Cluster-wise N and mean */
    Proc Means Data = Stores_Final_Analysis_D N Mean ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
      By Cluster_Final_D ;
    Run;

    /* Population mean and std dev */
    Proc Means Data = Stores_Final_Analysis_D N Mean Std ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
    Run;

    Proc Freq Data = Stores_Final_Analysis_D ;
      Tables Cluster_Final_D * State / nocol norow nocum ;
    Run;

    Proc Means Data = Stores_Final_Analysis_D N Mean ;
      Var Size ;
      By Cluster_Final_D ;
    Run;
  • 74. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    Proc Means Data = Stores_Final_Analysis_D Mean Std ;
      Var Size ;
    Run;

    Proc Export Data = Stores_Final_Analysis_D
      Outfile = 'Y:Assignment - ClusteringStores_Final_Analysis_D.csv'
      DBMS = CSV Replace ;
    Run;

    Legend1 Frame Cframe = ligr cborder=black position=center value=(justify=center) ;
    Axis1 label=(angle=90 rotate=0) minor=none ;
    Axis2 minor=none ;

    Proc Gplot ;
      Plot PCAT1 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
    Proc Gplot ;
      Plot PCAT2 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
  • 75. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    Proc Gplot ;
      Plot PCAT3 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;
    Proc Gplot ;
      Plot PCAT4 * Avg_Sales_Final = Cluster_Final_D / Frame Cframe=ligr Legend=Legend1 vaxis=axis1 haxis=axis2 ;
    Run;

    /* K = 8 */
    Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D8 Method = Density K=8 CCC Pseudo ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Copy Cluster ;
    Run;

    Proc Tree Data = Tree_9_D8 Horizontal Lines=(color=blue) out = Tree_Out_9_D8 nclusters=4 ;
      Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
    Run;

    Proc Print Data = Tree_Out_9_D8 ;
    Run;
  • 76. 2. METHODOLOGY d. Hierarchical Clustering (PROC CLUSTER): Density Method
    SAS Code:
    /* K = 9 */
    Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D9 Method = Density K=9 CCC Pseudo ;
      Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
      Copy Cluster ;
    Run;

    Proc Tree Data = Tree_9_D9 Horizontal Lines=(color=blue) out = Tree_Out_9_D9 nclusters=5 ;
      Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
    Run;

    Proc Print Data = Tree_Out_9_D9 ;
    Run;
  • 78. 3. SUMMARY OF INSIGHTS
    1. 16% of the stores have their mean sales from the Fresh Food category higher than the overall average in this category (Cluster 3: 81 stores).
    2. 13% of the stores have their mean sales from the Frozen Food category higher than the overall average in this category (Cluster 2: 67 stores).
    3. 16% of the stores have their mean sales from the Frozen Food category lower than the overall average in this category (Cluster 3: 81 stores).
    4. The % sales from the Health & Beauty category in all the clusters formed above is close to the overall mean sales of this category.
    5. Only 5% of the total no. of stores have their mean sales from the Tobacco & Alcohol category lower than the overall mean sales of this category.
    6. 7% of the total stores have their average sales per sq. foot significantly higher than the overall average (Cluster 4: 7 stores; Cluster 5: 30 stores). The difference is particularly pronounced for stores in Cluster 4, in which the average size of the stores is also much lower than the overall average size.
    7. 29% of the total stores have their average sales per sq. foot significantly lower than the overall average (Cluster 2: 67 stores; Cluster 3: 81 stores).
  • 80. 4. RECOMMENDATIONS
    1. Cluster 3
    a. The average size of stores in this cluster is higher than the average size of all stores, though the difference is not significant.
    b. Compared with the overall mean sales of all stores from the Fresh Foods category, the contribution of Fresh Foods to revenue is highest for stores in this cluster.
    c. However, compared with the overall mean sales of all stores from the Frozen Foods category, the contribution of Frozen Foods to revenue is lowest for stores in this cluster.
    d. The average sales per sq. foot of stores in this cluster is also lower than the overall average sales per sq. foot of all stores.
    e. The above observations imply that although the Fresh Foods category contributes the most to sales, this contribution is perhaps not enough to lift the overall sales of these stores, which are lower than the average of all other stores despite the larger store size. There may therefore be a need to adopt techniques such as better placement of these products or a promotional campaign targeted specifically at products in this category. Strategies may also be devised for promoting sales from the Frozen Foods category, as they are significantly lower than the overall average sales of this category in other stores. One possibility is that sales from the Fresh Foods category are cannibalizing sales from the Frozen Foods category, in which case an alternative shelf placement is required.
  • 81. 4. RECOMMENDATIONS
    2. Cluster 2
    a. Stores in this cluster show a situation that contrasts with that of stores in Cluster 3.
    b. Sales from the Frozen Foods category contribute the most to the overall revenue of stores in this cluster and are greater than the overall mean sales from this category in all other stores, whereas sales from the Fresh Foods category are lower than the overall mean sales from this category in other stores.
    c. The average size of stores in this cluster is roughly the same as that of stores in Cluster 3 and is higher than the overall mean size of other stores.
    d. The average sales per sq. foot is lower than the overall average sales per sq. foot of other stores. It is also lower than the average sales per sq. foot of stores in Cluster 3.
    e. Hence, strategies similar to those proposed for stores in Cluster 3 may also be replicated for stores in Cluster 2 to promote sales from both the Fresh Foods and the Frozen Foods categories. This may be done after gaining insights into the factors that are driving Frozen Foods sales in stores of Cluster 2 and Fresh Foods sales in stores of Cluster 3.
  • 82. 4. RECOMMENDATIONS
    3. Cluster 1
    a. Cluster 1 has the highest no. of stores of all the clusters, roughly 64% of the total.
    b. Sales from all 4 categories of products for stores in this cluster are very close to the overall mean sales of each of the four categories across all the stores.
    c. The average size of the stores in this cluster is also very close to the overall average size of all stores.
    d. Since this cluster has the highest and a significant % of stores, promotional activities adopted for all these stores can perhaps also help in significantly increasing the overall sales volume of Retailer X.
    4. Cluster 4
    a. This cluster houses only 1% of the total stores, with the only differentiating factor being the average sales per sq. foot, which is significantly higher than the overall average for other stores.
    b. The mean sales of products in each of the 4 categories is very similar to the overall mean sales of those categories.
    c. Hence, the only possible reason for a significantly higher average sales per sq. foot is the lower than overall average size of the stores. No specific state-wise pattern has emerged for these stores, with the distribution being fairly consistent across both states, KA & TN.
  • 83. 4. RECOMMENDATIONS
    5. Cluster 5
    a. This cluster houses nearly 6% of the total stores.
    b. The distribution of variables for stores in this cluster is very similar to that of stores in Cluster 4.
    c. However, it may be noted that sales from the Tobacco & Alcohol category are lower than the overall mean sales of other stores from this category.
    6. Having identified the drivers of sales in stores of each of the 5 clusters, it is next important to understand the other factors that influence each of these drivers. Including demographic factors such as age, income, location and gender as additional variables may give better insights into the promotional strategies, unique to each cluster, that may be adopted for increasing sales.