The document describes a methodology for clustering grocery stores of a retailer in Karnataka and Tamil Nadu based on sales data. Exploratory data analysis was conducted on the sales and store size data, which found that category 1 sales were highest on average. The data was then prepared for clustering by standardizing percentage sales variables, and weighting the average sales per square foot variable through multiple iterations. Preliminary K-means clustering was performed using PROC FASTCLUS to create clusters and identify outliers.
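The scaling and weighting steps described above can be illustrated with a short sketch. The document itself uses SAS; this is an analogous Python version, and the variable names, sample values, and weight are hypothetical:

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score standardization: zero mean, unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical store-level variables: percentage of sales in category 1,
# and average sales per square foot.
pct_cat1_sales = [42.0, 55.0, 38.0, 61.0, 47.0]
sales_per_sqft = [120.0, 310.0, 95.0, 280.0, 150.0]

z_pct = standardize(pct_cat1_sales)

# Weighting: multiply the standardized variable by a chosen weight so it
# contributes more (or less) to the distance calculations in clustering.
# The weight itself would be tuned over multiple iterations, as in the text.
weight = 2.0
z_sqft_weighted = [weight * z for z in standardize(sales_per_sqft)]
```

After standardization each variable has mean 0 and standard deviation 1; the weighted variable then has standard deviation equal to its weight, which is what makes it count proportionally more in Euclidean distances.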
This document presents a chain sampling plan for truncated life tests when product lifetime follows a log-logistic distribution. It provides the minimum sample size needed to ensure a specified acceptance probability while satisfying producer and consumer risks, for various quality levels. Tables 1 and 2 show the minimum sample sizes and operating characteristic functions for the proposed sampling plan for different confidence levels, acceptance numbers, and ratios of test time to scale parameter. For example, a sample size of 10 is required for a confidence level of 0.99, acceptance number of 2, and time-to-scale ratio of 0.942.
Multi Task DPP for Basket Completion, by Romain WARLOP, Fifty Five (recsysfr)
Determinantal point processes (DPPs) have received significant attention in recent years as an elegant model for a variety of machine learning tasks, owing to their ability to capture both set diversity and item quality or popularity. Recent work has shown that DPPs can be effective models for product recommendation and basket completion tasks. We present an enhanced DPP model that is specialized for the task of basket completion, the multi-task DPP. We view the basket completion problem as a multi-class classification problem, and leverage ideas from tensor factorization and multi-class classification to design the multi-task DPP model. We evaluate our model on several real-world datasets, and find that the multi-task DPP provides significantly better predictive quality than a number of state-of-the-art models.
This document discusses different approaches to multivariate data analysis and clustering, including nearest neighbor methods, hierarchical clustering, and k-means clustering. It provides examples of using Ward's method, average linkage, and k-means clustering on poverty data to identify potential clusters of countries based on variables like birth rate, death rate, and infant mortality rate. Key lessons are that different linkage methods, distance measures, and data normalizations should be tested and that higher-dimensional data may require different variable spaces or transformations to identify meaningful clusters.
The document provides an overview of topics to be covered in a data analysis course, including cluster analysis and decision trees. The course will cover descriptive statistics, probability distributions, correlation, regression, hypothesis testing, clustering methods like k-means, and decision tree techniques like CHAID. Clustering involves grouping similar objects together to identify clusters that are internally homogeneous and distinct from one another. Applications of clustering include market segmentation, credit risk analysis, and operations. The document gives an example of clustering students based on their exam scores.
The document discusses RFM customer segmentation, which segments customers based on their Recency (time since last activity), Frequency (number of activities), and Monetary (total monetary value) metrics. These metrics can be calculated from transaction or engagement data and used to group customers into segments. The segments are identified by analyzing the distribution of the RFM metrics and identifying cut-off points. RFM segmentation allows identifying the most valuable customers and those at high risk of churn for targeted marketing campaigns.
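The Recency, Frequency, and Monetary metrics described above can be computed directly from a transaction log. A minimal Python sketch with made-up customers and dates:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, transaction_date, amount)
transactions = [
    ("C1", date(2023, 1, 5), 50.0),
    ("C1", date(2023, 3, 20), 75.0),
    ("C2", date(2022, 11, 2), 20.0),
    ("C2", date(2023, 2, 14), 30.0),
    ("C2", date(2023, 3, 28), 25.0),
]
today = date(2023, 4, 1)

rfm = {}
for cust, when, amount in transactions:
    rec = rfm.setdefault(cust, {"recency": None, "frequency": 0, "monetary": 0.0})
    days_ago = (today - when).days
    # Recency: days since the most recent transaction (smaller = more recent).
    if rec["recency"] is None or days_ago < rec["recency"]:
        rec["recency"] = days_ago
    rec["frequency"] += 1          # Frequency: number of transactions
    rec["monetary"] += amount      # Monetary: total spend
```

Once each customer has an (R, F, M) triple, cut-off points over the three distributions (for example, terciles or quintiles) assign customers to segments such as "recent, frequent, high-spend".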
The document provides an overview of cluster analysis techniques. It discusses the need for segmentation to group large populations into meaningful subsets. Common clustering algorithms like k-means are introduced, which assign data points to clusters based on similarity. The document also covers calculating distances between observations, defining the distance between clusters, and interpreting the results of clustering analysis. Real-world applications of segmentation and clustering are mentioned such as market research, credit risk analysis, and operations management.
This document provides an overview of cluster analysis techniques. It defines cluster analysis as classifying cases into homogeneous groups based on a set of variables. The document then discusses how cluster analysis can be used in marketing research for market segmentation, understanding consumer behaviors, and identifying new product opportunities. It outlines the typical steps to conduct a cluster analysis, including selecting a distance measure and clustering algorithm, determining the number of clusters, and validating the analysis. Specific clustering methods like hierarchical, k-means, and deciding the number of clusters using the elbow rule are explained. The document concludes with an example of conducting a cluster analysis in SPSS.
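The elbow rule mentioned above picks the number of clusters k at the point where the within-cluster sum of squares (WCSS) stops dropping sharply. A minimal Python illustration on one-dimensional data (the data and the choice of initial centers are made up):

```python
def kmeans_1d(xs, centers, iters=20):
    """Basic Lloyd's algorithm on 1-D data; returns final centers and WCSS."""
    centers = list(centers)
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = {i: [] for i in range(len(centers))}
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            groups[nearest].append(x)
        # Recompute each center as the mean of its assigned points.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in sorted(groups.items())]
    wcss = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return centers, wcss

# Two well-separated groups: WCSS falls steeply from k=1 to k=2,
# then levels off -- the "elbow" is at k=2.
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
wcss = {k: kmeans_1d(data, data[:k])[1] for k in (1, 2, 3)}
```

Plotting WCSS against k makes the elbow visible: the large drop between k=1 and k=2, followed by a nearly flat segment, suggests two clusters.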
The document proposes an improved k-means clustering algorithm to address some limitations of the traditional k-means method. The improved algorithm handles mixed categorical and numeric data by converting categorical attributes to numeric values. It determines initial cluster centers using hierarchical clustering and chooses the optimal number of clusters k based on two new coefficients α and β. An analysis of patient record data from a healthcare database demonstrates that the improved k-means algorithm can identify an appropriate number of clusters while dealing with issues like mixed data types.
This document summarizes a research paper that proposes a method for mining association rules from geographical points of interest data. It describes experiments conducted on point of interest data from Luoyang, China. The experiments involved (1) generating transactional data by spatially clustering the points of interest and converting each cluster to a transaction, (2) applying a novel FP-Growth algorithm called FP-GCID to generate frequent itemsets from the transaction data, and (3) ranking the association rules by mean product of probabilities to identify interesting rules. The top rules showed relationships between types of points of interest that should be considered together for deployment, such as banks and entertainment being related to catering establishments.
Practical Data Science: Data Modelling and Presentation, by HariniMS1
This document summarizes a student assignment to predict red wine quality using classification models. It describes using the wine quality dataset from UCI, preprocessing the data, exploring it visually, and training KNN and decision tree classifiers to predict wine quality. Evaluation shows the decision tree model achieved slightly higher accuracy than KNN, particularly when standard scaling was applied during modeling.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
SPSS Step-by-Step Tutorial and Statistical Guides, by Statswork
This document provides an overview of cluster analysis techniques used in marketing research. It defines cluster analysis as classifying cases into homogeneous groups based on a set of variables. Cluster analysis can be used for market segmentation, understanding buyer behaviors, and identifying new product opportunities in marketing research. The document outlines the steps to conduct cluster analysis, including selecting a distance measure and clustering algorithm, determining the number of clusters, and validating the analysis. It provides examples of hierarchical and non-hierarchical clustering methods like k-means and discusses choosing between these approaches. SPSS is used to demonstrate a cluster analysis example analyzing supermarket customer data.
One of the most important, yet often overlooked, aspects of predictive modeling is the transformation of data to create model inputs, better known as feature engineering (FE). This talk will go into the theoretical background behind FE, showing how it leverages existing data to produce better modeling results. It will then detail some important FE techniques that should be in every data scientist’s tool kit.
The team evaluated various machine learning classifiers on the MNIST handwritten digits dataset. They found that preprocessing like de-skewing improved classifier accuracy. Dimensionality reduction using PCA captured most variance with around 50 components. Linear classifiers achieved around 85% accuracy, while KNN and neural networks performed best at 97% accuracy. Deskewing helped reduce confusion between certain digits for all classifiers.
The document provides an overview of different machine learning algorithms used to predict house sale prices in King County, Washington using a dataset of over 21,000 house sales. Linear regression, neural networks, random forest, support vector machines, and Gaussian mixture models were applied. Neural networks with 100 hidden neurons performed best with an R-squared of 0.9142 and RMSE of 0.0015. Random forest had an R-squared of 0.825. Support vector machines achieved 73% accuracy. Gaussian mixture modeling clustered homes into three groups and achieved 49% accuracy.
Detection of Fraudulent Behavior in Water Consumption Using a Data Mining Base..., by pandavaTirumala
This document discusses detecting fraudulent water consumption behavior using data mining models. It proposes using decision tree and Bayesian classification techniques to analyze customer usage data and identify abnormal patterns indicative of fraud. The existing system causes losses from non-technical water losses. The proposed system focuses on applying decision trees and Bayesian classifiers to historical meter data to create a model for identifying suspicious fraudulent customers based on their water usage patterns.
An Artificial Immune Network for Multimodal Function Optimization on Dynamic ..., by Fabricio de França
The document proposes an artificial immune network called dopt-aiNet for solving multimodal optimization problems in dynamic environments. dopt-aiNet is inspired by the immune system and uses clonal selection, mutation, and suppression techniques to maintain diversity and track moving optima. Numerical experiments show that dopt-aiNet outperforms other algorithms in terms of accuracy, convergence speed, and ability to track changing optima using fewer function evaluations. The paper discusses areas for future work such as improving suppression algorithms and studying the impact of different mutation operators.
Walk through the steps of K-means and hierarchical clustering, and see how scaling affects the clustering under agglomerative and divisive modes.
Do let me know if anything is required. Ping me at google #bobrupakroy
This document describes an analysis of forest cover type data using three decision tree algorithms: Naive Bayes Tree, Reduced Error Pruning Tree, and J48 Tree. The goal is to determine which algorithm yields the highest classification accuracy and the optimal parameter settings and bin sizes for each algorithm. The data is explored and preprocessed, including binning real-valued features. Experiments are conducted to evaluate accuracy on training and test sets for different parameter settings and bin sizes. Results show bin sizes between 20-50 yield higher accuracy, and optimal training set parameters are not necessarily optimal for the test set.
Clustering is an unsupervised learning technique used to group unlabeled data points into clusters based on similarity. It is widely used in data mining applications. The k-means algorithm is one of the simplest clustering algorithms that partitions data into k predefined clusters, where each data point belongs to the cluster with the nearest mean. It works by assigning data points to their closest cluster centroid and recalculating the centroids until clusters stabilize. The k-medoids algorithm is similar but uses actual data points as centroids instead of means, making it more robust to outliers.
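The contrast drawn above between k-means and k-medoids (a mean versus an actual data point as the cluster center) can be made concrete. A minimal sketch in Python, with made-up one-dimensional data containing an outlier:

```python
def medoid(points):
    """The data point with the smallest total distance to all other points."""
    return min(points, key=lambda p: sum(abs(p - q) for q in points))

group = [10.0, 11.0, 12.0, 100.0]       # 100.0 is an outlier

center_mean = sum(group) / len(group)   # dragged far toward the outlier
center_medoid = medoid(group)           # stays an actual, central data point
```

The mean lands at 33.25, nowhere near the bulk of the data, while the medoid is 11.0, one of the real observations; this is exactly why k-medoids is more robust to outliers than k-means.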
This document is a machine learning class assignment submitted by Trushita Redij to their supervisor Abhishek Kaushik at Dublin Business School. The assignment discusses data preprocessing techniques, decision trees, the Chinese Restaurant algorithm, and building supervised learning models. Specifically, linear regression and KNN classification models are implemented on population data from Ireland to predict total population and classify countries.
The document discusses using OpenCL to accelerate genomic analysis through parallelization. It introduces OpenCL and provides examples of using it to parallelize algorithms for copy number inference in tumors, computing relatedness between individuals, and performing variable selection in regression. Key applications discussed include hidden Markov models for copy number inference, principal component analysis on relatedness matrices, and coordinate descent algorithms for lasso regression. Performance gains of up to 155x are reported for the parallel implementations compared to serial code.
This document discusses and compares five predictive data mining techniques: principal component analysis, correlation coefficient analysis, principal component regression, nonlinear partial least squares, and linear regression. It first provides background on data acquisition, preparation, and preprocessing techniques. It then describes each predictive technique, including how they handle issues like collinearity in datasets. Finally, it discusses how these techniques will be applied to four different datasets and the results compared to determine which technique best predicts the response variable while reducing variables.
This document describes using decision trees and linear regression for a statistical learning project on housing data. It discusses building decision trees and regression trees on latitude, longitude and other variables to predict housing prices. Linear regression performs poorly with an R-squared of 0.24, while regression trees more accurately identify areas with above-median home values. Further optimizing the regression tree with additional variables like income and population improves the model fit and predictions.
The document provides an overview of clustering methods and algorithms. It defines clustering as the process of grouping objects that are similar to each other and dissimilar to objects in other groups. It discusses existing clustering methods like K-means, hierarchical clustering, and density-based clustering. For each method, it outlines the basic steps and provides an example application of K-means clustering to demonstrate how the algorithm works. The document also discusses evaluating clustering results and different measures used to assess cluster validity.
Database Marketing - Dominick's Stores in the Chicago District, by Demin Wang
Determined two courses for the Dominick's transactional database analysis: one performed at the corporate level to facilitate a variety of corporate planning activities, and the other at the category level to improve sales performance and expand product offerings.
• Extracted one year of sales data from 109 Dominick's stores in the Chicago district and merged it with store demographic data.
• Analyzed the data in SAS using segmentation analysis (creating groups of stores similar in performance), response analysis (finding targetable characteristics of the identified store groups), and model validation (evaluating model performance on a 20% hold-out sample).
• Presented the results in a 25-page report, which discussed the evaluation of potential locations for a new store and the choice of stores in which to test-market a new product.
2. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
2
3. 1. OBJECTIVE
A. Creation of 2 sets of clusters: K-Means & Hierarchical
B. The clusters should be based on the mix of sales by:
i. Category and
ii. Avg. sales per sq. foot of space
3
5. 2. METHODOLOGY
a. Exploratory Data Analysis
The MEANS Procedure
Variable    N    N Miss  Minimum   Mean      Maximum   Std Dev  Sum
Cat1        515  0       120.00    231.82    340.00    66.61    119386.00
Cat2        515  0       52.00     150.82    247.00    56.66    77672.00
Cat3        515  0       33.00     81.60     212.00    28.44    42022.00
Cat4        515  0       90.00     134.37    166.00    20.21    69201.00
Sale        515  0       380.00    598.60    838.00    83.49    308281.00
Size        515  0       1200.00   2933.45   3650.00   437.20   1510725.00
Avg_Sales   515  0       0.11      0.21      0.50      0.05     108.34
• Since the given variables Cat1 – Cat4 are in absolute terms, additional variables PCAT1 – PCAT4 were calculated as percentages, to make them easier to interpret as relative variables
• Avg_Sales was also calculated as an additional variable
• Avg_Sales = Sale / Size
5
6. METHODOLOGY
a. Exploratory Data Analysis
Overall analysis
a. Sales from Category 1 are the highest amongst the four categories. Hence, Category 1 is the dominant category.
b. However, the standard deviation of Category 1 sales is also the highest amongst the four categories.
c. The standard deviation of store Size is 437.20, which is relatively high.
d. The mean store size across both states is 2933 sq. feet, the maximum being 3650 sq. feet.
e. Assuming that the Sale figures are in '000, the average sale figure per sq. foot across all categories in all stores is 210.
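The arithmetic behind point (e) can be checked directly; a small illustrative Python sketch using the PROC MEANS figures above:

```python
# Point (e): with Sale assumed to be in '000, the mean per-store ratio
# Avg_Sales = Sale / Size (0.21 from PROC MEANS) scales to ~210 per sq. foot.
mean_avg_sales = 0.21
per_sq_foot = mean_avg_sales * 1000
print(round(per_sq_foot, 1))  # 210.0

# The ratio of the overall means gives a slightly different figure, since a
# mean of per-store ratios is not the same as a ratio of means.
print(round(598.60 / 2933.45 * 1000, 1))  # 204.1
```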
6
7. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Creating additional variable: 'avg. sale per sq. foot' , PCAT1 PCAT2
PCAT3 PCAT4** ;
Data Stores_1 ;
Set Stores ;
Avg_Sales = Sale / Size ;
Run;
Data Stores_1 ;
Set Stores_1 ;
PCAT1 = (Cat1 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT2 = (Cat2 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT3 = (Cat3 / (Cat1+Cat2+Cat3+Cat4))*100 ;
PCAT4 = (Cat4 / (Cat1+Cat2+Cat3+Cat4))*100 ;
Run;
7
8. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Perform EDA** ;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
Run;
Proc Sort Data = Stores_1 ;
By State ;
Run;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var Cat1 Cat2 Cat3 Cat4 Sale Size Avg_Sales;
By State ;
Run;
8
9. 2. METHODOLOGY
a. Exploratory Data Analysis
SAS Code:
**Perform EDA** ;
Proc Means Data = Stores_1 N NMiss Min Mean Max Std Sum ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Sale Size Avg_Sales;
By State ;
Run;
Proc FREQ Data = Stores_1 ;
Table State ;
Run ;
9
10. METHODOLOGY
a. Exploratory Data Analysis
State=KA
Variable    N    N Miss  Minimum   Mean     Maximum   Std Dev  Sum
PCAT1       282  0       20.44     38.19    62.91     8.91     10769.81
PCAT2       282  0       9.05      25.00    44.18     8.17     7049.84
PCAT3       282  0       4.76      13.73    33.07     5.01     3872.52
PCAT4       282  0       12.67     23.08    37.16     4.83     6507.84
Sale        282  0       380.00    594.77   838.00    83.67    167724.00
Size        282  0       1550.00   2935.23  3650.00   424.51   827735.00
Avg_Sales   282  0       0.12      0.21     0.45      0.05     58.83

State=TN
Variable    N    N Miss  Minimum   Mean     Maximum   Std Dev  Sum
PCAT1       233  0       20.58     38.70    59.25     8.67     9016.26
PCAT2       233  0       9.06      24.86    40.86     8.37     5791.24
PCAT3       233  0       4.86      13.81    25.23     4.72     3217.32
PCAT4       233  0       12.05     22.64    38.03     4.56     5275.18
Sale        233  0       395.00    603.25   796.00    83.21    140557.00
Size        233  0       1200.00   2931.29  3650.00   452.99   682990.00
Avg_Sales   233  0       0.11      0.21     0.50      0.05     49.51
A state-wise analysis of the variables reveals broadly the same patterns for both states, KA and TN.
Category 1 remains the dominant category in both states.
Although the average store size in the two states is roughly the same, a comparison of the minimum store sizes shows that state TN has a few smaller stores than state KA.
10
11. METHODOLOGY
a. Exploratory Data Analysis
The ranking of the four product categories at these stores is the same in both states, i.e., sales of Cat1 > Cat2 > Cat4 > Cat3.
The mean sale in state TN is higher than in state KA, though not significantly. This is due to the lower count of stores in TN compared to KA, because of which TN shows a slightly higher mean sale in spite of having lower total sales.
The total count of stores is higher in state KA (55%) than in state TN (45%).
The total volume of sales in state KA is higher than in state TN, which is as expected given KA's higher store count.
It may therefore be inferred that a few stores in state TN are smaller than the mean store size across both states, and that the average sale per sq. foot in these stores is high.
The average sale per sq. foot is roughly the same in both states.
11
13. 2. METHODOLOGY
b. Data Preparation
i. Scaling
The data was scaled, i.e., the following variables were standardised in order to bring them to a comparable level:
a. PCAT1
b. PCAT2
c. PCAT3
d. PCAT4
e. Avg_Sales
SAS Code:
**SCALING in order to standardize the variables** ;
Proc Standard Data = Stores_1 Mean = 0 Std = 1 Out = Store_2;
Var PCAT1-PCAT4 Avg_Sales;
Run;
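For readers outside SAS, the same z-score scaling can be sketched in plain Python (toy values standing in for one variable; the SAS step standardizes PCAT1–PCAT4 and Avg_Sales together):

```python
from statistics import mean, stdev

# Toy values standing in for one of the variables (e.g. PCAT1)
pcat1 = [38.2, 41.0, 33.5, 45.1, 36.0]

# Standardize to mean 0 and standard deviation 1, mirroring
# PROC STANDARD ... Mean = 0 Std = 1 (stdev uses n-1, as SAS does by default)
m, s = mean(pcat1), stdev(pcat1)
scaled = [(x - m) / s for x in pcat1]
```

After this step each variable contributes on a comparable scale to the Euclidean distances used by K-means.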
13
14. 2. METHODOLOGY
b. Data Preparation
ii Weighting
The variable ‘Avg Sales Per Sq. Foot’ was weighted over several iterations, as follows:

Iteration #   Weight Assigned
1             2
2             3
3             4
4             5

Summary of the results of the weighting iterations performed:
(Detailed results for all the iterations are available on the path: ‘Y:\Assignment - Clustering\Weighting’)
Cluster Summary: Iteration 1 (W=2)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        127        0.8789       4.2013                    3                3.3953                      3.86
2        3          1.1296       2.3982                    1                7.4713                      6.61
3        184        0.7815       3.2693                    4                2.5422                      3.25
4        201        0.9511       4.6159                    3                2.5422                      2.67

Cluster Summary: Iteration 2 (W=3)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        218        0.9394       4.5008                    2                3.0989                      3.30
2        212        1.0469       3.97                      1                3.0989                      2.96
3        82         0.9892       4.6592                    1                4.5079                      4.56
4        3          1.2361       2.6172                    3                10.062                      8.14
14
15. 2. METHODOLOGY
b. Data Preparation
Cluster Summary: Iteration 3 (W=4)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        3          1.3713       2.8962                    4                13.3713                     9.75
2        226        1.1425       4.8821                    3                4.0108                      3.51
3        205        0.9991       4.4853                    2                4.0108                      4.01
4        81         1.1858       5.6984                    3                5.8659                      4.95

Cluster Summary: Iteration 4 (W=5)
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        202        1.0857       4.4648                    2                4.95                        4.56
2        229        1.241        5.8923                    1                4.95                        3.99
3        3          1.5277       3.2196                    4                16.71                       10.94
4        81         1.4006       6.8317                    1                7.28                        5.20
The Ratio mentioned above has been calculated using the Difference in Centroids (M) method, where:
M = D / d1
D = average distance between cluster centroids
d1 = average distance between cluster members and their centroid
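As a worked check (an observation about the tables, not stated in the slides): the tabulated Ratio values can be reproduced by dividing each cluster's Distance Between Centroids figure by its RMS Std Deviation. A minimal Python sketch:

```python
# Worked check of the Ratio column: the tabulated ratios match
# distance-to-nearest-centroid divided by RMS standard deviation.
def cluster_ratio(dist_to_nearest, rms_std):
    # Larger ratios mean the cluster is tight relative to its separation
    return dist_to_nearest / rms_std

# Cluster 1 of the W=2 iteration: 3.3953 / 0.8789
print(round(cluster_ratio(3.3953, 0.8789), 2))  # 3.86
# Cluster 2 of the W=2 iteration: 7.4713 / 1.1296
print(round(cluster_ratio(7.4713, 1.1296), 2))  # 6.61
```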
15
16. 2. METHODOLOGY
b. Data Preparation
SAS Code:
*1. Iteration 1 : Weight = 2* ;
Data Store_3 ;
Set Store_2 ;
Avg_Sales2 = Avg_Sales*2 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_3 Out = Cluster_1 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales2;
Run;
*2. Iteration 2 : Weight = 3* ;
Data Store_4 ;
Set Store_3 ;
Avg_Sales3 = Avg_Sales*3 ;
Run;
16
17. 2. METHODOLOGY
b. Data Preparation
SAS Code:
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_4 Out = Cluster_2 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales3;
Run;
*3. Iteration 3 : Weight = 4* ;
Data Store_5 ;
Set Store_4 ;
Avg_Sales4 = Avg_Sales*4 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_5 Out = Cluster_3 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales4;
Run;
17
18. 2. METHODOLOGY
b. Data Preparation
SAS Code:
*4. Iteration 4 : Weight = 5* ;
Data Store_6 ;
Set Store_5 ;
Avg_Sales5 = Avg_Sales*5 ;
Run;
**Running the clustering procedure based on K-Means** ;
Proc FastClus Data = Store_6 Out = Cluster_4 Maxclusters = 4 Converge = 0 Maxiter = 20 ;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
18
20. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS) : Creation of Preliminary Clusters
For detailed results of the preliminary cluster analysis and diagnostic plots, please refer to the path: Y:\Assignment - Clustering\Preliminary_Analysis_Outliers.xlsx
Cluster Summary
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids
1        14         0.7555       2.7913                    14               2.1932
2        32         0.6678       3.2557                    5                1.9093
3        1          .            0                         11               4.3786
4        60         0.8155       3.286                     19               2.2772
5        28         0.6731       2.5255                    2                1.9093
6        61         0.7495       2.7836                    19               2.6713
7        1          .            0                         13               4.3876
8        67         0.7811       2.8286                    10               2.2277
9        42         0.7811       2.4529                    4                2.8634
10       46         0.6996       2.5278                    8                2.2277
11       29         0.7186       2.5871                    5                2.2318
12       1          .            0                         13               3.9468
13       1          .            0                         12               3.9468
14       28         0.6919       2.5985                    1                2.1932
15       5          0.6989       1.9953                    18               2.3899
16       21         0.6852       2.6402                    18               2.1146
17       27         0.6957       2.2399                    5                2.198
18       9          0.6001       1.7932                    16               2.1146
19       29         0.7757       2.7927                    4                2.2772
20       13         0.6923       2.3191                    16               2.3297
Hence, Clusters 3, 7, 12 and 13 appear to be outliers, each containing only a single observation.
The remaining clusters appear to be reasonably sized.
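The singleton screen used here can be sketched outside SAS as well (illustrative Python; the labels are made up):

```python
from collections import Counter

# After a preliminary K-means run with a deliberately large number of clusters,
# clusters containing a single observation flag likely outliers.
labels = [1, 1, 2, 3, 1, 2, 4, 2, 2, 1]   # hypothetical cluster assignments
freq = Counter(labels)
singleton_clusters = [c for c, n in freq.items() if n == 1]
print(singleton_clusters)  # [3, 4]
```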
20
21. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
The following are the details of the clusters that have been identified as outliers:
Store_Num CLUSTER
36 3
225 7
360 12
179 13
Cluster=3
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 33.39 38.42 8.80 0.57
PCAT2 12.32 24.93 8.26 1.53
PCAT3 33.07 13.77 4.88 3.96
PCAT4 21.22 22.88 4.71 0.35
Avg_Sales 0.22 0.21 0.05 0.26
Cluster=7
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.14 38.42 8.80 0.20
PCAT2 32.03 24.93 8.26 0.86
PCAT3 4.76 13.77 4.88 1.85
PCAT4 23.08 22.88 4.71 0.04
Avg_Sales 0.45 0.21 0.05 4.50
Cluster=12
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 40.83 38.42 8.80 0.27
PCAT2 31.60 24.93 8.26 0.81
PCAT3 15.47 13.77 4.88 0.35
PCAT4 12.09 22.88 4.71 2.29
Avg_Sales 0.50 0.21 0.05 5.50
Cluster=13
Variable Mean Population Mean Population Std Dev Z-Score
PCAT1 45.55 38.42 8.80 0.81
PCAT2 15.66 24.93 8.26 1.12
PCAT3 20.11 13.77 4.88 1.30
PCAT4 18.68 22.88 4.71 0.89
Avg_Sales 0.47 0.21 0.05 4.91
21
22. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
Store_Num  CLUSTER  Cat1  Cat2  Cat3  Cat4  Size  Sale  State  Avg_Sales  PCAT1  PCAT2  PCAT3  PCAT4
36         3        214   79    212   136   2860  641   KA     224.13     33.39  12.32  33.07  21.22
225        7        287   229   34    165   1600  715   KA     446.88     40.14  32.03  4.76   23.08
360        12       314   243   119   93    1540  769   TN     499.35     40.83  31.60  15.47  12.09
179        13       256   88    113   105   1200  562   TN     468.33     45.55  15.66  20.11  18.68
• The average size of all the stores in the data set is 2933 sq. feet. Stores 225, 360 and 179 are considerably smaller than this.
• The average sale per sq. foot across all stores is 210, whereas for stores 225, 360 and 179 it is more than double the overall mean. This is due to the smaller size of these stores compared to the other stores.
• For store 36, CAT3 accounts for 33% of the store's total sales, whereas the average share of CAT3 across all stores is approx. 14%.
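The z-score screen behind these observations can be sketched as follows (values taken from the tables above; the slide's 4.50 for store 225 presumably comes from unrounded inputs):

```python
# Flag a store when a variable sits several population standard deviations
# away from the population mean.
def z_score(value, pop_mean, pop_std):
    return abs(value - pop_mean) / pop_std

# Store 225: Avg_Sales = 0.45 vs. population mean 0.21, std 0.05
z = z_score(0.45, 0.21, 0.05)
print(round(z, 1))   # 4.8 with these rounded inputs (the slide reports 4.50)
print(z > 3)         # True -- well beyond a 3-sigma threshold
```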
22
25. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
**Performing the Clustering procedure using K-Means with iterations to determine the optimal no. of clusters** ;
*Conducting a preliminary cluster analysis to detect outliers, if any* ;
Proc Fastclus Data = Store_6 Out = Cluster_Prelim Maxclusters = 20 Converge = 0 Outstat=Stat_Prelim_0;
Var PCAT1-PCAT4 Avg_Sales5;
Run;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
25
26. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Cluster_Prelim ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Preparation of data set obtained from merging procedures in order to make a cluster wise analysis of
the outliers, if any* ;
Proc Sort Data = Cluster_Prelim ;
By Cluster;
Run;
Data Cluster_Pre_1 ;
Set Cluster_Prelim ;
Keep Store_Num Cluster ;
Run;
26
27. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Export Data = Cluster_Pre_1 outfile = 'Y:\Assignment - Clustering\Cluster_Pre_1.csv'
DBMS=CSV Replace ;
Run;
*Merging data set named Cluster_Pre_1 with data set Stores_1* ;
Proc Sort Data = Cluster_Pre_1 ;
By Store_Num ;
Run;
Proc Sort Data = Stores_1 ;
By Store_Num;
Run;
Data Store_1_Merged ;
Merge Cluster_Pre_1 (in=a) Stores_1 (in=b) ;
By Store_Num ;
If a and b ;
Run;
Proc Export Data = Store_1_Merged Outfile = 'Y:\Assignment - Clustering\Store_1_Merged.csv'
DBMS = CSV Replace ;
Run;
27
28. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers
SAS Code:
Proc Sort Data = Store_1_Merged ;
By Cluster ;
Run;
Proc Means Data = Store_1_Merged Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
By Cluster ;
Where Cluster IN(3,7,12,13) ;
Run;
Proc Means Data = Stores_1 Mean Std ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales ;
Run;
Proc Means Data = Stores_1 Mean ;
Var Size Avg_Sales ;
Run;
28
29. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
An alternative approach for detection and treatment of outliers was attempted.
The following are the steps that were undertaken for the process of detection and treatment of outliers:
STEP 1: Run PROC FASTCLUS with many clusters and OUTSEED = output data set for the diagnostic plot.
(Detailed results are available on the path: ‘Y:\Assignment - Clustering\Prelim_Analysis_Step1_Mean1.xlsx’)
29
30. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
STEP 2: Remove low-frequency clusters.
The data set MEAN1, generated in the step above, was used to remove low-frequency clusters (frequency < 5); clusters with a frequency of 5 or more were retained for subsequent analysis.
The data set with clusters of frequency 5 or more was named 'Seed1'.
STEP 3: PROC FASTCLUS was run again, selecting seeds from the high-frequency clusters obtained in data set SEED1 in Step 2 above, using the LEAST = 1 clustering criterion.
The value of LEAST should be < 2 in order to reduce the effect of outliers on cluster centers.
(Detailed results are available on the path: ‘Y:\Assignment - Clustering\Prelim_Analysis_Step3_LEAST.xlsx’)
STEP 4: PROC FASTCLUS was run again, selecting seeds from the high-frequency clusters in the previous analysis, with STRICT = 3 to prevent outliers from distorting the results.
The value STRICT = 3 was chosen to be close to the _GAP_ and _RADIUS_ values of the larger clusters in the diagnostic plots.
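The rationale for LEAST < 2 can be illustrated with a toy one-dimensional example (illustrative Python, not from the slides): the least-squares (L2) centre of a set is its mean, which an outlier drags toward itself, while the L1-optimal centre is its median, which barely moves.

```python
import statistics

values = [10, 11, 12, 13, 95]          # four typical stores plus one extreme
print(statistics.mean(values))         # 28.2 -- the L2 centre, pulled by the outlier
print(statistics.median(values))       # 12   -- the L1 centre, essentially unaffected
```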
30
31. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
However, the STRICT option for PROC FASTCLUS is not supported in the present version of WPS.
Consequently, a final PROC FASTCLUS run, assigning the outliers and tails to clusters using the seeds that the STRICT run above would have generated, could not be performed.
SAS Code:
***Another method for identification and treatment of outliers*** ;
*STEP 1 : Run PROC FASTCLUS with many clusters and OUTSEED = output data set for
diagnostic plot*;
Proc Fastclus Data = Store_6 Outseed = Mean1 Maxclusters = 20 Maxiter = 0 Summary ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Axis1 Label = (Angle=90 Rotate=0) Minor=None Order=(0 to 10 by 2) ;
Axis2 minor = None ;
Proc Gplot Data = Mean1 ;
Plot _GAP_*_FREQ_ _RADIUS_*_FREQ_ / Overlay Frame
cframe = ligr vaxis = axis1 haxis=axis2 legend= legend1 ;
Run; 31
32. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an
alternative approach
SAS Code:
*Step 2 :Remove Low Frequency clusters* ;
Data Seed1 ;
Set Mean1 ;
If _FREQ_ >=5 ;
Run;
*Step 3 : Run Proc Fastclus again selecting seeds from high frequency clusters in previous analysis using
LEAST = 1 Clustering Criterion since value < 2 reduce the effect of outliers on cluster centers* ;
Proc FASTCLUS Data = Store_6 Seed = Seed1 Maxclusters = 8 Least = 1 Out = Store_6_Least ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Legend1 Frame Cframe = ligr Label = None CBorder = Black
Position=Center Value= (Justify=Center) ;
Axis1 Label =(Angle=90 Rotate=0) Minor=None ;
Axis2 Minor=None ;
Proc Gplot Data = Store_6_Least ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
32
33. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Detection of Outliers – an alternative approach
SAS Code:
Proc Gplot Data = Store_6_Least ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Store_6_Least ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe=ligr
Legend=Legend1 vaxis=axis1 haxis=axis2 ;
Run;
*Step 4 : Run Proc Fastclus again, selecting seeds from high frequency clusters in previous analysis with STRICT = 3 to prevent the outliers from distorting the results* ;
*Value of STRICT = 3 is chosen to be close to the _GAP_ & _RADIUS_ values of the large clusters in the diagnostic plot* ;
Proc Fastclus Data = Store_6 Seed = Seed1 Maxclusters = 8 Strict=3 out = Store_6_Strict Outseed = Mean2 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
33
34. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
Performing iterations to determine the appropriate number of clusters using K-Means (PROC FASTCLUS)
From the procedure run in Step 1 of the alternative outlier-detection method discussed on the preceding slide, it was found that 8 or more clusters could give meaningful cluster formation.
Hence, the iterations below use Maxclusters = 8 to 10.
Clustering is performed on the data set from which the outliers have been removed.
Iteration 1 : Maxclusters = 8
Pseudo F Statistic: 451.71
Approx. expected overall R-Square: 0.84
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_1_Maxclust_8.xlsx

Iteration 2 : Maxclusters = 9
Pseudo F Statistic: 420.29
Approx. expected overall R-Square: 0.87
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_2_Maxclust_9.xlsx

Iteration 3 : Maxclusters = 10
Pseudo F Statistic: 391.10
Approx. expected overall R-Square: 0.88
Detailed output and plots on path: Y:\Assignment - Clustering\Iteration_3_Maxclust_10.xlsx
Points considered in comparing the three iterations above:
1. Relatively large values of the pseudo F statistic indicate a stopping point.
2. Higher values of overall R-Square are desirable.
3. Increasing the number of clusters, although there is not much differentiation amongst the iterations, means devising more marketing strategies unique to each cluster. On a cost vs. benefit basis, it is preferable to have a smaller number of clusters.
Hence, iteration 2, in which 9 clusters are formed, seems most appropriate in the present case.
34
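The pseudo F statistic compared above (also known as the Calinski–Harabasz index) can be computed by hand. A minimal Python sketch on synthetic one-dimensional data, not the store data:

```python
from statistics import mean

def pseudo_f(points, labels):
    # pseudo F = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k));
    # larger values indicate tighter, better-separated clusters.
    clusters = sorted(set(labels))
    grand = mean(points)
    wss = bss = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        cm = mean(members)
        wss += sum((p - cm) ** 2 for p in members)
        bss += len(members) * (cm - grand) ** 2
    k, n = len(clusters), len(points)
    return (bss / (k - 1)) / (wss / (n - k))

# Two well-separated toy clusters yield a very large pseudo F
x = [0, 1, 2, 50, 51, 52]
print(pseudo_f(x, [0, 0, 0, 1, 1, 1]))  # 3750.0
```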
35. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
*Deleting the outliers found (in the procedures above) from the scaled and weighted data set* ;
Data Store_6_Final ;
Set Store_6 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
*Iteration 1 : Maxclusters = 8 * ;
Proc FastClus Data = Store_6_Final Maxclusters = 8 Maxiter= 20 Converge = 0 Out=Clusters_8 ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
35
39. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
*Generating Plots of clusters*;
Legend1 Frame Cframe = Ligr Label=none Cborder=Black
Position = Center Value = (Justify=Center) ;
Axis1 Label = (Angle=90 Rotate=0) Minor=None ;
Axis2 Minor = None ;
Proc Gplot Data = Clusters_10 ;
Plot PCAT1*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT2*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
39
40. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
Proc Gplot Data = Clusters_10 ;
Plot PCAT3*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
Proc Gplot Data = Clusters_10 ;
Plot PCAT4*Avg_Sales5 = Cluster / Frame Cframe = Ligr
Legend = Legend1 vaxis = axis1 haxis=axis2 ;
Run;
*Merging the data sets for analysis of the final clusters formed* ;
Proc Sort Data = Stores_1 ;
By Store_Num ;
Run;
Data Stores_1_Final ;
Set Stores_1 ;
If Store_Num IN(36 179 225 360) Then Delete ;
Run;
40
41. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Clustering of the data set treated for outliers
SAS Code:
Data Cluster_9_Final ;
Set Clusters_9 ;
Keep Store_Num Cluster ;
Run;
Proc Sort Data = Cluster_9_Final ;
By Store_Num ;
Run;
Data Stores_1_Final_Merged ;
Merge Stores_1_Final (in=a) Cluster_9_Final (in=b);
By Store_Num ;
If a and b ;
Run;
41
43. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
Cluster Summary
Cluster  Frequency  RMS Std Dev  Max Distance Seed to Obs  Nearest Cluster  Distance Between Centroids  Ratio
1        46         0.8678       3.424                     3                3.3888                      3.91
2        57         0.9084       3.2855                    6                3.7655                      4.15
3        83         0.819        2.9486                    5                2.8481                      3.48
4        41         0.7917       2.8507                    6                2.8616                      3.61
5        99         0.8254       2.6116                    3                2.8481                      3.45
6        67         0.773        3.0078                    4                2.8616                      3.70
7        81         0.7999       2.9993                    4                2.9345                      3.67
8        7          0.8229       2.68                      9                3.2939                      4.00
9        30         0.7436       2.6002                    8                3.2939                      4.43
• Ratio has been calculated using the ‘Difference in Centroids’ method as D / d1, where:
D = average distance between cluster centroids
d1 = average distance between members and their cluster centroid
• The ratio thus signifies the strength of the clusters formed: it is a measure of the homogeneity within a cluster compared to the heterogeneity outside it.
• Cluster 9 is the strongest of all the clusters, followed by Clusters 2 and 8.
43
44. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling of the clusters
The 9 clusters obtained in the preliminary cluster analysis have been evaluated and profiled as under, in order to gain insight into the variables that are most dominant in the cluster formation. (Detailed output on path ‘Y:\Assignment - Clustering\Iteration_2_Maxclust_9.xlsx’)
Cluster=1
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 46 39.92 38.41 8.82 0.17
PCAT2 46 27.07 24.95 8.25 0.26
PCAT3 46 12.58 13.73 4.80 0.24
PCAT4 46 20.42 22.91 4.70 0.53
Avg_Sales_Final 46 272.14 208.80 48.77 1.30
Cluster=2
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 57 33.41 38.41 8.82 0.57
PCAT2 57 20.16 24.95 8.25 0.58
PCAT3 57 16.95 13.73 4.80 0.67
PCAT4 57 29.49 22.91 4.70 1.40
Avg_Sales_Final 57 142.44 208.80 48.77 1.36
Cluster=3
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 83 39.70 38.41 8.82 0.15
PCAT2 83 26.23 24.95 8.25 0.16
PCAT3 83 13.58 13.73 4.80 0.03
PCAT4 83 20.49 22.91 4.70 0.52
Avg_Sales_Final 83 236.59 208.80 48.77 0.57
Cluster=4
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 41 31.45 38.41 8.82 0.79
PCAT2 41 20.68 24.95 8.25 0.52
PCAT3 41 21.16 13.73 4.80 1.55
PCAT4 41 26.71 22.91 4.70 0.81
Avg_Sales_Final 41 183.11 208.80 48.77 0.53
Cluster=5
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 99 37.08 38.41 8.82 0.15
PCAT2 99 28.40 24.95 8.25 0.42
PCAT3 99 12.62 13.73 4.80 0.23
PCAT4 99 21.90 22.91 4.70 0.22
Avg_Sales_Final 99 207.18 208.80 48.77 0.03
Cluster=6
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
44
45. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
Cluster=7
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Cluster=8
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Cluster=9
Variable N Mean Population Mean Population Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
45
46. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
1. 7% of the total stores (Clusters 8 and 9, with 7 and 30 stores respectively) have average sales per sq. foot significantly higher than the overall average.
2. 11% of the total stores (Cluster 2, with 57 stores) have significantly higher than average sales in the category of Tobacco & Alcohol.
3. 16% of the total stores (Cluster 3, with 83 stores) have lower than average sales in the category of Tobacco & Alcohol, though the difference is not significant.
4. 32% of the total stores (Clusters 5 and 6, with 99 and 67 stores respectively) have higher than average sales in the category of Frozen Foods. Average Frozen Foods sales in Cluster 6 are significantly higher than the overall mean for that category.
46
47. The FREQ Procedure
Table of CLUSTER by State (frequency and percent of total)

CLUSTER  KA            TN            Total
1        30  (5.87)    16  (3.13)    46  (9.00)
2        33  (6.46)    24  (4.70)    57  (11.15)
3        44  (8.61)    39  (7.63)    83  (16.24)
4        25  (4.89)    16  (3.13)    41  (8.02)
5        49  (9.59)    50  (9.78)    99  (19.37)
6        37  (7.24)    30  (5.87)    67  (13.11)
7        43  (8.41)    38  (7.44)    81  (15.85)
8        3   (0.59)    4   (0.78)    7   (1.37)
9        16  (3.13)    14  (2.74)    30  (5.87)
Total    280 (54.79)   231 (45.21)   511 (100.00)
2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
The overall bias in the number of stores is towards the state KA with 55% of the total stores being in KA.
No other significant pattern in the distribution of stores has emerged.
47
48. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters
Analysis Variable: Size
Cluster   N Obs   N   Mean   Population Mean   Population Std Dev   Z-Score   Minimum   Maximum
1 46 46 2471.52 2942.32 423.52 1.11 1700 2910
2 57 57 3334.74 2942.32 423.52 0.93 2550 3650
3 83 83 2761.39 2942.32 423.52 0.43 1925 3330
4 41 41 3040.98 2942.32 423.52 0.23 2180 3650
5 99 99 2985.56 2942.32 423.52 0.10 2000 3610
6 67 67 3184.63 2942.32 423.52 0.57 2200 3630
7 81 81 3172.10 2942.32 423.52 0.54 2600 3650
8 7 7 1977.14 2942.32 423.52 2.28 1550 2150
9 30 30 2205.33 2942.32 423.52 1.74 1750 2520
Approx. 7% of all the stores have a mean size significantly lower than the overall average size of all the stores.
The split of these stores between the two states is roughly even, and no discernible pattern is observed.
48
49. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9
clusters)
49
50. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Diagnostic Plots for Iteration 2 (9
clusters)
50
51. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
SAS Code:
*Profiling the 9 clusters obtained in the preceding procedures* ;
Proc Sort Data = Stores_1_Final_Merged ;
By Cluster ;
Run;
Data Stores_1_Final_Merged ;
Set Stores_1_Final_Merged;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
51
52. 2. METHODOLOGY
c. Preliminary Cluster Analysis (PROC FASTCLUS): Evaluating & profiling the clusters – Iteration 2 (9 clusters)
SAS Code:
Proc Means Data = Stores_1_Final_Merged N ;
Class State ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_1_Final_Merged ;
Tables Cluster * State / nocol norow ;
Run;
Proc Means Data = Stores_1_Final_Merged N Mean ;
Var Size ;
Class Cluster ;
Run;
Proc Means Data = Stores_1_Final_Merged Mean Std ;
Var Size ;
Run;
52
53. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
53
54. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER)
With the 9 outlier-treated clusters obtained from the preliminary cluster analysis using the PROC FASTCLUS procedure (K-Means
clustering), Hierarchical Clustering is performed next using the PROC CLUSTER procedure to obtain the final no. of clusters.
The following methods are used for Hierarchical Clustering:
Note:
K, the smoothing parameter, is the no. of clusters obtained in the preliminary cluster analysis.
Literature suggests using n^0.3 preliminary clusters, where n = no. of observations in the original data set.
Hence, n^0.3 = 511^0.3 ≈ 6.5, so the analysis in the Density method has been done for 7 <= K <= 9.
S.No   Method           # of Clusters obtained   Remarks
1      Ward's Method    3                        The scatter diagram of the clusters obtained revealed cluster formations that were not well demarcated. Also, the profiling of these 3 clusters didn't reveal any variable that was dominant in the formation of the clusters. For detailed results, refer to the tabs named 'Output_Wards' & 'Output_Final_Profiling_W'.
2      Density Method                            Ties were observed while the Density method was used. Based on the position of the Ties in the Cluster History, the clusters obtained when K=7 were finalized.
       K=7              5                        For detailed results, refer to the tabs named 'Output_Density_K7' & 'Output_Final_Profiling_D'.
       K=8              4                        For detailed results, refer to the tab named 'Output_Density_K8'.
       K=9              5                        For detailed results, refer to the tab named 'Output_Density_K9'.
54
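The n^0.3 rule of thumb quoted in the note above is easy to verify; in Python for illustration (the deck's own code is SAS):

```python
# Rule of thumb for the density method's smoothing parameter K:
# use about n**0.3 preliminary clusters, where n is the number of observations.
n = 511  # stores in the data set
k_suggested = n ** 0.3
print(round(k_suggested, 1))  # ~6.5, hence the deck's scan over 7 <= K <= 9
```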
55. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’ (refer to tab named ‘Output_Wards’)
The Cluster History from Ward’s method is as below:
Cluster History
Number of Clusters   Clusters Joined   Freq of New Cluster   Semipartial RSq   RSquared   Pseudo F   Pseudo t-Squared   Approx. Expected RSq   CCC   Tie
8                    8, 9              37                    0.0054            0.99       13,000     .                  .                      .
7                    4, 6              108                   0.0184            0.98       3436       .                  .                      .
6                    1, 3              129                   0.0301            0.95       1772       .                  .                      .
5                    CL7, 7            189                   0.0322            0.91       1343       327                .                      .
4                    CL5, 5            288                   0.0438            0.87       1131       248                .                      .
3                    2, CL4            345                   0.0952            0.77       874        346                .                      .
2                    CL6, CL8          166                   0.1266            0.65       938        585                .                      .
1                    CL2, CL3          511                   0.6483            0.00       .          938                0                      0
# of clusters according to:
Pseudo t-Squared: 3, 2
Semipartial R-Square: 8, 7, 6, 5, 4, 3
Therefore, the final # of clusters considered on the basis of the results of Ward's Method = 3.
55
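The semipartial R-squared and R-squared columns in the cluster history above are internally consistent: with all 9 preliminary clusters kept, the between-cluster R-squared is 1, and each merge lowers it by exactly that merge's semipartial R-squared. A quick check in Python (values transcribed from the table; this is an illustration, not part of the deck's SAS workflow):

```python
# Semipartial R-squared for the Ward's-method merges, 9 clusters down to 1.
sprsq = [0.0054, 0.0184, 0.0301, 0.0322, 0.0438, 0.0952, 0.1266, 0.6483]

r2 = 1.0  # between-cluster R-squared with all 9 preliminary clusters kept
for s in sprsq:
    r2 -= s             # each merge gives up exactly its semipartial R-squared
    print(round(r2, 2)) # reproduces the R-squared column: 0.99, 0.98, ..., 0.65, then ~0
```

Note the semipartial values sum to 1.0, confirming the two columns describe the same decomposition of variance.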
56. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The Tree diagram from Ward’s method is as below:
56
57. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The following are the plots obtained from Ward’s method:
57
58. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The following are the plots obtained from Ward’s method:
58
59. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
The 3 clusters obtained from Ward’s method have been profiled as below:
Analysis Variable: Cluster_Final
Cluster_Final N Obs N
1 103 103
2 242 242
3 166 166
Cluster_Final=1
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 103 36.32 38.41 8.82 0.24
PCAT2 103 23.25 24.95 8.25 0.21
PCAT3 103 15.00 13.73 4.80 0.26
PCAT4 103 25.44 22.91 4.70 0.54
Avg_Sales_Final 103 200.36 208.80 48.77 0.17
Cluster_Final=2
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 242 41.74 38.41 8.82 0.38
PCAT2 242 21.87 24.95 8.25 0.37
PCAT3 242 14.46 13.73 4.80 0.15
PCAT4 242 21.94 22.91 4.70 0.21
Avg_Sales_Final 242 222.93 208.80 48.77 0.29
Cluster_Final=3
Variable N Mean Popltn Mean Popltn Std. Dev Z-Score
PCAT1 166 34.85 38.41 8.82 0.40
PCAT2 166 30.50 24.95 8.25 0.67
PCAT3 166 11.88 13.73 4.80 0.39
PCAT4 166 22.76 22.91 4.70 0.03
Avg_Sales_Final 166 193.45 208.80 48.77 0.31
Table of Cluster_Final by State
Cluster_Final   KA            TN            Total
1               63 (12.33)    40 (7.83)     103 (20.16)
2               131 (25.64)   111 (21.72)   242 (47.36)
3               86 (16.83)    80 (15.66)    166 (32.49)
Total           280 (54.79)   231 (45.21)   511 (100.00)
(Cell format: Frequency (Percent of total))
Analysis Variable: Size
Cluster_Final   Mean Size   Popltn Mean   Popltn Std. Dev   Z-Score
1               2949.22     2942.32       423.52            0.016
2               2854.61     2942.32       423.52            0.207
3               3065.90     2942.32       423.52            0.292
Conclusion:
• Both the graphical plots and the summary stats of the 3 clusters
obtained using Ward’s method reveal no clear cluster formation.
• No particular variable was found to dominate the formation of
any of the 3 clusters.
59
60. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
**Hierarchical Clustering procedure being performed on the 9 preliminary clusters obtained using K-Means** ;
**The data set using which K-Means clustering was performed to obtain the preliminary 9 clusters has been treated for
outliers and hence doesn't contain any outliers** ;
*Ward's Method* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_W Method = Ward CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_W Horizontal Lines=(color=blue)
out = Tree_Out_9_W nclusters = 3 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_W ;
Run;
60
61. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
*Profiling of the Clusters formed using Ward's Method* ;
*Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary
cluster analysis have been mapped to the
final 3 clusters obtained by the Ward's method* ;
Data Stores_Final_Analysis_W ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 Then Cluster_Final_W = 1 ;
Else If Cluster = 5 OR Cluster = 6 Then Cluster_Final_W = 3 ;
Else If Cluster = 3 OR Cluster= 4 OR Cluster=7 OR Cluster= 8 OR Cluster= 9 Then Cluster_Final_W = 2 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_W ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N;
Var Cluster_Final_W;
Class Cluster_Final_W ;
Run;
61
62. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Ward’s Method
SAS Code:
Data Stores_Final_Analysis_W ;
Set Stores_Final_Analysis_W;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_W ;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_W ;
Tables Cluster_Final_W*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_W N Mean ;
Var Size ;
By Cluster_Final_W ;
Run;
62
64. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The output for the Density method discussed below and in the following slides is for K=7.
(Detailed output on path: ‘Y:Assignment – ClusteringWorkings.xlsx’. Refer to the tab named ‘Output_Density_K7’. For the output when
K=8 & K=9, refer to the tabs named ‘Output_Density_K8’ & ‘Output_Density_K9’.)
Cluster History
Number of Clusters   Clusters Joined   Freq of New Cluster   Semipartial RSq   RSquared   Pseudo F   Pseudo t-Squared   Approx. Expected RSq   CCC   Normalized Fusion Density   Lesser Density   Greater Density   Tie
8                    3, 5              182                   0.0324            0.97       2147       .                  .                      .     61.799                      44.7166          100
7                    CL8, 7            263                   0.0826            0.89       647        665                .                      .     38.79                       24.0617          100
6                    CL7, 1            309                   0.1255            0.76       319        335                .                      .     35.798                      21.8011          100               T
5                    CL6, 4            350                   0.0541            0.71       303        78.3               .                      .     35.798                      21.8011          100
4                    CL5, 6            417                   0.0911            0.61       269        128                .                      .     26                          14.9422          100
3                    CL4, 2            474                   0.1869            0.43       190        229                .                      .     7.2274                      3.7492           100
2                    CL3, 9            504                   0.3124            0.12       66.2       274                .                      .     6.0544                      3.1217           100
1                    CL2, 8            511                   0.1151            0.00       .          66.2               0                      0     2.1174                      1.07             100
# of clusters according to:
Pseudo t-Squared: 5, 4
Semipartial R-Square: 8, 7, 5, 4
Therefore, the final # of clusters considered in this iteration = 5.
Since the Tie occurs early in the cluster history, it should have little effect on the later stages and can hence be overlooked.
64
65. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The Tree diagram from the Density method when K=7 is as below:
65
66. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following are the plots obtained from the Density method when K=7:
66
67. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following are the plots obtained from the Density method when K=7:
67
68. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The following elaborates on the profiles of the final 5 clusters obtained from the Density method:
Analysis Variable: Cluster_Final_D
Cluster_Final_D N Obs N
1 326 326
2 67 67
3 81 81
4 7 7
5 30 30
Cluster_Final_D=1
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 326 36.80 38.41 8.82 0.18
PCAT2 326 25.25 24.95 8.25 0.04
PCAT3 326 14.69 13.73 4.80 0.20
PCAT4 326 23.26 22.91 4.70 0.07
Avg_Sales_Final 326 209.48 208.80 48.77 0.01
• No particular variable has emerged as a dominating
variable responsible for the formation of this cluster.
• Mean values of the variables in this cluster are very
near to the overall mean scores of the variables in the
data set.
Legend:
Cat1 Fresh Foods
Cat2 Frozen Foods
Cat3 Health & Beauty
Cat4 Tobacco & Alcohol
Cluster_Final_D=2
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 67 31.57 38.41 8.82 0.78
PCAT2 67 33.61 24.95 8.25 1.05
PCAT3 67 10.78 13.73 4.80 0.61
PCAT4 67 24.04 22.91 4.70 0.24
Avg_Sales_Final 67 173.18 208.80 48.77 0.73
PCAT2 (Frozen Foods) has emerged as the dominating variable
and is the most determining variable in the formation of this
cluster: these stores, nearly 13% of the total, have a mean Frozen
Foods sales share well above the overall mean of about 25%.
68
69. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
Cluster_Final_D=3
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 81 49.76 38.41 8.82 1.29
PCAT2 81 15.03 24.95 8.25 1.20
PCAT3 81 12.69 13.73 4.80 0.22
PCAT4 81 22.52 22.91 4.70 0.08
Avg_Sales_Final 81 183.23 208.80 48.77 0.52
Both PCAT1 and PCAT2 have emerged as the dominating
variables in this cluster, which covers nearly 16% of the total
no. of stores: the mean Fresh Foods share is well above the
overall mean, while the mean Frozen Foods share is well below it.
Cluster_Final_D=4
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 7 39.64 38.41 8.82 0.14
PCAT2 7 26.63 24.95 8.25 0.20
PCAT3 7 13.52 13.73 4.80 0.04
PCAT4 7 20.21 22.91 4.70 0.58
Avg_Sales_Final 7 351.02 208.80 48.77 2.92
Avg Sales per Sq. Foot has emerged as the dominating variable
in Cluster 4, with mean avg sales per sq. foot significantly higher
than the overall average; these stores account for nearly 1.4% of
the total no. of stores.
Cluster_Final_D=5
Variable N Mean Popltn Mean Popltn Std Dev Z-Score
PCAT1 30 40.27 38.41 8.82 0.21
PCAT2 30 28.77 24.95 8.25 0.46
PCAT3 30 12.71 13.73 4.80 0.21
PCAT4 30 18.25 22.91 4.70 0.99
Avg_Sales_Final 30 316.83 208.80 48.77 2.21
Avg Sales per Sq. Foot has emerged as the dominating variable
in Cluster 5, with mean avg sales per sq. foot significantly higher
than the overall average; these stores account for nearly 6% of
the total no. of stores.
69
70. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
The FREQ Procedure
Table of Cluster_Final_D by State
Cluster_Final_D   KA            TN            Total
1                 181 (35.42)   145 (28.38)   326 (63.80)
2                 37 (7.24)     30 (5.87)     67 (13.11)
3                 43 (8.41)     38 (7.44)     81 (15.85)
4                 3 (0.59)      4 (0.78)      7 (1.37)
5                 16 (3.13)     14 (2.74)     30 (5.87)
Total             280 (54.79)   231 (45.21)   511 (100.00)
(Cell format: Frequency (Percent of total))
No specific pattern has emerged in the state-wise
analysis of the clusters formed.
Analysis Variable: Size
Cluster_Final_D   Mean Size   Popltn Mean   Popltn Std. Dev   Z-Score
1 2923.97 2942.32 423.52 0.04
2 3184.63 2942.32 423.52 0.57
3 3172.1 2942.32 423.52 0.54
4 1977.14 2942.32 423.52 2.28
5 2205.33 2942.32 423.52 1.74
• The average size of the stores in Cluster 4 is much smaller
than the overall average size of the stores in the given data
set.
• Hence, the avg sales per sq. foot for stores in this cluster is
significantly higher than the overall average sales per
sq. foot.
70
71. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*Density Method* ;
*K, the smoothing parameter is the no. of clusters obtained in the preliminary cluster analysis* ;
*Literature suggests using n^0.3 preliminary clusters where n=no. of observations in the original data set*;
*Hence, n^0.3 = 511^0.3 = 6.5 . So analysis in the Density method has been done for 7<=K<=9* ;
*K = 7* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D7 Method = Density K=7 CCC Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
ID Cluster ;
Run;
Proc Tree Data = Tree_9_D7 Horizontal Lines=(color=blue)
out = Tree_Out_9_D7 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
ID Cluster ;
Run;
Proc Print Data = Tree_Out_9_D7 ;
Run;
71
72. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*Profiling of the Clusters formed using Density Method for K=7* ;
*Based on the data set generated from the TREE procedure, the 9 clusters obtained from the preliminary cluster
analysis have been mapped to the final 5 clusters obtained by the Density method* ;
Data Stores_Final_Analysis_D ;
Set Stores_1_Final_Merged ;
If Cluster = 1 OR Cluster = 2 OR Cluster = 3 OR Cluster = 4 OR Cluster = 5 Then Cluster_Final_D = 1 ;
Else If Cluster = 6 Then Cluster_Final_D = 2 ;
Else If Cluster = 7 Then Cluster_Final_D = 3 ;
Else If Cluster = 8 Then Cluster_Final_D = 4 ;
Else If Cluster = 9 Then Cluster_Final_D = 5 ;
Run ;
Proc Sort Data = Stores_Final_Analysis_D ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N;
Var Cluster_Final_D;
Class Cluster_Final_D ;
Run;
72
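The DATA step above collapses the 9 preliminary clusters into the 5 final density-method clusters. The resulting cluster sizes can be sanity-checked against the profiling tables (326, 67, 81, 7, 30) using the store counts reported earlier in the deck; a small Python sketch for illustration:

```python
# Store counts of the 9 preliminary K-means clusters (from the FREQ table).
prelim_sizes = {1: 46, 2: 57, 3: 83, 4: 41, 5: 99, 6: 67, 7: 81, 8: 7, 9: 30}

# Mapping used in the DATA step: prelim clusters 1-5 -> final 1, 6 -> 2, 7 -> 3, 8 -> 4, 9 -> 5.
mapping = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5}

final_sizes = {}
for cluster, n in prelim_sizes.items():
    final = mapping[cluster]
    final_sizes[final] = final_sizes.get(final, 0) + n

print(final_sizes)  # {1: 326, 2: 67, 3: 81, 4: 7, 5: 30}, totalling all 511 stores
```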
73. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
Data Stores_Final_Analysis_D ;
Set Stores_Final_Analysis_D;
Avg_Sales_Final = Avg_Sales * 1000 ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
By Cluster_Final_D ;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean Std;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales_Final ;
Run;
Proc Freq Data = Stores_Final_Analysis_D ;
Tables Cluster_Final_D*State / nocol norow nocum;
Run;
Proc Means Data = Stores_Final_Analysis_D N Mean ;
Var Size ;
By Cluster_Final_D ;
Run;
73
76. 2. METHODOLOGY
d. Hierarchical Clustering (PROC CLUSTER): Density Method
SAS Code:
*K = 9* ;
Proc Cluster Data = Mean_Clusters_9 Outtree = Tree_9_D9 Method = Density K=9 CCC
Pseudo ;
Var PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Copy Cluster ;
Run;
Proc Tree Data = Tree_9_D9 Horizontal Lines=(color=blue)
out = Tree_Out_9_D9 nclusters=5 ;
Copy PCAT1 PCAT2 PCAT3 PCAT4 Avg_Sales5 ;
Run;
Proc Print Data = Tree_Out_9_D9 ;
Run;
76
77. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
77
78. 3. SUMMARY OF INSIGHTS
1. 16% of the stores have their mean sales from the Fresh Food category higher than the overall average in this category.
2. 13% of the stores have their mean sales from the Frozen Food category higher than the overall average in this category.
3. 16% of the stores have their mean sales from the Frozen Food category lower than the overall average in this category.
4. The % sales from the category of Health & Beauty in all the clusters formed above is nearly around the overall mean sales of this category.
5. Only 5% of the total no. of stores have their mean sales from the category Tobacco & Alcohol lower than the overall mean sales of this category.
6. 7% of the total stores have their average sales per sq. foot significantly higher than the overall average. The difference is particularly pronounced
for stores in Cluster 4, in which the average size of the stores is also much smaller than the overall average size.
7. 29% of the total stores have their average sales per sq. foot significantly lower than the overall average.
Cluster # No. of stores
3 81
Cluster # No. of stores
2 67
Cluster # No. of stores
3 81
Cluster # No. of stores
4 7
5 30
Cluster # No. of stores
2 67
3 81
78
79. CONTENTS
1. Objective
2. Methodology
a. Exploratory Data Analysis
b. Data Preparation
i. Scaling
ii. Weighting
c. Preliminary Cluster Analysis (PROC FASTCLUS)
i. Creation of K preliminary clusters
ii. Detection of Outliers
iii. Treatment of Outliers
• Alternative treatment of outliers also attempted
iv. Clustering of the data set treated for outliers
v. Evaluating & Profiling of clusters formed in preliminary analysis
d. Hierarchical Clustering (PROC CLUSTER)
i. Ward’s Method
ii. Density Method
3. Summary of Insights
4. Recommendations
79
80. 4. RECOMMENDATIONS
1. Cluster 3
a The size of stores in this cluster is higher than the average size of all stores, though the difference is not significant.
b When compared with the overall mean sales of all stores from the Fresh Foods category, the contribution to revenue from Fresh Foods
is highest for stores in this cluster.
c However, when compared with the overall mean sales of all stores from the Frozen Foods category, the contribution to revenue from Frozen Foods is
lowest for stores in this cluster.
d The average sales per square foot for stores in this cluster is also lower than the overall average sales per sq. foot of all stores.
e The above observations imply that although the Fresh Foods category contributes the most to sales, this contribution is not enough to
lift the overall sales of these stores, which remain below the average of all other stores despite the larger store size.
There is therefore a need to adopt techniques such as better placement of these products or a promotional campaign targeted specifically at products in this
category.
Strategies may also be devised for promoting sales from the Frozen Foods category, which are significantly lower than the overall average sales of this
category in other stores.
One possibility is that sales from the Fresh Foods category are cannibalizing sales from the Frozen Foods category, in which case an alternative shelf placement is
required.
80
81. 4. RECOMMENDATIONS
2. Cluster 2
a As compared with stores in Cluster 3, a contrasting situation is seen for stores in this cluster.
b Sales from the Frozen Foods category contribute the most to the overall revenue of stores in this cluster and are greater than the overall mean sales from this
category in all other stores, whereas sales from the Fresh Foods category are lower than the overall mean sales from this category in other stores.
c The average size of stores in this cluster is roughly the same as that of stores in Cluster 3 and is higher than the overall mean size of other stores.
d However, the average sales per sq. foot is lower than the overall average sales per sq. foot of other stores, and also lower than that of stores
in Cluster 3.
e Hence, strategies similar to those proposed for stores in Cluster 3 may be replicated for stores in Cluster 2 to promote sales from both the Fresh Foods and
Frozen Foods categories.
This may be done after gaining insight into the factors that are driving Frozen Foods sales in stores of Cluster 2 and Fresh Foods sales in stores of Cluster 3.
81
82. 4. RECOMMENDATIONS
3. Cluster 1
a Cluster 1 contains the largest no. of stores, roughly 64% of the total.
b Sales from all 4 product categories for stores in this cluster are very close to the overall mean sales of each of the four categories across all stores.
c The average size of the stores in this cluster is also very close to the overall average size of all stores.
d Since this cluster contains the highest and a significant % of the stores, promotional activities adopted across these stores can also help
significantly increase the overall sales volume of Retailer X.
4. Cluster 4
a This cluster houses only 1% of the total stores, with the only differentiating factor being the average sales per sq. foot, which is significantly higher than the
overall average for other stores.
b The mean sales of products in each of the 4 categories are very similar to the overall mean sales of those categories.
c Hence, the most likely reason for the significantly higher average sales per sq. foot is the smaller-than-average size of the stores.
No specific state-wise pattern has emerged for these stores, the distribution being fairly consistent across both states, KA & TN.
82
83. 4. RECOMMENDATIONS
5. Cluster 5
a This cluster houses nearly 6% of the total stores.
b The distribution of variables for stores in this cluster is very similar to that for stores in Cluster 4.
c However, it may be noted that sales from the Tobacco & Alcohol category are lower than the overall mean sales of other stores from this category.
6. Having identified the drivers of sales in stores of each of the 5 clusters, it is next important to understand the other factors that influence each of these
drivers.
Inclusion of demographic factors such as age, income, location, gender, etc. as additional variables may give better insight into the promotional strategies, unique to
each cluster, that may be adopted for increasing sales.
83