Big Data Analysis Project
Group #18
Aaron Fuhrman
Jessica Demuro
Wilson Tapia
AJW Consultants, Inc.
The Pennsylvania State University
12/9/14
Table of Contents
1.) Introduction
2.) Methods
2.1) Moving Average and Exponential Smoothing
2.2) Regression Analysis
2.3) K-Means Clustering
2.4) Regression
3.) Results
Introduction:
This assignment uses predictive analytics methods to analyze the given transaction data
and make meaningful recommendations that help the store manager improve sales. The methods
used in this report are moving average, exponential smoothing, linear regression and
k-means clustering. The transaction data contains a wide variety of information that varies
from week to week.
The first calculation is a moving average and exponential smoothing forecast, from which
the prediction error for weeks 10 through 20 is computed. The Mean Squared Error, Mean
Absolute Deviation, and Tracking Signal are then calculated from these errors. The next
calculation is a regression analysis of the transaction data. Lastly, the k-means clustering
algorithm is used to cluster the customers in the database based on their buying patterns.
The results are then analyzed and used to draw conclusions about the relationships among
the customers and the transaction data.
Problem Description:
The customer base and transaction data are to be analyzed in order to develop appropriate
strategies to increase sales. To do this, a series of predictive analytics methods is used,
including moving average and exponential smoothing, regression, and k-means clustering.
The results of these methods are then analyzed in order to make recommendations to
management.
Methodology:
A variety of predictive analytics methods were used in this project. They are described in
detail below.
2.1 Moving Average and Exponential Smoothing
This part of the analysis forecast the number of items sold and measured the error of that
forecast against the actual values for weeks 10-20. More specifically, we calculated the
Mean Absolute Deviation (MAD), Mean Square Error (MSE) and Tracking Signal (TS). To obtain
these numbers we first forecast the data using two methods: a moving average and
exponential smoothing. For the moving average we used a 3-point moving average, in which
the average of the previous three weeks is the prediction for the following week. This was
repeated until we had a forecast for weeks 12 through 20. A moving average with more points
produces a smoother forecast, although it also reacts more slowly to recent changes. As a
subset, we chose cereal and cat food to represent the process. Exponential smoothing uses
an alpha value together with the forecast and the actual value for the previous week to
predict the next week. Note that the forecasts must be integers: any forecast with a
decimal part is rounded up to the next whole number, because a fraction of an item cannot
be purchased, so a full item must be purchased to fill the gap. The alpha value sets how
much weight the previous actual value has on the new forecast. In this case we chose an
alpha value of 0.2. This value was chosen arbitrarily, subject to the constraint that it
lie between 0 and 1. We used the forecasts and the actual values to create graphs and
tables of the weeks and the number of units bought (Figures A-1 through A-6 and Tables 1-4).
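The two forecasting methods described above can be sketched in Python as follows. This is a minimal illustration rather than the spreadsheet used in the project: the function names, the use of the first actual value as the initial exponential smoothing forecast, and feeding the rounded forecast back into the next step are all assumptions, and the sample numbers in the usage note are made up.

```python
import math


def round_up(value):
    """Round up to a whole unit (like Excel ROUNDUP with 0 digits, for positive values)."""
    # round(..., 9) guards against float noise such as 12.000000000000002.
    return math.ceil(round(value, 9))


def moving_average_forecast(actuals, window=3):
    """Forecast each week as the rounded-up mean of the previous `window` weeks.

    The first forecast is for the week at index `window` (0-based);
    the last one is for the week right after the data ends.
    """
    forecasts = []
    for t in range(window, len(actuals) + 1):
        mean = sum(actuals[t - window:t]) / window
        forecasts.append(round_up(mean))
    return forecasts


def exponential_smoothing_forecast(actuals, alpha=0.2, initial_forecast=None):
    """Apply F[t+1] = alpha * A[t] + (1 - alpha) * F[t], rounding each forecast up.

    forecasts[t] is the forecast for week t; the list has one more entry
    than `actuals` because it ends with the forecast for the next week.
    """
    # Assumption: seed the first forecast with the first actual value.
    forecast = actuals[0] if initial_forecast is None else initial_forecast
    forecasts = [forecast]
    for actual in actuals:
        forecast = round_up(alpha * actual + (1 - alpha) * forecast)
        forecasts.append(forecast)
    return forecasts
```

For example, with made-up weekly units of 10, 12, 14 and 13, `moving_average_forecast` gives 12 for the fourth week and 13 for the week after the data ends.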
Using the forecasted values and the actual values, we obtained the MAD, MSE, and TS. The
MAD measures by how much the forecast deviates from the actual values on average, without
regard to sign. The MSE measures the average squared difference between the forecast and
the actual values, so it penalizes large errors more heavily. Last, the Tracking Signal
measures the bias in the forecast. The MAD was calculated by taking the difference between
the actual and forecast value for each week, taking the absolute value of each difference,
summing them, and dividing by the number of weeks. The MSE was calculated in a similar
manner, except that the differences were squared instead of taking the absolute value, and
then averaged. The TS was calculated by summing all of the differences between the actual
and forecasted values (keeping their signs) and dividing this sum by the MAD of the
respective item. In Excel, the main functions used were ABS, ROUNDUP, sqr, and other basic
mathematical computations.
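As a rough sketch of the three error measures defined above, the following Python function computes MAD, MSE and TS from paired actual and forecast values. The function name and the sample numbers in the usage note are illustrative assumptions, not values from the project data.

```python
def forecast_errors(actuals, forecasts):
    """Return (MAD, MSE, TS) for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    n = len(errors)
    mad = sum(abs(e) for e in errors) / n   # mean absolute deviation
    mse = sum(e * e for e in errors) / n    # mean squared error
    # Tracking signal: running sum of signed errors divided by the MAD.
    ts = sum(errors) / mad if mad else 0.0
    return mad, mse, ts
```

With made-up actuals of 10, 12, 14 and forecasts of 9, 13, 12, the errors are 1, -1 and 2, so the MAD is 4/3, the MSE is 2 and the TS is 1.5. A TS far from zero suggests the forecast is consistently too low or too high.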
2.2 Regression Analysis
Using all 104 weeks of the transaction data, pivot tables were created in Excel to identify
the top 10 items. Once the top 10 items were obtained, a regression analysis was done in
Minitab, which reported the ANOVA table, correlation coefficient, R-sq, F-values, fitted
models and the regression equation. A table was created that compares the R-sq value of the
different item types. For question #3 a pivot table was also used, but with only the first
60 weeks of the data and the top three items sold. Regression was again done in Minitab to
obtain the regression equations. In question #4 we plugged the range of weeks (61-80) into
these regression equations to obtain the predicted sales for each week.
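The regressions in this report were run in Minitab. As a sketch of what a simple linear regression of units bought against week number computes, the ordinary least squares fit and its R-sq can be written in Python as below. The function name is an assumption, and the numbers in the usage note are made up.

```python
def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_sq), where r_sq is a fraction
    between 0 and 1 (multiply by 100 to get a percentage).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_error = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_sq = 1 - ss_error / ss_total if ss_total else 0.0
    return intercept, slope, r_sq
```

For example, weeks 1 to 4 with units 3, 5, 7 and 9 lie exactly on a line, so the fit returns an intercept of 1, a slope of 2 and an R-sq of 1.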
2.3 k-Means Clustering
The k-means clustering algorithm was used to cluster the customers based on their buying
patterns. The information obtained was used to identify any correlation between the
customers' buying patterns and demographic information. Three different sets of two items
were chosen and clustered with k=3. First, eggs and ice cream (items 12 and 13) were
chosen. A pivot table was then made in Excel to display each customer ID with the units of
eggs and ice cream purchased. The data was then pasted into the sample k-means Excel
spreadsheet provided, which calculated the centroids for the designated number of clusters.
For eggs and ice cream, the items were also clustered with k=4, 5, and 6.
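The provided Excel spreadsheet is not reproduced here, but the k-means procedure can be sketched in Python. This is a generic version of the standard algorithm (Lloyd's iteration) under assumed inputs: each customer is a pair (units of item A, units of item B), the starting centroids are picked at random from the data, and the function name is hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels
```

The result depends on the starting centroids, so different seeds can give different clusters. Running the function with several values of k, as was done for eggs and ice cream, only requires changing the `k` argument.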
2.4 Regression
Figure 2.1: Top 10 selling items over two years (104 weeks)
This screenshot was created in Microsoft Excel with a pivot table, using "Item Type" as
columns, "Weeks" as rows, and the sum of units bought as the value. It shows the top 10
items that were sold the most in the 104 weeks.
Table 2.1: Ranking of the top 10 item types sold, with the total units bought over the 2
years.
Item Type   Ranking   Total Volume
17          1         7776
12          2         6879
3           3         6254
8           4         5165
5           5         4961
2           6         4006
4           7         3930
13          8         3538
9           9         3382
11          10        2365
The graphs below represent the time series plot of the top 10 items sold and their units
bought in the 104 weeks.
Graph 2.1: Time series plot of the Weeks vs Number of units bought on item 17
Graph 2.2: Time series plot of the Weeks vs Number of units bought on item 12
Graph 2.3: Time series plot of the Weeks vs Number of units bought on item 3
Graph 2.4: Time series plot of the Weeks vs Number of units bought on item 8
Graph 2.5: Time series plot of the Weeks vs Number of units bought on item 5
Graph 2.6: Time series plot of the Weeks vs Number of units bought on item 2
Graph 2.7: Time series plot of the Weeks vs Number of units bought on item 4
Graph 2.8: Time series plot of the Weeks vs Number of units bought on item 13
Graph 2.9: Time series plot of the Weeks vs Number of units bought on item 9
Graph 2.10: Time series plot of the Weeks vs Number of units bought on item 11
The regression analysis was performed using Minitab, which summarized the analysis of
variance (ANOVA) table, R-sq, F-values, fitted model and regression equation for each item.
The simple linear regression of each item against the week is shown below. Each output
includes the analysis of variance with values such as degrees of freedom, F-values and
P-values, followed by the model summary, coefficients and regression equation.
Regression Analysis: 2 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 745.1 745.1 2.70 0.104
Weeks 1 745.1 745.1 2.70 0.104
Error 102 28196.8 276.4
Total 103 28942.0
Model Summary
S R-sq R-sq(adj) R-sq(pred)
16.6265 2.57% 1.62% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 43.20 3.28 13.15 0.000
Weeks -0.0892 0.0543 -1.64 0.104 1.00
Regression Equation
Y = 43.20 - 0.0892 Weeks
Fits and Diagnostics for Unusual Observations
Obs 2 Fit Resid Std Resid
40 73.00 39.63 33.37 2.02 R
44 0.00 39.28 -39.28 -2.37 R
45 0.00 39.19 -39.19 -2.37 R
46 0.00 39.10 -39.10 -2.36 R
47 0.00 39.01 -39.01 -2.36 R
49 0.00 38.83 -38.83 -2.35 R
50 0.00 38.74 -38.74 -2.34 R
51 0.00 38.65 -38.65 -2.34 R
52 0.00 38.56 -38.56 -2.33 R
101 72.00 34.19 37.81 2.31 R
R Large residual
Regression Analysis: 3 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 1042 1042 0.71 0.402
Weeks 1 1042 1042 0.71 0.402
Error 102 150216 1473
Total 103 151258
Model Summary
S R-sq R-sq(adj) R-sq(pred)
38.3758 0.69% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 65.67 7.58 8.66 0.000
Weeks -0.105 0.125 -0.84 0.402 1.00
Regression Equation
Y = 65.67 - 0.105 Weeks
Fits and Diagnostics for Unusual Observations
Obs 3 Fit Resid Std Resid
2 167.00 65.46 101.54 2.70 R
22 250.00 63.35 186.65 4.91 R
80 167.00 57.23 109.77 2.89 R
96 156.00 55.55 100.45 2.66 R
R Large residual
Regression Analysis: 4 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 37.6 37.63 0.09 0.766
Weeks 1 37.6 37.63 0.09 0.766
Error 102 43035.7 421.92
Total 103 43073.3
Model Summary
S R-sq R-sq(adj) R-sq(pred)
20.5407 0.09% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 38.84 4.06 9.57 0.000
Weeks -0.0200 0.0671 -0.30 0.766 1.00
Regression Equation
Y = 38.84 - 0.0200 Weeks
Fits and Diagnostics for Unusual Observations
Obs 4 Fit Resid Std Resid
98 104.00 36.88 67.12 3.32 R
R Large residual
Regression Analysis: 5 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 2430 2429.9 5.65 0.019
Weeks 1 2430 2429.9 5.65 0.019
Error 102 43872 430.1
Total 103 46302
Model Summary
S R-sq R-sq(adj) R-sq(pred)
20.7392 5.25% 4.32% 2.04%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 56.15 4.10 13.71 0.000
Weeks -0.1610 0.0677 -2.38 0.019 1.00
Regression Equation
Y = 56.15 - 0.1610 Weeks
Fits and Diagnostics for Unusual Observations
Obs 5 Fit Resid Std Resid
44 0.00 49.07 -49.07 -2.38 R
45 0.00 48.91 -48.91 -2.37 R
46 0.00 48.75 -48.75 -2.36 R
47 0.00 48.59 -48.59 -2.35 R
49 0.00 48.27 -48.27 -2.34 R
50 0.00 48.10 -48.10 -2.33 R
51 0.00 47.94 -47.94 -2.32 R
52 0.00 47.78 -47.78 -2.32 R
84 88.00 42.63 45.37 2.21 R
88 87.00 41.99 45.01 2.20 R
R Large residual
Regression Analysis: 8 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 7345 7345.1 14.97 0.000
Weeks 1 7345 7345.1 14.97 0.000
Error 102 50042 490.6
Total 103 57387
Model Summary
S R-sq R-sq(adj) R-sq(pred)
22.1497 12.80% 11.94% 9.99%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 64.36 4.38 14.71 0.000
Weeks -0.2799 0.0723 -3.87 0.000 1.00
Regression Equation
Y = 64.36 - 0.2799 Weeks
Fits and Diagnostics for Unusual Observations
Obs 8 Fit Resid Std Resid
44 0.00 52.04 -52.04 -2.36 R
45 0.00 51.76 -51.76 -2.35 R
46 0.00 51.48 -51.48 -2.34 R
47 0.00 51.20 -51.20 -2.32 R
49 0.00 50.64 -50.64 -2.30 R
50 0.00 50.36 -50.36 -2.28 R
51 0.00 50.08 -50.08 -2.27 R
52 0.00 49.80 -49.80 -2.26 R
R Large residual
Regression Analysis: 9 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 2660 2660.0 12.48 0.001
Weeks 1 2660 2660.0 12.48 0.001
Error 102 21742 213.2
Total 103 24402
Model Summary
S R-sq R-sq(adj) R-sq(pred)
14.5999 10.90% 10.03% 8.04%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 41.36 2.88 14.34 0.000
Weeks -0.1685 0.0477 -3.53 0.001 1.00
Regression Equation
Y = 41.36 - 0.1685 Weeks
Fits and Diagnostics for Unusual Observations
Obs 9 Fit Resid Std Resid
9 69.00 39.85 29.15 2.03 R
27 76.00 36.82 39.18 2.71 R
44 0.00 33.95 -33.95 -2.34 R
45 0.00 33.78 -33.78 -2.33 R
46 0.00 33.61 -33.61 -2.31 R
47 0.00 33.45 -33.45 -2.30 R
49 0.00 33.11 -33.11 -2.28 R
50 0.00 32.94 -32.94 -2.27 R
51 0.00 32.77 -32.77 -2.26 R
52 0.00 32.60 -32.60 -2.24 R
88 57.00 26.54 30.46 2.11 R
R Large residual
Regression Analysis: 11 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 104.4 104.4 0.46 0.499
Weeks 1 104.4 104.4 0.46 0.499
Error 102 23169.6 227.2
Total 103 23274.0
Model Summary
S R-sq R-sq(adj) R-sq(pred)
15.0716 0.45% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 24.49 2.98 8.23 0.000
Weeks -0.0334 0.0492 -0.68 0.499 1.00
Regression Equation
Y = 24.49 - 0.0334 Weeks
Fits and Diagnostics for Unusual Observations
Obs 11 Fit Resid Std Resid
88 102.00 21.56 80.44 5.40 R
101 52.00 21.12 30.88 2.09 R
104 67.00 21.02 45.98 3.11 R
R Large residual
Regression Analysis: 12 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 11 10.85 0.01 0.935
Weeks 1 11 10.85 0.01 0.935
Error 102 164700 1614.71
Total 103 164711
Model Summary
S R-sq R-sq(adj) R-sq(pred)
40.1834 0.01% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 65.58 7.94 8.26 0.000
Weeks 0.011 0.131 0.08 0.935 1.00
Regression Equation
Y = 65.58 + 0.011 Weeks
Fits and Diagnostics for Unusual Observations
Obs 12 Fit Resid Std Resid
27 154.00 65.87 88.13 2.21 R
38 152.00 65.99 86.01 2.15 R
41 148.00 66.02 81.98 2.05 R
77 147.00 66.41 80.59 2.02 R
81 171.00 66.45 104.55 2.63 R
96 219.00 66.61 152.39 3.85 R
R Large residual
Regression Analysis: 13 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 939.3 939.3 2.97 0.088
Weeks 1 939.3 939.3 2.97 0.088
Error 102 32214.7 315.8
Total 103 33154.0
Model Summary
S R-sq R-sq(adj) R-sq(pred)
17.7716 2.83% 1.88% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 39.27 3.51 11.19 0.000
Weeks -0.1001 0.0580 -1.72 0.088 1.00
Regression Equation
Y = 39.27 - 0.1001 Weeks
Fits and Diagnostics for Unusual Observations
Obs 13 Fit Resid Std Resid
1 82.00 39.17 42.83 2.46 R
9 83.00 38.37 44.63 2.55 R
101 82.00 29.16 52.84 3.03 R
R Large residual
Regression Analysis: 17 versus Weeks
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 1 74 73.74 0.07 0.788
Weeks 1 74 73.74 0.07 0.788
Error 102 103299 1012.73
Total 103 103372
Model Summary
S R-sq R-sq(adj) R-sq(pred)
31.8235 0.07% 0.00% 0.00%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 76.24 6.29 12.13 0.000
Weeks -0.028 0.104 -0.27 0.788 1.00
Regression Equation
Y = 76.24 - 0.028 Weeks
Fits and Diagnostics for Unusual Observations
Obs 17 Fit Resid Std Resid
44 0.00 75.01 -75.01 -2.37 R
45 0.00 74.98 -74.98 -2.37 R
46 0.00 74.95 -74.95 -2.37 R
47 0.00 74.92 -74.92 -2.37 R
49 0.00 74.87 -74.87 -2.36 R
50 0.00 74.84 -74.84 -2.36 R
51 0.00 74.81 -74.81 -2.36 R
52 0.00 74.78 -74.78 -2.36 R
104 136.00 73.32 62.68 2.01 R
R Large residual
Table 2.2: Coefficient of determination (R^2) and adjusted R^2 of the top 10 item types
sold, taken from the Model Summary of each regression above.
Item Type   R^2      Adj_R^2
17          0.07%    0.00%
12          0.01%    0.00%
3           0.69%    0.00%
8           12.80%   11.94%
5           5.25%    4.32%
2           2.57%    1.62%
4           0.09%    0.00%
13          2.83%    1.88%
9           10.90%   10.03%
11          0.45%    0.00%
The R^2 values above are shown as percentages.
3) Figure 3.1: Top 3 selling items over the first 60 weeks
To answer question #3, the team constructed a pivot table in Excel from the transaction
data. In the pivot table we used Item Type as columns, Weeks as rows, and the sum of units
bought as the value. From this we found that the top items were 17, 12 and 3, with grand
totals of 4487, 3815, and 3526 respectively. We used Minitab to obtain the fitted line plot
and the regression analysis for the top three items. From this analysis we obtained the
equations needed for question #4. We also performed an OLS regression analysis in Excel.
Row Labels    3      12     17     Grand Total
Grand Total   3526   3815   4487   11828
Graph 3.1: Fitted line plot of the first 60 weeks vs. number of units sold for item 17
Regression Analysis: 17 versus Weeks
The regression equation is
Y = 85.11 + 0.0423 Weeks
S = 22.5467 R-Sq = 0.1% R-Sq(adj) = 0.0%
Analysis of Variance
Source DF SS MS F P
Regression 1 27.0 27.017 0.05 0.819
Error 50 25417.7 508.353
Total 51 25444.7
Graph 3.2: Fitted line plot of the first 60 weeks vs. number of units sold for item 12
Regression Analysis: 12 versus Weeks
The regression equation is
Y = 76.18 - 0.1012 Weeks
S = 33.5337 R-Sq = 0.3% R-Sq(adj) = 0.0%
Analysis of Variance
Source DF SS MS F P
Regression 1 154.6 154.61 0.14 0.712
Error 50 56225.4 1124.51
Total 51 56380.1
Graph 3.3: Fitted line plot of the first 60 weeks vs. number of units sold for item 3
Regression Analysis: 3 versus Weeks
The regression equation is
Y = 81.59 - 0.4956 Weeks
S = 38.3690 R-Sq = 4.8% R-Sq(adj) = 2.9%
Analysis of Variance
Source DF SS MS F P
Regression 1 3711.2 3711.19 2.52 0.119
Error 50 73608.9 1472.18
Total 51 77320.1
The fitted line plots and the regression analysis were produced for the top 3 items sold in
the first 60 weeks (butter, snacks and eggs). The regression analysis performed in Minitab
provides the ANOVA table, the regression equation, and the R-sq and adjusted R-sq values.
4)
For question #4 our group predicted the total weekly sales quantity for weeks 61 through
80. We took the regression equations for items 17, 12 and 3 from question #3 and
substituted the week numbers 61 through 80 for "Weeks" to predict the sales for each week.
Microsoft Excel was used for these calculations.
TABLE 4.1: The forecasted sales from week 61 to 80 for Item 17
Weeks Formula for item 17 Answer
61 Y = 85.11 + 0.0423 Weeks 87.6903
62 Y = 85.11 + 0.0423 Weeks 87.7326
63 Y = 85.11 + 0.0423 Weeks 87.7749
64 Y = 85.11 + 0.0423 Weeks 87.8172
65 Y = 85.11 + 0.0423 Weeks 87.8595
66 Y = 85.11 + 0.0423 Weeks 87.9018
67 Y = 85.11 + 0.0423 Weeks 87.9441
68 Y = 85.11 + 0.0423 Weeks 87.9864
69 Y = 85.11 + 0.0423 Weeks 88.0287
70 Y = 85.11 + 0.0423 Weeks 88.071
71 Y = 85.11 + 0.0423 Weeks 88.1133
72 Y = 85.11 + 0.0423 Weeks 88.1556
73 Y = 85.11 + 0.0423 Weeks 88.1979
74 Y = 85.11 + 0.0423 Weeks 88.2402
75 Y = 85.11 + 0.0423 Weeks 88.2825
76 Y = 85.11 + 0.0423 Weeks 88.3248
77 Y = 85.11 + 0.0423 Weeks 88.3671
78 Y = 85.11 + 0.0423 Weeks 88.4094
79 Y = 85.11 + 0.0423 Weeks 88.4517
80 Y = 85.11 + 0.0423 Weeks 88.494
TABLE 4.2: The forecasted sales from week 61 to 80 for Item 12
Weeks Formula for item 12 Answer
61 Y = 76.18 - 0.1012 Weeks 70.0068
62 Y = 76.18 - 0.1012 Weeks 69.9056
63 Y = 76.18 - 0.1012 Weeks 69.8044
64 Y = 76.18 - 0.1012 Weeks 69.7032
65 Y = 76.18 - 0.1012 Weeks 69.602
66 Y = 76.18 - 0.1012 Weeks 69.5008
67 Y = 76.18 - 0.1012 Weeks 69.3996
68 Y = 76.18 - 0.1012 Weeks 69.2984
69 Y = 76.18 - 0.1012 Weeks 69.1972
70 Y = 76.18 - 0.1012 Weeks 69.096
71 Y = 76.18 - 0.1012 Weeks 68.9948
72 Y = 76.18 - 0.1012 Weeks 68.8936
73 Y = 76.18 - 0.1012 Weeks 68.7924
74 Y = 76.18 - 0.1012 Weeks 68.6912
75 Y = 76.18 - 0.1012 Weeks 68.59
76 Y = 76.18 - 0.1012 Weeks 68.4888
77 Y = 76.18 - 0.1012 Weeks 68.3876
78 Y = 76.18 - 0.1012 Weeks 68.2864
79 Y = 76.18 - 0.1012 Weeks 68.1852
80 Y = 76.18 - 0.1012 Weeks 68.084
TABLE 4.3: The forecasted sales from week 61 to 80 for Item 3
Weeks Formula for item 3 Answer
61 Y = 81.59 - 0.4956 Weeks 51.3584
62 Y = 81.59 - 0.4956 Weeks 50.8628
63 Y = 81.59 - 0.4956 Weeks 50.3672
64 Y = 81.59 - 0.4956 Weeks 49.8716
65 Y = 81.59 - 0.4956 Weeks 49.376
66 Y = 81.59 - 0.4956 Weeks 48.8804
67 Y = 81.59 - 0.4956 Weeks 48.3848
68 Y = 81.59 - 0.4956 Weeks 47.8892
69 Y = 81.59 - 0.4956 Weeks 47.3936
70 Y = 81.59 - 0.4956 Weeks 46.898
71 Y = 81.59 - 0.4956 Weeks 46.4024
72 Y = 81.59 - 0.4956 Weeks 45.9068
73 Y = 81.59 - 0.4956 Weeks 45.4112
74 Y = 81.59 - 0.4956 Weeks 44.9156
75 Y = 81.59 - 0.4956 Weeks 44.42
76 Y = 81.59 - 0.4956 Weeks 43.9244
77 Y = 81.59 - 0.4956 Weeks 43.4288
78 Y = 81.59 - 0.4956 Weeks 42.9332
79 Y = 81.59 - 0.4956 Weeks 42.4376
80 Y = 81.59 - 0.4956 Weeks 41.942
The calculations were done in Microsoft Excel, using the regression equations obtained from
Minitab in question #3.
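The forecast tables above can be reproduced with a few lines of Python. The intercepts and slopes are the Minitab regression equations from question #3. The dictionary layout and function name are illustrative assumptions and not part of the original Excel workbook.

```python
# (intercept, slope) of Y = intercept + slope * Weeks, from question #3.
EQUATIONS = {
    17: (85.11, 0.0423),
    12: (76.18, -0.1012),
    3: (81.59, -0.4956),
}


def forecast(item, week):
    """Predicted units sold for `item` in `week` from its regression line."""
    intercept, slope = EQUATIONS[item]
    return intercept + slope * week


# Forecasts for weeks 61 through 80, rounded to four decimals as in the tables.
forecasts = {
    item: [round(forecast(item, week), 4) for week in range(61, 81)]
    for item in EQUATIONS
}
```

Here `forecasts[17][0]` is the week 61 forecast for item 17 and `forecasts[3][-1]` is the week 80 forecast for item 3.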
Discussion from regression:
A ranking was made of the top 10 items bought in the retail store. Item 17, the most sold
item, had a total volume of 7776 units, and item 11, the lowest of the ten, sold 2365 units
in two years.
A time series was plotted for each item type. The regression diagnostics above flag zero
units bought in weeks 44-47 and 49-52 for several items, which probably means that the store
was closed or under maintenance during those weeks.
R-squared is a statistical measure of how close the data are to the fitted regression line.
It is also known as the coefficient of determination (or the coefficient of multiple
determination for multiple regression) and is expressed as a percentage between 0 and 100.
In the regression output above, the items with the highest R^2 are item 8, at about 12.80%,
and item 9, at about 10.90%. It is usually better to look at the adjusted R-squared, which
corrects R-squared for the number of terms in the model. However, from this data we can
conclude that most of these items have low R-squared values, so the week number explains
little of the variation in sales and the fitted values will be off.
The top 3 items bought in weeks 1-60 are 17, 12 and 3, with 4487, 3815 and 3526 units
bought respectively. The fitted line plots did not show any clear trend. After obtaining
the regression equations we predicted the total weekly sales quantity for weeks 61-80 for
each item. In these predictions item 17 remains the top seller, followed by items 12 and 3.
Question 5:
K-means clustering was performed for three sets of items. As mentioned in the methods
section, eggs and ice cream were the first set of items chosen. Using Excel, the data was
clustered separately into three, four, five and six clusters, and the centroid of each
cluster was then found. The resulting scatter plot of eggs and ice cream with k=3 is shown
below in Figure 5.1. Cluster 1 represents the customers that purchase a moderate amount of
these items, cluster 2 represents the customers that purchase low volumes of eggs and ice
cream, and cluster 3 contains the customers that purchase these items in abundance.
Figure 5.1: K-means clustering for Eggs vs. Ice cream for k=3
Figure 5.2: K-means clustering for Eggs vs. Ice cream for k=4
Figure 5.2 shows the k-means clustering for eggs vs. ice cream for k=4. Cluster 1
represents the customers that bought a moderate amount of eggs and ice cream. Cluster 2 is
a smaller cluster of customers that bought a moderate amount of ice cream but a much larger
amount of eggs. Cluster 3 represents customers that bought a minimal amount of each item,
and cluster 4 represents the customers that bought a moderate amount of eggs but a very
high amount of ice cream.
Figure 5.3: K means clustering for Eggs vs. Ice cream for k=5
Figure 5.4: K means clustering for Eggs vs. Ice cream for K=6
The same analysis was repeated for pizza vs. snacks, and the results are shown below in
Figure 5.5. Pizza and snacks were chosen together because they are similar items that are
often bought together, for example for parties. Cluster 1, shown in blue, represents the
customers that purchased a low volume of pizza and snacks. Cluster 2, shown in red,
represents the customers that purchased a high volume of snacks and pizza, whereas cluster
3 represents the customers that purchased a moderate amount of these items. Pizza is
represented on the x-axis and snacks on the y-axis.
Figure 5.5: K-means clustering for Pizza vs. Snacks for k=3
Figure 5.6: K-means clustering for BBQ vs. Cereal for k=3
BBQ and cereal were chosen as two random items that would not normally go together.
Cluster 1, in blue, represents customers who bought a low volume of both items. Cluster 2,
in red, represents customers who bought a low volume of BBQ (x-axis) and a higher volume of
cereal (y-axis). Cluster 3, in green, represents customers who bought a moderate amount of
both items.
Question 6:
Out of the 494 families in the data that responded, only 9 families did not have any TVs in
their homes. This is a mere 2% of the data and is important for advertising purposes: if
98% of your customers have a television in their homes, television advertisements would
most likely have the greatest chance of catching their attention. The average income code
in the data was 6, which represents the $25,000-$35,000 range. This range is lower on the
income scale, and it is important to acknowledge from the store owner's perspective. In
order to be successful, it is important to know your customers and the financial background
they come from. Figure 6.1 below is a bar graph that shows the distribution of the
customers in the database according to their income number, which is defined in the data
dictionary. The majority of the customers fall in income ranges 6 and 7.
The COUNTIFS function was used on the demographics data in Excel to calculate the number of
households with dogs and cats. It was found that 169 households had one or more dogs and 79
had one or more cats. Together those totals equal only about half of the families in the
database, which can be important to store owners because money might be saved by cutting
back on the pet sections in the store. It was also found that 488 of the families had at
least one child, which is about 99% of the customers. Every household that had a cat or dog
also had at least one child. This is important for many purposes such as advertising
methods, sales, and coupon distribution.
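A COUNTIFS-style count like the one described above can be sketched in Python as follows. The field names (`dogs`, `cats`, `children`) and the sample rows are hypothetical placeholders, not the actual column names or records of the demographics data.

```python
def count_households(rows, **minimums):
    """Count rows in which every named field is at least the given minimum."""
    return sum(
        all(row[field] >= minimum for field, minimum in minimums.items())
        for row in rows
    )


# Made-up sample rows, for illustration only.
sample = [
    {"dogs": 1, "cats": 0, "children": 2},
    {"dogs": 0, "cats": 2, "children": 1},
    {"dogs": 0, "cats": 0, "children": 0},
    {"dogs": 2, "cats": 1, "children": 3},
]

households_with_dogs = count_households(sample, dogs=1)
households_with_dogs_and_children = count_households(sample, dogs=1, children=1)
```

Passing more than one keyword argument combines the conditions with AND, which mirrors how COUNTIFS treats multiple criteria.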
As for magazine subscriptions, the highest number of subscriptions was 125 for Reader's
Digest and the smallest was 12 for Cosmopolitan, followed by 13 for Glamour. Little can be
concluded from this besides the fact that advertising through magazines would not be the
best option for the store, because not enough of the customers have subscriptions.
Figure 6.1 Distribution of customers by income range
Table of Contents
1.) Introduction
2.) Summary
3.) Methods
2.1) Description of Data
2.2) Database and Table Schema
2.3) Queries
3.) Results
3.1) Description
4.) Discussion
4.1) Gender
4.2) Income
4.4) Weeks of the year
5.) Conclusion
6.) Appendix: Query Results and Graphs
1 Introduction
Using Microsoft Access, the group will provide a data analysis of the store's transaction
and demographics data in order to make effective recommendations about areas for additional
research and about marketing and management strategies. With this new database the retail
store will be able to make better decisions for the growth of its business.
The group will analyze the data from the transaction and demographics tables and draw
conclusions by identifying correlated items to make predictions. The trends will be shown
in graphs later on. Finally, the group will give the retail store recommendations for
improvement.
Microsoft Access will be the tool that the group uses to analyze the data. The first
objective is to create the tables so that the relationships can be defined. The
demographics and transaction data will be broken down into four tables, which are
∙ Demographics
∙ Transaction
∙ ItemType
∙ CouponID
Each table has a primary key, and foreign keys define the relationships between the tables.
There is no redundancy in any of these tables; our group put time and effort into
organizing the tables so that the relationships could be linked.
From these relationships, queries will be performed to get a sense of the strategy the
company should apply to increase sales. The idea of creating queries is to obtain
presentable and organized data in order to run reports for the retail store in an efficient
manner.
Twenty queries will be generated in Microsoft Access. After the data is organized, the most
influential categories will be exported to Microsoft Excel to highlight the key components
and trends in the data. After obtaining the results, it will be easy for the team to draw
conclusions and make recommendations for the retail store.
Given the variety and amount of data, a number of analysis and visualization techniques
were used in order to generate conclusions. The goal is to take the wide variety of raw
data and transform it into meaningful information. By carefully creating queries and using
visualization techniques, it will be easier to draw conclusions, since the results can be
connected to each other.
Summary
The objective of this assignment was to analyze the transaction and customer demographic
data given in the project description, in order to draw conclusions by performing queries
on the data, give recommendations, and predict future trends. Microsoft Access and
Microsoft Excel were used to visualize and evaluate the data. Using these methods, the most
useful data was considered, and these observations were used to formulate strategies to
improve the store's business.
In the demographics table many attributes were examined in detail, such as income, gender,
work hours, and ethnicity. A relationship was made with the transaction data in order to
understand the patterns. It was found that income, work hours, gender and weeks are key
attributes that have a large impact on purchasing items from the store.
2. Methods
2.1 Description of Data
For this report two sets of data are given: the transaction data and the demographics data.
A key field in the transaction data is the customer ID, which allows the data analyst to
view the transactions made by each customer, including the week and the day of each
purchase.
The customer ID listed in the transaction table is located in the demographics table as
well. With this link the retail store can obtain personal information about a specific
customer, such as ethnicity, family size, income, and education level, among others.
The analyst can use the demographics data to analyze the customer base and the profile of
each customer, which gives the manager better knowledge of the group that is coming to the
retail store, what items they are buying, and on what date each transaction is made. With
this information the analyst can relate customers' behaviors and trends to their personal
information. From this, a variety of conclusions can be drawn.
2.2 Access Database and Table Schema
Tables were created from the transaction and demographics data received from the retail
store for this project. The team focused on efficiency while creating these tables by using the
minimum number of tables and attributes. The resulting tables are: Transaction, Demographics,
ItemType and CouponID. Each of them has its own primary key. Once the relationships among
the tables were defined, it was easy to create queries and reports, and these results are very
helpful for analyzing trends and relationships.
1.) Demographics table: This table captures all of the customer demographic data
provided. The primary key is the unique customer ID number. The table also
contains information about the customer’s family, income, education level and
subscriptions to magazines.
Figure 1: Demographics Table
2.) Transaction table: The transaction table uses the unique transaction ID as the
primary key. It organizes the information contained in each individual customer
transaction. Information contained in this table includes the type and number of
the item being purchased, the day and week of the purchase and coupon usage.
Figure 2: Transaction Table
3.) Coupon table: The coupon table uses a unique Coupon ID as the primary key. This
table contains information about the coupons available to the customers and their
dollar value.
Figure 3: Coupon Table
4.) Item table: The item table uses the unique item type as the primary key. The data
dictionary lists 24 item types, each identified by a unique value.
Figure 4: Item Table
After the database tables were created, the relationships among them were defined.
Referential integrity was enforced on these relationships, as shown below in Figure 4. The
primary key of each table is included in the related tables as a foreign key.
Figure 4: Database Schema
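The effect of enforcing referential integrity can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module instead of Access, the column lists are abbreviated, and the spaceless column names and sample rows are assumptions made for illustration only.

```python
import sqlite3

# Toy in-memory version of the four tables described above.
# Column names without spaces and all sample values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE Demographics (CustomerID INTEGER PRIMARY KEY, Income INTEGER);
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE CouponID (CouponID INTEGER PRIMARY KEY, CouponValue REAL);
CREATE TABLE "Transaction" (
    TransactionID INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Demographics(CustomerID),
    ItemType INTEGER NOT NULL REFERENCES ItemType(ItemType),
    CouponID INTEGER REFERENCES CouponID(CouponID),
    UnitsBought INTEGER, Week INTEGER, Day INTEGER
);
""")
conn.execute("INSERT INTO Demographics VALUES (1, 4)")
conn.execute("INSERT INTO ItemType VALUES (17, 'Snacks')")
conn.execute('INSERT INTO "Transaction" VALUES (1, 1, 17, NULL, 2, 10, 5)')


def insert_is_rejected(customer_id):
    """Return True if a transaction for this customer ID violates a foreign key."""
    try:
        conn.execute(
            'INSERT INTO "Transaction" VALUES (NULL, ?, 17, NULL, 1, 10, 5)',
            (customer_id,),
        )
        return False
    except sqlite3.IntegrityError:
        return True


print(insert_is_rejected(999))  # customer 999 is not in Demographics
print(insert_is_rejected(1))    # customer 1 exists
```

With the constraint in place, a transaction can only refer to a customer, item type or coupon that already exists in its parent table, which is what keeps the joins in the queries from silently dropping or mismatching rows.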
2.3 Queries
A total of 20 queries were executed in Microsoft Access in order to analyze the
relationships among the various data categories. A wide range of the data categories were
considered and the most important ones were included in the queries. For example, it was
decided that the education level of customers was not a key area of concern, since the items listed
in the database are generic and widely used by the general public and most likely are not affected
by education level. The results of these queries, along with the SQL code produced, are displayed
below.
3.0 Results
1.) This query displays the sum of the units bought for each item in descending order
SQL code:
SELECT ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
FROM ItemType INNER JOIN [Transaction] ON ItemType.[Item Type] = Transaction.[Item Type]
GROUP BY ItemType.Description
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The first query shown below is a calculation of the total number of units sold per item.
This table shows that snacks are the most sold item and pills are the least sold item. This
information is important to store management and vital for their success. It is important to have a
large enough inventory of these products to satisfy the customer demand. The placement of these
products in the store can be important for advertisement purposes such as leading customers to
buy other products.
Figure 5
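The grouping and ordering used in query 1 can be tried on a small sample. The sketch below is an illustration only: it runs in SQLite through Python's sqlite3 module instead of Access, the table name Txn and the spaceless column names are hypothetical, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Txn (ItemType INTEGER, UnitsBought INTEGER);
INSERT INTO ItemType VALUES (13, 'Ice cream'), (16, 'Pizza'), (17, 'Snacks');
INSERT INTO Txn VALUES (17, 3), (13, 1), (17, 2), (16, 3), (13, 1);
""")

# Same shape as query 1: join, total the units per item, largest total first.
rows = conn.execute("""
SELECT i.Description, SUM(t.UnitsBought) AS SumOfUnitsBought
FROM ItemType AS i INNER JOIN Txn AS t ON i.ItemType = t.ItemType
GROUP BY i.Description
ORDER BY SumOfUnitsBought DESC
""").fetchall()

for description, total in rows:
    print(description, total)
```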
2.) This query displays the total sum of units bought from each day of the week
SQL code:
SELECT Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Transaction.Day
FROM [Transaction]
GROUP BY Transaction.Day;
Description:
The second query shown below shows the total number of items that were bought on each
day. The data shows that the most items were purchased on days 5 and 6, which represent Friday
and Saturday. Using the COUNTIFS function in Excel, it was found that only 2 out of the 495 given
families had both the male and female parent unemployed. This means that about 99% of the
families in the data have at least one parent who is employed, which makes activities such as
shopping more convenient on the weekend. This data indicates the store will be the busiest on
Fridays and therefore should have extra employees for stocking shelves and check out lines in
order to keep the customers satisfied.
Figure 6
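The COUNTIFS check described above amounts to counting rows that meet two conditions at once and then taking a share. A minimal sketch follows; the work-hours code used for "unemployed" is a placeholder assumption (the real code comes from the data dictionary), and the toy rows are made up.

```python
# Hypothetical work-hours code meaning "unemployed"; the real value is in the data dictionary.
UNEMPLOYED = 5


def count_both_unemployed(families):
    """Count families whose male and female work-hours codes are both UNEMPLOYED."""
    return sum(
        1 for male, female in families
        if male == UNEMPLOYED and female == UNEMPLOYED
    )


def share_with_employed_parent(total_families, both_unemployed):
    """Fraction of families in which at least one parent is not unemployed."""
    return 1 - both_unemployed / total_families


toy_families = [(1, 3), (5, 5), (2, 5), (5, 1)]  # (male code, female code)
print(count_both_unemployed(toy_families))

# Figures reported above: 2 of 495 families have both parents unemployed.
print(round(100 * share_with_employed_parent(495, 2), 1))
```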
3.) This query counts the number of families of size 4, 5 and 6 with cable TV.
SQL code:
SELECT Count(Demographics.[Cable TV]) AS [CountOfCable TV], Demographics.[Family Size]
FROM Demographics
WHERE Demographics.[Cable TV] = 1 AND Demographics.[Family Size] >= 4
GROUP BY Demographics.[Family Size];
Figure 7
Description:
This query provides a count of families of size 4, 5 and 6 that have cable TV. The results
show that 16 families of 6 or more people, 24 families of 5 people and 59 families of 4 people
have cable TV. These totals cover only the larger families, which most likely include people of
a variety of ages living in the same household. This information could be useful to
television marketers who are looking to attract a specific age range of people within these larger
families.
4.) This query displays the days of the week in descending order based upon when part
time (working less than 35 hours a week) women shop the most
SQL code:
SELECT Transaction.Day, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought],
Demographics.[Female Work Hours]
FROM CouponID INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON CouponID.CouponID =
Transaction.CouponID
GROUP BY Transaction.Day, Demographics.[Female Work Hours]
HAVING (((Demographics.[Female Work Hours])=3))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The fourth query shown below shows the days of the week that the women who work part
time tend to shop on the most. The results show the majority of them tend to do their shopping
towards the end of the week, with the most shopping done on Saturdays. This is similar to the
results shown above in figure 6, which shows the most items being sold on Friday and Saturday.
These results relate back to the importance of having extra staff on the weekends in order to keep
the customers satisfied and in and out of the store in a timely fashion.
Figure 8
5.) This query displays the days of the week in descending order based upon when part
time (working less than 35 hours a week) men shop the most
SQL code:
SELECT Transaction.Day, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought],
Demographics.[Male Work Hours]
FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] =
Transaction.[Custome ID]
GROUP BY Transaction.Day, Demographics.[Male Work Hours]
HAVING (((Demographics.[Male Work Hours])=3))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The fifth query shown below shows the days of the week that the men who work part time
tend to shop the most. The male work hours column that is equal to 2 refers back to the data
dictionary which shows a value of 2 equaling work hours of less than 35 hours a week. The
results show that the majority of part time working men tend to do their shopping towards the end
of the week on days 5 and 6. In this regard, the data displays the same results as shown above for
part time working women. The analysis also showed a major difference between the amounts bought
by part-time working men and women. As shown below in figure X, the most shopping was
done on day 6, when part-time working men bought 366 units in comparison with 1750 units for
part-time working women. This information is important to the store owner because it tells them
about the shopping crowd they will be expecting.
Figure 9
6.) This query displays the top 5 customer ID’s of the people who purchased the most
ice cream
SQL code:
SELECT TOP 5 Transaction.[Custome ID], Transaction.[Item Type],
Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
FROM [Transaction]
GROUP BY Transaction.[Custome ID], Transaction.[Item Type]
HAVING (((Transaction.[Item Type])=13))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The sixth query shown below shows the top 5 customers that bought the most of item 13,
which is ice cream. It shows that the most ice cream bought by a single customer was 90 units.
This information can be used to analyze specific demographic information about these customers
and to find any relationship among them that could increase sales in the future. The same
analysis can be run for any item and is also important for the distribution of coupons, which
increases the chance of the best customers returning to the store. The customers will in turn be
rewarded for their top purchases.
Figure 10
7.) This query displays the items in order from the most to least purchased for
customers that make an income of $15,000-$20,000.
SQL code:
SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description,
Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Income
HAVING (((Demographics.Income)=4))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Figure 11 below shows the most purchased items by customers within the income range of
$15,000-$20,000
Description:
Query 7 shows that the most purchased item by customers who make between
$15,000 and $20,000 is eggs, followed by snacks and butter. Furthermore, since the total number
of transactions for each item is fairly close to the total number of units of each item purchased,
these customers most likely buy only one unit of an item at a time. Encouraging “buying in bulk” by
offering discounts when customers purchase more than one item could increase the number of
items bought per transaction.
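The comparison between the count of transactions and the sum of units can be expressed as a units-per-transaction ratio, where a value close to 1 means customers usually buy one unit at a time. The sketch below assumes a simple list of (description, units bought) rows, and the sample values are made up rather than taken from the project data.

```python
from collections import defaultdict


def units_per_transaction(rows):
    """rows: iterable of (description, units_bought), one entry per transaction line."""
    count = defaultdict(int)
    units = defaultdict(int)
    for description, bought in rows:
        count[description] += 1       # corresponds to CountOfItem Type
        units[description] += bought  # corresponds to SumOfUnits Bought
    return {d: units[d] / count[d] for d in count}


sample = [("Eggs", 1), ("Eggs", 1), ("Eggs", 2), ("Snacks", 1), ("Snacks", 1)]
ratios = units_per_transaction(sample)
for description, ratio in ratios.items():
    print(description, round(ratio, 2))
```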
8.) This query displays the number of items sold in order from highest to lowest for
customers with an income range of greater than $75,000
SQL Code:
SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description,
Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Income
HAVING (((Demographics.Income)=11))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
Query number 8 shows that the most purchased item for customers with an income
greater than $75,000 is cat food. This information may be useful for analyzing other important
demographic information about the customers in this income range, such as number of pets. The
next top selling items are snacks, cereal and crackers, which are common to all of the
income ranges. This shows that these common items are most likely unaffected by income
differences and are purchased by a majority of the customers.
Figure 12 below shows the most purchased items by customers within the income range of
greater than $75,000
9.) This query displays the number of items sold in order from highest to lowest for
customers with an income range of $55,000-$65,000.
SQL Code:
SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description,
Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Income
HAVING (((Demographics.Income)=9))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
As shown in Figure 13 below, for customers that make an income between
$55,000 and $65,000, eggs were the top selling item followed by snacks and butter. These are
common, everyday grocery items, the same that were observed to be purchased very frequently
by many other customer demographics, including lower income groups. This suggests that these
items may be immune to income discrepancies and are widely consumed by households within a
wide range of financial situations.
Figure 13
10.) This query displays the day of the week that retired women shop the most
SQL Code:
SELECT Transaction.Day, Count(Demographics.[Female Work Hours]) AS [CountOfFemale Work Hours],
Demographics.[Female Work Hours]
FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] =
Transaction.[Custome ID]
GROUP BY Transaction.Day, Demographics.[Female Work Hours]
HAVING (((Demographics.[Female Work Hours])=4))
ORDER BY Count(Demographics.[Female Work Hours]) DESC;
Description:
The results below show that the majority of retired women shop on Fridays and
Saturdays. This is similar to the overall shopping pattern found in query 2, where the most
items are sold at the store on Fridays and Saturdays.
11.) This query displays the days of the week that retired men shop the most
SQL Code:
SELECT Transaction.Day, Count(Demographics.[Male Work Hours]) AS [CountOfMale Work Hours],
Demographics.[Male Work Hours]
FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] =
Transaction.[Custome ID]
GROUP BY Transaction.Day, Demographics.[Male Work Hours]
HAVING (((Demographics.[Male Work Hours])=4))
ORDER BY Count(Demographics.[Male Work Hours]) DESC;
Description:
The table below shows the days in which retired men go shopping the most in order from
highest to lowest. The results show that the majority of retired men go shopping on Fridays
which is the same as the results of retired women.
12.) This query shows how the sales for one particular item, pizza, vary from
week to week. (Appendix 1)
SQL code:
SELECT Count(Transaction.[Item Type]) AS [CountOfItem Type], Transaction.Week,
Transaction.[Item Type]
FROM [Transaction]
GROUP BY Transaction.Week, Transaction.[Item Type]
HAVING (((Transaction.[Item Type])=16))
ORDER BY Count(Transaction.[Item Type]) DESC;
Explanation:
This query showed how the sales of one particular item, in this case pizza, vary from
week to week. The sales do not fluctuate much from one week to the next for any of the
given weeks. Because the sales of pizza are steady, the store should keep a
constant inventory of it.
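One way to put a number on how steady the weekly sales are is the coefficient of variation (standard deviation divided by the mean). This measure was not part of the original analysis; the sketch below is offered only as an illustration, and the weekly counts in it are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(weekly_counts):
    """Population standard deviation of the weekly counts divided by their mean."""
    return pstdev(weekly_counts) / mean(weekly_counts)


# Hypothetical weekly transaction counts for one item.
steady = [20, 22, 19, 21, 20, 18]
spiky = [5, 40, 3, 38, 4, 30]

print(round(coefficient_of_variation(steady), 3))
print(round(coefficient_of_variation(spiky), 3))
```

A small value, as in the first series, supports holding a constant inventory, while a large value would suggest planning stock week by week.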
13.) This query shows the average number of units bought, by item type, for families with children.
An average value of units bought is given for values 1-8, which the data dictionary indicates
are all families with children. As expected, the numbers increase as the number of children
in the family increases. The results are shown in Appendix I.
SQL Code:
SELECT Demographics.Children, Transaction.[Item Type], Avg(Transaction.[Units Bought])
AS [AvgOfUnits Bought]
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY Demographics.Children, Transaction.[Item Type]
HAVING ((Not (Demographics.Children)=0));
14. This query shows the number of coupons, and their origin, used by each ethnicity
SQL Code:
SELECT Demographics.Ethnicity, Count(CouponID.CouponOrigin) AS CountOfCouponOrigin,
CouponID.CouponOrigin
FROM ItemType INNER JOIN (Demographics INNER JOIN
(CouponID INNER JOIN [Transaction] ON CouponID.CouponID = Transaction.CouponID) ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY Demographics.Ethnicity, CouponID.CouponOrigin;
Description:
This query shows the number of coupons that each ethnicity used. As the data dictionary
states, the ethnicity value of 1 represents white people and coupon origin 19 represents ActNow.
As the figure above shows, for example, there was one transaction by a customer from the white
ethnic group that used a coupon from ActNow. This information could be very important to the
store owner in order to target specific demographic groups with specific coupons. This
information also shows that Black customers (ethnicity value 2) are the ethnic group with the
most transactions made without a coupon, with a count of 4271. This information could also be
useful to target these customers with more coupons and try to increase the store’s sales.
15. Top items that people of Ethnicity number 3 bought (Hispanic)
SQL Code:
SELECT ItemType.Description, Demographics.Ethnicity, Sum(Transaction.[Units Bought]) AS
[SumOfUnits Bought]
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Ethnicity
HAVING (((Demographics.Ethnicity)=3))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The results below show that the top purchased item by people of ethnicity number
3, which according to the data dictionary is Hispanic, was cereal, followed by snacks
and detergents.
16. Top purchased items by ethnicity number 2 (Blacks)
SQL Code:
SELECT ItemType.Description, Demographics.Ethnicity, Sum(Transaction.[Units Bought]) AS
[SumOfUnits Bought]
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Ethnicity
HAVING (((Demographics.Ethnicity)=2))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
The results below show that the top purchased item by people of ethnicity number
2, which according to the data dictionary is Black, was eggs, followed by snacks and
butter.
17.) This query displays the total number of items sold in the weeks of the summer
SQL Code:
SELECT ItemType.Description, Transaction.Week, Sum(Transaction.[Units Bought]) AS
[SumOfUnits Bought]
FROM ItemType INNER JOIN [Transaction] ON ItemType.[Item Type] = Transaction.[Item Type]
WHERE (((Transaction.Week)>=26 And (Transaction.Week)<=38)) OR
(((Transaction.Week)>=78 And (Transaction.Week)<=90))
GROUP BY ItemType.Description, Transaction.Week
ORDER BY Transaction.Week, Sum(Transaction.[Units Bought]) DESC;
Description:
This query shows that the customers’ buying behaviors in the summer are very similar to the
buying behaviors they exhibit throughout the rest of the year. The top selling item as shown below
is snacks, which is common among all of the evaluations. The next top selling items are eggs,
crackers and ice cream. This information shows that it may be beneficial to the company to offer
sales on these items throughout the weeks of summer.
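The summer filter can be sketched outside of Access as well. The example below keeps the two week ranges from the query (26-38 and 78-90) and totals units bought per item; the row layout and the sample values are assumptions made for illustration.

```python
# Week ranges treated as summer in the query (first and second year of data).
SUMMER_RANGES = [(26, 38), (78, 90)]


def is_summer_week(week):
    """True if the week number falls inside either summer range (inclusive)."""
    return any(low <= week <= high for low, high in SUMMER_RANGES)


def summer_units_by_item(rows):
    """rows: iterable of (description, week, units_bought); returns totals, largest first."""
    totals = {}
    for description, week, units in rows:
        if is_summer_week(week):
            totals[description] = totals.get(description, 0) + units
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


sample = [
    ("Snacks", 26, 3), ("Eggs", 30, 2), ("Snacks", 80, 4),
    ("Ice cream", 50, 9), ("Eggs", 91, 5),  # weeks 50 and 91 fall outside summer
]
print(summer_units_by_item(sample))
```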
(The rest of the results are displayed in the Access file.)
18.) This query displays the number of items sold in order from highest to lowest
for customers with an income range of $45,000-$55,000
SQL Code:
SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description,
Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] =
Transaction.[Item Type]
GROUP BY ItemType.Description, Demographics.Income
HAVING (((Demographics.Income)=8))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
Query 18 shows the top selling item for customers with an income range between
$45,000 and $55,000 was cat food followed by snacks, butter, eggs and cereal.
19.) This query shows the number of each item that was sold per day for days 1-7.
SQL Code:
SELECT Transaction.Day, ItemType.Description, Sum(Transaction.[Units Bought]) AS
[SumOfUnits Bought]
FROM ItemType INNER JOIN (CouponID INNER JOIN [Transaction] ON
CouponID.CouponID = Transaction.CouponID) ON ItemType.[Item Type] = Transaction.[Item Type]
GROUP BY Transaction.Day, ItemType.Description;
Description:
The results from query 19 shown below give the sum of each item purchased on days
1-7. For example, as shown in figure X below, 181 units of bacon were purchased on Monday in
comparison to 193 units of bacon sold on Tuesday. The sum of units of ice cream bought
increases significantly on days 5 and 6, which are Friday and Saturday. This makes sense because
it is the weekend, and people are more likely to want ice cream on weekends for
activities such as parties. This variety in purchases can be important to store owners for
predicting sales and inventory.
20. This query shows the top 5 items sold on day 6 (Saturday)
SQL code:
SELECT Top 5 Transaction.Day, ItemType.Description, Sum(Transaction.[Units Bought]) AS
[SumOfUnits Bought]
FROM ItemType INNER JOIN (CouponID INNER JOIN [Transaction] ON
CouponID.CouponID = Transaction.CouponID) ON ItemType.[Item Type] = Transaction.[Item Type]
GROUP BY Transaction.Day, ItemType.Description
HAVING (((Transaction.Day)=6))
ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
As shown below in figure X, the top 5 items bought on Saturdays are snacks, eggs, butter,
cereal and cook. This information can be very important to store owners to predict inventory
and sales. This information can also help in the planning of item sales which would draw in
additional customers to the store.
2.5 Discussion:
Once the database, queries and graphs were created, important trends were observed from the data.
A summary of the data analysis is provided in 4 main categories:
∙ Gender
∙ Income
∙ Week of the year
∙ Work hours
Work hours:
The different work hours among the customers in the database were examined for any
potential buying patterns or trends. It was found that the most popular day to shop for both
part-time men and part-time women was Saturday. Although part-time women bought far more units
on that day than part-time men (1750 compared to 366), both genders
preferred to do their shopping on Saturdays.
As shown in queries 10 and 11, the same calculation was done for retired males and females,
and the results showed that the most popular day to shop for both genders was Friday. In this
case, the number of Friday shoppers for both males and females was nearly double the number of
Saturday shoppers, but Saturday was still the second busiest day to shop. All of these results
show that the stores are the busiest on the weekends. This concept is very important for store
managers to recognize in order to have the proper staff on hand to keep customers happy and to
have a fully stocked inventory to avoid running out of items.
Gender
Specific trends related to gender were discovered when analyzing the queries our group
created, which led us to many important conclusions. The female customers in the data bought a
much higher number of items than the men, making them the bigger overall shoppers in the
database. The query results allow a comparison between male and female part-time
workers and their shopping habits (units bought vs. hours of work). This particular example
shows that in total on Saturdays the women bought 1750 items while the men bought just 366 items.
This difference between the genders is surprisingly large.
The same pattern was not present when we analyzed the shopping patterns of retired males
and females. From the queries we can observe that retired men shopped about as much as retired
women, and the totals of their items bought were also very similar. We also found that the most
frequent day on which retired individuals go shopping is Friday, followed by Saturday. This may
be the result of sales that occur in the shop on these days. The total number of items retired
males bought on Fridays was 4232, in comparison to 4297 for retired females.
Income
After finalizing the queries and the results, the group came to the conclusion that
customers’ income did in fact have a large influence on their buying behavior. First,
query 7 shows that for customers with a low income, between $15,000 and $20,000, the most
commonly bought item was eggs, followed by snacks and butter. The count of transactions for each
item was also very close to the sum of units bought, indicating that the majority of these
customers buy one unit of each item at a time. One way to possibly increase these numbers is to
distribute more coupons for these items to these specific demographic groups.
Across most of the data, item number 17, snacks, appeared to be the most popular item.
However, the most purchased item within each income range
differed slightly. From the lowest income group to the highest income group, the most frequently
purchased items were as follows: snacks, cat food, eggs, cereal, and cat food. Although the
distinct reason for these observations is not clear, more research may be beneficial to the store.
Knowing that these are the most frequently purchased items, management at the retail store could
make some rearrangements to the store layout. For example, they could place these items towards
the front of the store so that customers could find them easily, or they could place these items in
the back of the store so that customers would have to walk past many different items on their
way, and potentially buy some items that they weren’t looking for. Overall, customer income
does have an effect on customers’ overall buying behavior at the store, both in the coupons
they use and the items they purchase.
Weeks of the year:
Another factor that was analyzed from the data was how the sales of certain items changed
from week to week. One of our queries (number 12) showed the sales from one week to
another for item number 16, which was pizza. As the results given in Appendix I show, regardless
of season or week, the sales of pizza did not differ much from one week to another. There were
one or two fluctuating weeks in the results, but this fluctuation could have been the result of a
birthday party or another special occasion that happens once or twice a year. There were no major
fluctuations in pizza sales from one week to another.
The group also analyzed the top selling items in the weeks of the summer, and it was found
that snacks were the highest selling item followed by eggs and crackers. These are
common items that were found to be top sellers in many of our analyses. From one week to
another there were slight changes in the units of these items purchased, but nothing drastic.
This means that the store does not need to make any special preparations for the summer season
and can spend the money that would have gone to such preparations elsewhere.
Figure A-1: Sum of Units bought by day of the week
This pie chart shows the percentage of items bought by day. From this chart we can clearly
tell that the most sales occur on Fridays and Thursdays. This is very useful in determining the
best day to restock the store. This information was obtained from a query created in
Access relating the number of items bought to each day.
Figure A-2: Graphs showing Ice cream and Soft Drinks during the summer
The two graphs shown here were created by looking at the items sold during the
summer. Ice cream and soft drinks were specifically selected as the 2 choices for comparison
because they both seem to follow a similar pattern over the summer. For the most part they tend
to have similar spikes around the same time periods of the 1st and 2nd summers. This information
can help stores prepare and plan their inventory for these time periods to optimize sales.
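The visual impression that two items rise and fall together can be checked with a Pearson correlation coefficient, where values near 1 indicate that the weekly series move together. This check was not part of the original analysis, and the weekly unit counts below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical units sold per summer week for the two items.
ice_cream = [30, 45, 60, 52, 41]
soft_drinks = [28, 40, 58, 50, 39]

print(round(pearson(ice_cream, soft_drinks), 3))
```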
Figure A-3: Pie Chart of percentage of items sold by item type
This pie chart can also help the stores prepare their stock. They can use it to rank how
popular each item type is. The more popular an item type is, the more the store should consider
carrying a wider range of that type, for example by stocking different flavors or different
brands. For the items that are not so popular, the store can consider either advertising them
better so that they sell more, or slowly decreasing the amount of the item stocked on the shelves.
Figure A-4: Graph of the Items most bought by people with Income 10
Stores should design their layout and inventory around the customers of the area. If a
specific demographic makes up a large share of the customers, as in this figure, the items that
group buys most can give insight on how to design the store around its customers. This idea
optimizes revenue and helps gain loyal customers.
Figure A-5
These two pie charts show the percentage of items bought by males and females, respectively,
who work over 35 hours. Using this information the store can better understand its customers’
needs and cater to them. For example, since most males that work over 35 hours shop on a
Wednesday, the store can make small changes that would make the shop a more suitable
environment for that customer without completely losing the interest of the other customers.
Conclusion:
This project served as great practice in taking a large quantity of meaningless data,
analyzing it, and then transforming it into meaningful information. The first part of the project
was used to see how reliable this data is. The MAD and MSE calculations show that this
data is in fact reliable. The preferred method is exponential smoothing with alpha=0.2,
because the MAD values are lower for exponential smoothing than for the 3-point moving average
for both chosen items. The tracking signal (TS) was also reasonably low, so there appeared to be
very little bias. Many relationships
were established among the various categories of data in order to observe basic patterns and
trends. The queries created in Access proved to be one of the most essential parts of this project.
The queries allowed the analyst to view only specific and desirable attributes of the data and
make conclusions about the data from these results. This also allows the analyst to obtain very
specific information about the data and to observe how specific attributes interact with one
another.
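The comparison of the two forecasting methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the spreadsheet used in the project: the exponential smoothing forecast is seeded with the first actual value, the tracking signal is taken as the running sum of forecast errors divided by the MAD, and the demand series is hypothetical.

```python
def moving_average_forecasts(demand, window=3):
    """Forecast each period as the mean of the previous `window` actual values."""
    return [sum(demand[t - window:t]) / window for t in range(window, len(demand))]


def exponential_smoothing_forecasts(demand, alpha=0.2):
    """F[t+1] = alpha * A[t] + (1 - alpha) * F[t], seeded with F[0] = A[0] (assumption)."""
    forecasts = [demand[0]]
    for actual in demand[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


def error_metrics(actuals, forecasts):
    """Return (MAD, MSE, tracking signal) for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    mad = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    tracking_signal = sum(errors) / mad if mad else 0.0
    return mad, mse, tracking_signal


demand = [10, 12, 11, 13, 12, 14]  # hypothetical weekly units for one item

ma = moving_average_forecasts(demand)
es = exponential_smoothing_forecasts(demand)

# Compare both methods over the same periods (those the moving average can forecast).
print("3-point MA:", error_metrics(demand[3:], ma))
print("ES a=0.2:  ", error_metrics(demand[3:], es[3:]))
```

The method with the lower MAD over the same periods is the one that would be preferred, and a tracking signal far from zero would indicate a forecast that is consistently too high or too low.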
Another crucial takeaway from this project is the importance of data visualization. The use
of various charts and graphics to display data makes patterns and trends a lot easier to interpret. In
conclusion, data analysis as a whole can be complicated but it is absolutely vital in order to make
valuable and correct conclusions from large volumes of data.
Appendix I:
Table A-1: 3 Point Moving average for cat food
Table A-2: Exponential smoothing for cat food (alpha=0.2)
Figure A-6: Graph Exponential Smoothing for cat food (alpha=0.2)
Table A-3: 3 Point Moving average for cereal
Table A-4: Exponential smoothing for cereal (alpha=0.2)
Figure A-7: Graph Exponential Smoothing for cereal (alpha=0.2)
Full query results located in Access file
Roles
Wilson: Moving Average and Exponential Smoothing, Query graphs
Jessica: K-means graphs and question 6 (open ended question), Queries and descriptions
Aaron: Regression, Queries and descriptions

SQL

  • 1. 1 Big Data Analysis Project Group #18 Aaron Fuhrman Jessica Demuro Wilson Tapia AJW Consultants, Inc. The Pennsylvania State University 12/9/14
  • 2. 2 Big Data Analysis Project Aaron Fuhrman Jessica Demuro Wilson Tapia AJW Consultants, Inc. The Pennsylvania State University 12/9/14 Table of Contents 1.) Introduction………………………………………………………………………………….2 2.) Methods………………………………………………………………………………….5 2.1) Moving Average and Exponential Smoothing………………………………....3 2.2) Regression Analysis…………….………………………………………………4 2.3) K-Means Clustering…………………………….……………………………...4 2.4) Regression……………………………………………………………………. 7 3) Results……………………………………………………………………………………5 Introduction: This assignment uses predictive analytics methods to analyze the given transaction data and make meaningful recommendations for the store manager to improve sales. Predictive analytics calculations including moving average, exponential smoothing, linear regression and k-means clustering are used and included in this report. There is a wide variety of data included in the transaction data that varies from week to week. The first calculation in this analysis is a moving average and exponential smoothing calculation which calculates the prediction error for weeks 10 through 20 of the given data. The Mean Squared Error, Mean Absolute Deviation, and Tracking Signal are then calculated from the given data. The next calculation involves a regression analysis of the given transaction data. Lastly, the k-means clustering algorithm is used to cluster the customers in the database based on their buying patterns. The data is then analyzed and used to make a final conclusion about the relationships existing among the customers and transaction data.
  • 3. 3 Problem Description: The customer base and transaction data is to be analyzed in order to develop appropriate strategies to increase sales. To do this, a series of predictive analytic methods will be used. Some of these methods include moving average and exponential smoothing, regression and k-means calculations. After these tests are performed, they will be analyzed in order to make recommendations to the management. Methodology: A variety of predictive analytics methods were used in this project. They are described in detail below. 2.1 Moving Average and Exponential Smoothing This part of the database analysis entailed the forecast of the items being sold and also the error of this forecast compared to the actual value for weeks 10-20. More specifically, we had to calculate the Mean Absolute Deviation (MAD), Mean Square Error (MSE) and the Tracking Signal (TS). To obtain these numbers we had to first create a forecast of the data by using two methods. One was a moving average and the other exponential smoothing. For the moving average we used an 3pt moving average which created a prediction for the following week. To create this forecast we calculated the average of the previous three weeks and used that number as the prediction for the next. This method was repeated until we received a forecast for weeks 12 through 20. Theoretically, the more points this moving average contains, the more accurate it becomes. As a subset, we chose cereal and cat food to represent the process. The exponential smoothing uses an alpha value, the forecast and the actual value for the previous week to predict the next. One important piece of information to note is that the forecast must be integers. You will always round up to the next whole integer if a decimal was present. The reason behind this is because you cannot purchase half or a piece of an item so a full item must be purchased to fill the void. 
What the alpha value does in the exponential smoothing is give a weight to how much the previous actual value affects the forecast for the new value. In this case we chose an alpha value of .2. This alpha value was chosen arbitrarily with the constraint that it had to be between 0 and 1. We used the results with the actual values to create a graph and 2 Tables that used the weeks and the amount of units bought (Figures A-1 thought A-6 and Tables 1-4) Using the forecasted values and the actual values, we were able to obtain the MAD, MSE, and TS. The MAD measures the parameters of the data. In other words by how much does the forecast deviate from the actual values without regard to sign. The Mean Square Error measures the difference between the forecast and the actual values. Last we have the Tracking Signal which measures the bias in the forecast. The MAD was calculated by taking the differences between the actual and forecast and summing all this differences without regard to sign, in other words, taking the absolute value of each difference before adding it to the rest.. Lastly, the differences were divided by the number of inputs ,or averaged, and this gave you your MAD. The MSE was calculated in a similar manner, except instead of taking the absolute value of these numbers, they were squared and then again averaged. For the last item, the TS was calculated but
  • 4. 4 summing all of the differences between the actual values and forecasted values and dividing this sum by the MAD of that respective tem. When translated to excel, the main functions used were ABS, ROUNDUP, sqr, and other basic mathematical computations. 2.2 Regression Analysis Using all 104 weeks of the transaction data for the top 10 items, pivot tables were created in excel. Once the top 10 items where obtained, a regression analysis was done in minitab which summarized the ANOVA table, correlation coefficient, R-sq, F-values, fitted models and the regression equation. A table was created that compared the different item types and the R square value. For question #3 a pivot table was used. However we use only 60 weeks of the data and the top three items sold. Regression was done in Minitab to obtain the regression equation. In question #4 we used this regression equation and plugged the range of weeks (61-80) into the equation. Different values were then obtained. 2.3 k-Means Clustering The k-means clustering algorithm was used to cluster the customers based on their buying patterns. The information obtained was used to identify any correlation between the customer’s buying patterns and demographic information. Three different sets of two items were chosen and clustered with k=3. First, the items of eggs and ice cream (items 12 and 13) were chosen. A pivot table was then made in Excel to display each customer ID with the units of eggs and ice cream purchased. The data was then pasted into the sample k-means Excel spreadsheet provided, which then calculated the centroids based on the designated amount of clusters. In the first calculation containing eggs and ice cream, the items were also clustered with k=4,5, and 6 as well. 2.4 Regression Figure 2.1: Top 10 selling items over two years (104 weeks)
  • 5. 5 This screenshot was created in microsoft excel with a pivot table by using the “ item type” as columns the “weeks” as rows and the sum of units bought as the value, it shows the top 10 items that where sold the most in the 104 weeks Table 2.1: Shows the ranking of the top 10 item type soldwith the total of their unit bought in the 2 years. Item Type Ranking Total Volume 17 1 7776
  • 6. 6 12 2 6879 3 3 6254 8 4 5165 5 5 4961 2 6 4006 4 7 3930 13 8 3538 9 9 3382 11 10 2365 The graphs below represent the time series plot of the top 10 items sold and their units bought in the 104 weeks. Graph 2.1: Time series plot of the Weeks vs Number of units bought on item 17
  • 7. 7 Graph 2.2: Time series plot of the Weeks vs Number of units bought on item 12 Graph 2.3: Time series plot of the Weeks vs Number of units bought on item 3
  • 8. 8 Graph 2.4: Time series plot of the Weeks vs Number of units bought on item 8 Graph 2.5: Time series plot of the Weeks vs Number of units bought on item 5
  • 9. 9 Graph 2.6: Time series plot of the Weeks vs Number of units bought on item 2 Graph 2.7: Time series plot of the Weeks vs Number of units bought on item 4
  • 10. 10 Graph 2.8: Time series plot of the Weeks vs Number of units bought on item 13 Graph 2.9: Time series plot of the Weeks vs Number of units bought on item 9
  • 11. 11 Graph 2.10: Time series plot of the Weeks vs Number of units bought on item 11 The regression analysis was performed using minitab . Minitab summarized for us the analysis of variances, ANOVA table, correlation coefficient, R-sq, F-values, fitted models and the regression equation
  • 12. 12 The Simple Linear Regression Analysis of each one of the items against the week are shown below. This was done by carrying out a detailed Analysis of Variance which shows values like degrees of freedom, correlation coefficient, P-values, F-values and the Regression Equation. They were all done using Minitab. RegressionAnalysis: 2 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 745.1 745.1 2.70 0.104 Weeks 1 745.1 745.1 2.70 0.104 Error 102 28196.8 276.4 Total 103 28942.0 Model Summary S R-sq R-sq(adj) R-sq(pred) 16.6265 2.57% 1.62% 0.00% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 43.20 3.28 13.15 0.000 Weeks -0.0892 0.0543 -1.64 0.104 1.00 Regression Equation Y = 43.20 - 0.0892 Weeks Fits and Diagnostics forUnusual Observations Obs 2 Fit Resid StdResid 40 73.00 39.63 33.37 2.02 R 44 0.00 39.28 -39.28 -2.37 R 45 0.00 39.19 -39.19 -2.37 R 46 0.00 39.10 -39.10 -2.36 R 47 0.00 39.01 -39.01 -2.36 R 49 0.00 38.83 -38.83 -2.35 R 50 0.00 38.74 -38.74 -2.34 R 51 0.00 38.65 -38.65 -2.34 R
  • 13. 13 52 0.00 38.56 -38.56 -2.33 R 101 72.00 34.19 37.81 2.31 R R Large residual RegressionAnalysis: 3 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 1042 1042 0.71 0.402 Weeks 1 1042 1042 0.71 0.402 Error 102 150216 1473 Total 103 151258 Model Summary S R-sq R-sq(adj) R-sq(pred) 38.3758 0.69% 0.00% 0.00% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 65.67 7.58 8.66 0.000 Weeks -0.105 0.125 -0.84 0.402 1.00 Regression Equation Y = 65.67 - 0.105 Weeks Fits and Diagnostics forUnusual Observations Std Obs 3Fit Resid Resid 2 167.00 65.46 101.54 2.70 R 22 250.00 63.35 186.65 4.91 R 80 167.00 57.23 109.77 2.89 R 96 156.00 55.55 100.45 2.66 R R Large residual
  • 14. 14 RegressionAnalysis: 4 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 37.6 37.63 0.09 0.766 Weeks 1 37.6 37.63 0.09 0.766 Error 102 43035.7 421.92 Total 103 43073.3 Model Summary S R-sq R-sq(adj) R-sq(pred) 20.5407 0.09% 0.00% 0.00% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 38.84 4.06 9.57 0.000 Weeks -0.0200 0.0671 -0.30 0.766 1.00 Regression Equation Y = 38.84 - 0.0200 Weeks Fits and Diagnostics forUnusual Observations Std Obs 4Fit Resid Resid 98 104.00 36.88 67.12 3.32 R R Large residual RegressionAnalysis: 5 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 2430 2429.9 5.65 0.019 Weeks 1 2430 2429.9 5.65 0.019
  • 15. 15 Error 102 43872 430.1 Total 103 46302 Model Summary S R-sq R-sq(adj) R-sq(pred) 20.7392 5.25% 4.32% 2.04% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 56.15 4.10 13.71 0.000 Weeks -0.1610 0.0677 -2.38 0.019 1.00 Regression Equation Y = 56.15 - 0.1610 Weeks Fits and Diagnostics forUnusual Observations Obs 5 Fit Resid StdResid 44 0.00 49.07 -49.07 -2.38 R 45 0.00 48.91 -48.91 -2.37 R 46 0.00 48.75 -48.75 -2.36 R 47 0.00 48.59 -48.59 -2.35 R 49 0.00 48.27 -48.27 -2.34 R 50 0.00 48.10 -48.10 -2.33 R 51 0.00 47.94 -47.94 -2.32 R 52 0.00 47.78 -47.78 -2.32 R 84 88.00 42.63 45.37 2.21 R 88 87.00 41.99 45.01 2.20 R R Large residual RegressionAnalysis: 8 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 7345 7345.1 14.97 0.000 Weeks 1 7345 7345.1 14.97 0.000
  • 16. 16 Error 102 50042 490.6 Total 103 57387 Model Summary S R-sq R-sq(adj) R-sq(pred) 22.1497 12.80% 11.94% 9.99% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 64.36 4.38 14.71 0.000 Weeks -0.2799 0.0723 -3.87 0.000 1.00 Regression Equation Y = 64.36 - 0.2799 Weeks Fits and Diagnostics forUnusual Observations Obs 8 Fit Resid StdResid 44 0.00 52.04 -52.04 -2.36 R 45 0.00 51.76 -51.76 -2.35 R 46 0.00 51.48 -51.48 -2.34 R 47 0.00 51.20 -51.20 -2.32 R 49 0.00 50.64 -50.64 -2.30 R 50 0.00 50.36 -50.36 -2.28 R 51 0.00 50.08 -50.08 -2.27 R 52 0.00 49.80 -49.80 -2.26 R R Large residual RegressionAnalysis: 9 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 2660 2660.0 12.48 0.001 Weeks 1 2660 2660.0 12.48 0.001 Error 102 21742 213.2 Total 103 24402
  • 17. 17 Model Summary S R-sq R-sq(adj) R-sq(pred) 14.5999 10.90% 10.03% 8.04% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 41.36 2.88 14.34 0.000 Weeks -0.1685 0.0477 -3.53 0.001 1.00 Regression Equation Y = 41.36 - 0.1685 Weeks Fits and Diagnostics forUnusual Observations Obs 9 Fit Resid StdResid 9 69.00 39.85 29.15 2.03 R 27 76.00 36.82 39.18 2.71 R 44 0.00 33.95 -33.95 -2.34 R 45 0.00 33.78 -33.78 -2.33 R 46 0.00 33.61 -33.61 -2.31 R 47 0.00 33.45 -33.45 -2.30 R 49 0.00 33.11 -33.11 -2.28 R 50 0.00 32.94 -32.94 -2.27 R 51 0.00 32.77 -32.77 -2.26 R 52 0.00 32.60 -32.60 -2.24 R 88 57.00 26.54 30.46 2.11 R R Large residual RegressionAnalysis: 11 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 104.4 104.4 0.46 0.499 Weeks 1 104.4 104.4 0.46 0.499 Error 102 23169.6 227.2 Total 103 23274.0
  • 18. 18 Model Summary S R-sq R-sq(adj) R-sq(pred) 15.0716 0.45% 0.00% 0.00% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 24.49 2.98 8.23 0.000 Weeks -0.0334 0.0492 -0.68 0.499 1.00 Regression Equation Y = 24.49 - 0.0334 Weeks Fits and Diagnostics forUnusual Observations Std Obs 11 Fit Resid Resid 88 102.00 21.56 80.44 5.40 R 101 52.00 21.12 30.88 2.09 R 104 67.00 21.02 45.98 3.11 R R Large residual RegressionAnalysis: 12 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 11 10.85 0.01 0.935 Weeks 1 11 10.85 0.01 0.935 Error 102 164700 1614.71 Total 103 164711 Model Summary S R-sq R-sq(adj) R-sq(pred) 40.1834 0.01% 0.00% 0.00%
  • 19. 19 Coefficients Term Coef SECoef T-Value P-Value VIF Constant 65.58 7.94 8.26 0.000 Weeks 0.011 0.131 0.08 0.935 1.00 Regression Equation Y = 65.58 + 0.011 Weeks Fits and Diagnostics forUnusual Observations Std Obs 12 Fit Resid Resid 27 154.00 65.87 88.13 2.21 R 38 152.00 65.99 86.01 2.15 R 41 148.00 66.02 81.98 2.05 R 77 147.00 66.41 80.59 2.02 R 81 171.00 66.45 104.55 2.63 R 96 219.00 66.61 152.39 3.85 R R Large residual RegressionAnalysis: 13 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 939.3 939.3 2.97 0.088 Weeks 1 939.3 939.3 2.97 0.088 Error 102 32214.7 315.8 Total 103 33154.0 Model Summary S R-sq R-sq(adj) R-sq(pred) 17.7716 2.83% 1.88% 0.00% Coefficients
  • 20. 20 Term Coef SECoef T-Value P-Value VIF Constant 39.27 3.51 11.19 0.000 Weeks -0.1001 0.0580 -1.72 0.088 1.00 Regression Equation Y = 39.27 - 0.1001 Weeks Fits and Diagnostics forUnusual Observations Std Obs 13Fit Resid Resid 1 82.00 39.17 42.83 2.46 R 9 83.00 38.37 44.63 2.55 R 101 82.00 29.16 52.84 3.03 R R Large residual RegressionAnalysis: 17 versus Weeks Analysis of Variance Source DF Adj SS Adj MS F-Value P-Value Regression 1 74 73.74 0.07 0.788 Weeks 1 74 73.74 0.07 0.788 Error 102 103299 1012.73 Total 103 103372 Model Summary S R-sq R-sq(adj) R-sq(pred) 31.8235 0.07% 0.00% 0.00% Coefficients Term Coef SECoef T-Value P-Value VIF Constant 76.24 6.29 12.13 0.000 Weeks -0.028 0.104 -0.27 0.788 1.00
  • 21. 21 Regression Equation Y = 76.24 - 0.028 Weeks Fits and Diagnostics forUnusual Observations Obs 17 Fit Resid Std Resid 44 0.00 75.01 -75.01 -2.37 R 45 0.00 74.98 -74.98 -2.37 R 46 0.00 74.95 -74.95 -2.37 R 47 0.00 74.92 -74.92 -2.37 R 49 0.00 74.87 -74.87 -2.36 R 50 0.00 74.84 -74.84 -2.36 R 51 0.00 74.81 -74.81 -2.36 R 52 0.00 74.78 -74.78 -2.36 R 104 136.00 73.32 62.68 2.01 R R Large residual Table 2.2: Correlation coefficient and Adjusted correlation coefficient of the top 10 item types sold. Item Type R^2 Adj_R^2 17 0.07% 0.00% 12 2.83% 1.88% 3 0.01% 0.00% 8 0.45% 0.00% 5 10.90% 10.03% 2 12.80% 11.94% 4 5.25% 4.32% 13 0.09% 0.00% 9 0.69% 0.00%
  • 22. 22 11 2.57% 1.62% The correlation coefficients above are shown in percentage.
  • 23. 23 3) Figure 3.1: Top 3 selling items over two years (60 weeks) In order to perform question #3 team constructed a pivot table in excel from the transactional data. In the pivot table we used Item Type as a Column, Weeks as Rows, and Sum of units bough
  • 24. 24 in the . From this we obtained that the top values where 17,12 & 3 with a grand total of 4487, 3815, and 3526 respectably. We used Minitab to obtain “The Fitted Line Plot” and the regression analysis for the top three weeks. From this analysis we were able to obtain the desire equations for question#4. We also perform an OLS regression analysis in excel as well. Grand Total 3526 3815 4487 11828 Row Labels 3 12 17 Grand Total Graph 3.1: Fitted Line Plot the first 60 weeks vs. number of units sold for item 17 RegressionAnalysis: 17 versus Weeks The regression equation is
  • 25. 25 Y = 85.11 + 0.0423 Weeks S = 22.5467 R-Sq = 0.1% R-Sq(adj) = 0.0% Analysis of Variance Source DF SS MS F P Regression 1 27.0 27.017 0.05 0.819 Error 50 25417.7 508.353 Total 51 25444.7 Graph 3.2: FittedLine Plot the first 60 weeks vs. number of units sold for item 12 RegressionAnalysis: 12 versus Weeks The regression equation is Y = 76.18 - 0.1012 Weeks
  • 26. 26 S = 33.5337 R-Sq = 0.3% R-Sq(adj) = 0.0% Analysis of Variance Source DF SS MS F P Regression 1 154.6 154.61 0.14 0.712 Error 50 56225.4 1124.51 Total 51 56380.1 Graph 3.3: FittedLine Plot the first 60 weeks vs. number of units sold for item 3 RegressionAnalysis: 3 versus Weeks The regression equation is Y = 81.59 - 0.4956 Weeks S = 38.3690 R-Sq = 4.8% R-Sq(adj) = 2.9%
  • 27. 27 Analysis of Variance Source DF SS MS F P Regression 1 3711.2 3711.19 2.52 0.119 Error 50 73608.9 1472.18 Total 51 77320.1 The fitted line charts and the regression analysis was performed for the top 3 items sold in the first 60 weeks (butter, snacks and eggs). The regression analysis performed in minitab provides us the ANOVA table, the regression equation and the R-sq and Adj-R-sq 4) For question #4 our group predicted the total weekly sales quantity for weeks 61 through 80. For this question we used the solution from question #3 and copy the equation for item 17, 12 and 3. Since we have the equation as you can observe from above we replace the numbers 61 through 80 in “Weeks” to predict the sales for each week. For this question Microsoft Excel was used. TABLE 4.1: The forecasted sales from week 61 to 80 for Item 17 Weeks Formula for item 17 Answer 61 Y = 85.11 + 0.0423 Weeks 87.6903 62 Y = 85.11 + 0.0423 Weeks 87.7326 63 Y = 85.11 + 0.0423 Weeks 87.7749 64 Y = 85.11 + 0.0423 Weeks 87.8172 65 Y = 85.11 + 0.0423 Weeks 87.8595 66 Y = 85.11 + 0.0423 Weeks 87.9018 67 Y = 85.11 + 0.0423 Weeks 87.9441 68 Y = 85.11 + 0.0423 Weeks 87.9864
  • 28. 28 69 Y = 85.11 + 0.0423 Weeks 88.0287 70 Y = 85.11 + 0.0423 Weeks 88.071 71 Y = 85.11 + 0.0423 Weeks 88.1133 72 Y = 85.11 + 0.0423 Weeks 88.1556 73 Y = 85.11 + 0.0423 Weeks 88.1979 74 Y = 85.11 + 0.0423 Weeks 88.2402 75 Y = 85.11 + 0.0423 Weeks 88.2825 76 Y = 85.11 + 0.0423 Weeks 88.3248 77 Y = 85.11 + 0.0423 Weeks 88.3671 78 Y = 85.11 + 0.0423 Weeks 88.4094 79 Y = 85.11 + 0.0423 Weeks 88.4517 80 Y = 85.11 + 0.0423 Weeks 88.494 TABLE 4.12: The forecasted sales from week 61 to 80 for Item 12 Weeks Formula for item 12 Answer 61 Y = 76.18 - 0.1012 Weeks 70.0068 62 Y = 76.18 - 0.1012 Weeks 69.9056 63 Y = 76.18 - 0.1012 Weeks 69.8044 64 Y = 76.18 - 0.1012 Weeks 69.7032 65 Y = 76.18 - 0.1012 Weeks 69.602 66 Y = 76.18 - 0.1012 Weeks 69.5008 67 Y = 76.18 - 0.1012 Weeks 69.3996
  • 29. 29 68 Y = 76.18 - 0.1012 Weeks 69.2984 69 Y = 76.18 - 0.1012 Weeks 69.1972 70 Y = 76.18 - 0.1012 Weeks 69.096 71 Y = 76.18 - 0.1012 Weeks 68.9948 72 Y = 76.18 - 0.1012 Weeks 68.8936 73 Y = 76.18 - 0.1012 Weeks 68.7924 74 Y = 76.18 - 0.1012 Weeks 68.6912 75 Y = 76.18 - 0.1012 Weeks 68.59 76 Y = 76.18 - 0.1012 Weeks 68.4888 77 Y = 76.18 - 0.1012 Weeks 68.3876 78 Y = 76.18 - 0.1012 Weeks 68.2864 79 Y = 76.18 - 0.1012 Weeks 68.1852 80 Y = 76.18 - 0.1012 Weeks 68.084 TABLE 4.13: The forecasted sales from week 61 to 80 for Item 3 Weeks Formula for item 3 Answer 61 Y = 81.59 - 0.4956 Weeks 51.3584 62 Y = 81.59 - 0.4956 Weeks 50.8628 63 Y = 81.59 - 0.4956 Weeks 50.3672 64 Y = 81.59 - 0.4956 Weeks 49.8716 65 Y = 81.59 - 0.4956 Weeks 49.376 66 Y = 81.59 - 0.4956 Weeks 48.8804
  • 30. 30 67 Y = 81.59 - 0.4956 Weeks 48.3848 68 Y = 81.59 - 0.4956 Weeks 47.8892 69 Y = 81.59 - 0.4956 Weeks 47.3936 70 Y = 81.59 - 0.4956 Weeks 46.898 71 Y = 81.59 - 0.4956 Weeks 46.4024 72 Y = 81.59 - 0.4956 Weeks 45.9068 73 Y = 81.59 - 0.4956 Weeks 45.4112 74 Y = 81.59 - 0.4956 Weeks 44.9156 75 Y = 81.59 - 0.4956 Weeks 44.42 76 Y = 81.59 - 0.4956 Weeks 43.9244 77 Y = 81.59 - 0.4956 Weeks 43.4288 78 Y = 81.59 - 0.4956 Weeks 42.9332 79 Y = 81.59 - 0.4956 Weeks 42.4376 80 Y = 81.59 - 0.4956 Weeks 41.942 The calculations were done using Microsoft Excel and the regression equations were performed by question #3 (Minitab). Discussion from regression: A Ranking was made from the top 10 items bought in the retail store the total volume of the item 17 which is the most sold item was 7776 units and the lowest item 11 sold 2365 in two years. A time series was performed for each item type and it was observed that from week 44-49 and 52-55 there were no units bought in the retail store, this probability means that store was closed during those weeks or it was in maintenance.
  • 31. 31 R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression is measured in percentage between 0-100. From the data observe the item with a greater R^2 is item 2 which is around %12.80 and item 5 which is %10.9 usually is better to take a look at the adjusted R-squared which has less errors and is more exact. However from this data we can conclude that most of this items have low regression items which is not good for the data and because of this the fitted values will be off.. The top 3 items bought for week 1-60 are 17,12 and 3. The quantity of units bought are 4487, 3815 and 3526. The fitted value did not show any trend in the plot. After obtaining the regression equation we predicted the total weekly sales quantity for weeks 61-80 for each item and we then obtained from the results that the top item sold was 17, then 12 and 3 respectively. Question 5: K-means clustering was performed for three sets of items. As mentioned above in the methods section, eggs and ice cream were the first set of items chosen. Using excel, the data was clustered separately into three, four, five and six clusters and the centroid of each cluster was then found. The resulting scatter plot of eggs and ice cream with k=3 is shown below in figure 5.1 . As shown below, cluster 1 represents the customers that purchase a moderate amount of these items. Cluster 2 represents the customers that purchase low volumes of eggs and ice cream and cluster 3 contains the customers that purchase these items in abundance.
  • 32. 32 Figure 5.1: K means clustering for Eggs vs. Ice cream for K=3 Figure 5.2: K means clustering for Eggs vs. Ice cream for K=4
  • 33. 33 Figure 5.2 shows the K means clustering for eggs vs ice cream for k=4. Cluster number 1 represents the customers that bought a moderate amount of eggs and ice cream. Cluster 2 represents a smaller cluster of customers that bought a more moderate amount of ice cream but a much larger amount of eggs. Cluster 3 represents customers that bought a minimum of each item and cluster 4 represents the customers that bought a more moderate amount of eggs but a very high amount of ice cream. Figure 5.3: K means clustering for Eggs vs. Ice cream for k=5 Figure 5.4: K means clustering for Eggs vs. Ice cream for K=6 The same analysis was repeated for the Pizza vs Snacks example as well and the results are shown below in figure 5.5. Pizza and snacks were chosen together because they are very similar and they are two items that often go together as for parties etc. Cluster 1 as shown in blue represents the customers that purchased a low volume of pizza and snacks. Cluster 2 as shown in red represents the customers that purchased a high volume of snacks and pizza, whereas cluster 3 represents the customers that purchased a moderate amount of these items. Pizza is represented on the x axis and snacks on the y axis.
  • 34. 34 Figure 5.5: K Means clustering for Pizza vs. Snacks for k=3 Figure 5.6: K means clustering for BBQ vs Cereal for K=3 BBQ and cereal were chosen as two random items that would not normally go together. Cluster 1 in blue represents customers who bought a low volume of both items. Cluster 2 in red represents customers who bought a low volume of BBQ (x-axis) and a higher volume of cereal
  • 35. 35 (located on the y-axis). Cluster 3 in green represents customers who bought a moderate amount of both items. Question 6: Out of the 494 families in the data and the ones that responded, it was found that only 9 families did not have any TV’s in their homes. This is a mere 2% of the data and is important for advertisement purposes. If 98% of your customers have a television in their homes, the chance of catching their attention through a television advertisements would most likely be the greatest. The average income of the data was equivalent to 6 which represents the $25,000-$35,000 range. This range is lower on the income scale and it is an important thing to acknowledge from the store owner’s perspective. In order to be successful, it is important to know information about your customers and the financial background they are coming from. Figure 6.1 below is a bar graph that shows the distribution of the customers in the database according to their income number which is found in the data dictionary, as you can see the majority of the customers fall in between income range 6 and 7. The countifs function was used on the demographics data in Excel to calculate the number of households with dogs and cats. It was found that 169 households had one or more dogs and 79 had one or more cats. Those totals together are only equal to about half of the families in the database which can be important to store owners because maybe money can be saved by cutting back on animal sections in the store etc. It was also found that 488 of the families had at least one child, that is merely 99% of the customers. Every household that had a cat or dog also had at least one child. This is important for many purposes such as advertising methods, sales, and coupon distribution. As far as magazine subscriptions, it was found that the highest amount of subscriptions was 125 for Reader’s Digest and the smallest was 12 for Cosmopolitan, followed by 13 for Glamour. 
Little can be concluded from these figures beyond the fact that magazine advertisements would not be the best option for the store, because too few of the customers have subscriptions.
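The COUNTIFS-style counts and the percentages quoted above can be sketched in a few lines of Python. This is only an illustration: the household records and field names (`tvs`, `dogs`, `cats`, `children`) are made up and are not the project data. Only the counts 9, 488 and 494 come from the report.

```python
# Hypothetical household records; the field names are assumptions, not the data dictionary.
households = [
    {"id": 1, "tvs": 2, "dogs": 1, "cats": 0, "children": 2},
    {"id": 2, "tvs": 0, "dogs": 0, "cats": 0, "children": 1},
    {"id": 3, "tvs": 1, "dogs": 0, "cats": 2, "children": 0},
    {"id": 4, "tvs": 3, "dogs": 2, "cats": 1, "children": 3},
]


def count_if(rows, predicate):
    """Count the rows that satisfy a condition, similar in spirit to COUNTIFS."""
    return sum(1 for row in rows if predicate(row))


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


no_tv = count_if(households, lambda h: h["tvs"] == 0)
with_dog = count_if(households, lambda h: h["dogs"] >= 1)
with_cat = count_if(households, lambda h: h["cats"] >= 1)
# A household with both a dog and a cat is counted once here,
# but twice if the dog and cat totals are simply added together.
with_pet = count_if(households, lambda h: h["dogs"] >= 1 or h["cats"] >= 1)

# Percentages recomputed from the counts quoted in the report.
no_tv_share = share(9, 494)
child_share = share(488, 494)
```

In the toy data, `with_dog + with_cat` is larger than `with_pet` because one household owns both kinds of pet, so a combined "households with pets" figure is safer to count with a single OR condition.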
  • 36. 36 Figure 6.1 Distribution of customers by income range
  • 37. 37 Table of Contents
1.) Introduction .......... 37
2) Summary .......... 37
3.) Methods .......... 38
2.1) Description of Data .......... 38
2.2) Database and Table Schema .......... 38
2.3) Queries .......... 42
3.) Results .......... 42
3.1) Description .......... 42-60
4.) Discussion .......... 61
4.1) Gender .......... 62
4.2) Income .......... 63
4.4) Weeks of the year .......... 63
5.) Conclusion .......... 66
6.) Appendix: Query Results and Graphs .......... 66-71
1 Introduction
By using Microsoft Access, the group will provide an essential data analysis of the store’s transaction and demographics data, in order to make effective recommendations about areas for additional research and about marketing and management strategies. With this new database the retail store will be able to make better decisions for the growth of its business.
  • 38. 38 The group will analyze the data from the transaction and demographics tables and draw conclusions by identifying correlated items in order to make predictions. The trends will be shown in graphics later on. Finally, the group will give the retail store recommendations for improvement.
Microsoft Access is the tool the group will use to analyze the data. The first objective is to create the tables so that relationships can be defined between them. The demographics and transaction data will be broken down into four tables:
∙ Demographics
∙ Transaction
∙ ItemType
∙ CouponID
Each table has a primary key, and foreign keys link the tables to each other. There is no redundancy in any of these tables; the group put time and effort into organizing the tables so that they could be linked by relationships. From these relationships, queries will be performed to get a sense of the strategy the company should apply to increase sales. The purpose of the queries is to present the data in an organized form so that reports can be run for the retail store efficiently. Twenty queries will be generated in Microsoft Access. After the data is organized, the most influential categories will be exported to Microsoft Excel to highlight the key components and trends in the data. With these results, it will be easy for the team to reach conclusions and recommendations for the retail store. Given the variety and amount of data, a number of analysis and visualization techniques were used to generate conclusions. The goal is to take a wide variety of otherwise useless data and transform it into meaningful information.
By carefully pairing queries with visualization techniques, it will be easier to draw conclusions, since the results can be connected to each other.
Summary
The objective of this assignment was to analyze the transaction and customer demographic data given in the project description, to draw conclusions by performing queries on that data, and to give recommendations and predict future trends. Microsoft Access and Microsoft Excel were used to visualize and evaluate the data. Using these methods, the most useful data was considered, and these observations were used to formulate strategies to improve the store’s business. In the demographics table, many attributes were examined in detail, such as income, gender, work hours, and ethnicity. A relationship was made with the transaction data in order to understand the patterns. It was found that income, work hours, gender and week of the year are key attributes that have a large impact on purchases from the store.
  • 39. 39 2. Methods
2.1 Description of Data
Two sets of data are given for this report: the transaction data and the demographics data. In the transaction data, the key field is the customer ID, which allows the data analyst to view the transactions made by each customer, including the week and day of each purchase. The customer ID listed in the transaction table also appears in the demographics table. Through this link, the retail store can obtain personal information about a specific customer, such as ethnicity, family size, income and education level, among others. The analyst can use the demographics data to analyze the customer base and the profile of each customer, which gives the manager better knowledge of the group coming to the retail store, what items they are buying, and on what dates the transactions are made. With this information, the analyst can relate customers’ behaviors and trends to their personal information. From this, a variety of conclusions can be drawn.
2.2 Access Database and Table Schema
Tables were built from the transaction and demographics data received from the retail store for this project. The team focused on efficiency while creating these tables by using the minimum number of tables and attributes. The resulting tables are: Transaction, Demographics, ItemType and CouponID. Each has its own primary key. Once the relationships among the tables were defined, it was easy to create queries and reports, and the results are very helpful for analyzing trends and relationships.
1.) Demographics table: This table captures all of the customer demographic data provided. The primary key is the unique customer ID number. The table also contains information about the customer’s family, income, education level and magazine subscriptions.
  • 40. 40 Figure 1: Demographics Table 2.) Transaction table: The transaction table uses the unique transaction ID as the primary key. It organizes the information contained in each individual customer transaction. Information contained in this table includes the type and number of the item being purchased, the day and week of the purchase and coupon usage.
  • 41. 41 Figure 2: Transaction Table
3.) Coupon table: The coupon table uses a unique Coupon ID as the primary key. This table contains information about the coupons available to the customers and their dollar value.
Figure 3: Coupon Table
4.) Item table: The item table uses the unique item type as the primary key. The data dictionary lists 24 item types, each with its own unique value.
Figure 4: Item Table
After the database tables were created, the relationships among them were defined. Referential integrity was enforced on these relationships, as shown below in Figure 4. The primary keys of the Demographics, Coupon and Item tables are included in the Transaction table as foreign keys.
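The four-table layout and the effect of enforcing referential integrity can be sketched with Python’s built-in `sqlite3` module. This is a minimal sketch under stated assumptions, not the Access database itself: the column names are simplified stand-ins, most attributes are left out, and SQLite’s `PRAGMA foreign_keys` plays the role of Access’s "enforce referential integrity" option.

```python
import sqlite3

# Simplified stand-in schema; column names are assumptions, not the Access field names.
SCHEMA = """
CREATE TABLE Demographics (CustomerID INTEGER PRIMARY KEY, Income INTEGER);
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE CouponID (CouponID INTEGER PRIMARY KEY, CouponValue REAL);
CREATE TABLE "Transaction" (
    TransactionID INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Demographics(CustomerID),
    ItemType INTEGER NOT NULL REFERENCES ItemType(ItemType),
    CouponID INTEGER NOT NULL REFERENCES CouponID(CouponID),
    Week INTEGER,
    Day INTEGER,
    UnitsBought INTEGER
);
"""


def build_db():
    """Create an in-memory database with one row in each parent table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Demographics VALUES (1, 6)")
    conn.execute("INSERT INTO ItemType VALUES (17, 'Snacks')")
    conn.execute("INSERT INTO CouponID VALUES (0, 0.0)")
    return conn


def try_insert(conn, row):
    """Insert a transaction row; return False if a foreign key check rejects it."""
    try:
        conn.execute('INSERT INTO "Transaction" VALUES (?, ?, ?, ?, ?, ?, ?)', row)
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_db()
ok = try_insert(conn, (1, 1, 17, 0, 10, 5, 2))          # customer 1 exists
rejected = try_insert(conn, (2, 999, 17, 0, 10, 5, 1))  # customer 999 does not exist
```

The second insert fails because customer 999 has no row in the parent table, which is the kind of orphaned transaction that enforced referential integrity is meant to prevent.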
  • 42. 42 Figure 4: Database Schema 2.3 Queries
  • 43. 43 A total of 20 queries were executed in Microsoft Access to analyze the relationships among the various data categories. A wide range of data categories was considered, and the most important ones were included in the queries. For example, it was decided that customers’ education level was not a key area of concern, since the items listed in the database are generic, widely used by the general public, and most likely not affected by education level. The results of these queries, along with the SQL code produced, are displayed below.
3.0 Results
1.) This query displays the sum of the units bought for each item in descending order.
SQL code:
    SELECT ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN [Transaction] ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: The first query, shown below, calculates the total number of units sold per item. The results show that snacks are the most sold item and pills are the least sold item. This information matters to store management: the inventory of these products must be large enough to satisfy customer demand, and the placement of popular products in the store can lead customers to buy other products.
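The join-and-aggregate pattern of query 1 can be tried outside Access with Python’s built-in `sqlite3` module. The sketch below is only an illustration: the table `Txn`, the simplified column names and the handful of rows are assumptions made up for the example, not the store’s data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy tables; "Txn" stands in for the Transaction table and the rows are invented.
conn.executescript("""
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Txn (ItemType INTEGER, UnitsBought INTEGER);
INSERT INTO ItemType VALUES (13, 'Ice cream'), (16, 'Pizza'), (17, 'Snacks');
INSERT INTO Txn VALUES (17, 3), (17, 2), (16, 1), (13, 2), (16, 2), (17, 4);
""")

# Total units per item, largest first (same shape as query 1).
units_per_item = conn.execute("""
    SELECT i.Description, SUM(t.UnitsBought) AS TotalUnits
    FROM ItemType AS i
    INNER JOIN Txn AS t ON i.ItemType = t.ItemType
    GROUP BY i.Description
    ORDER BY TotalUnits DESC
""").fetchall()

most_sold = units_per_item[0][0]    # first row of the descending list
least_sold = units_per_item[-1][0]  # last row of the descending list
```

Because the result is already sorted, the most and least sold items are simply the first and last rows.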
  • 44. 44 Figure 5
2.) This query displays the total units bought on each day of the week.
SQL code:
    SELECT Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Transaction.Day
    FROM [Transaction]
    GROUP BY Transaction.Day;
Description: The second query, shown below, gives the total number of items bought on each day. The most items were purchased on days 5 and 6, which represent Friday and Saturday. Using the COUNTIFS function in Excel, it was found that only 2 of the 495 given families had both the male and female parent unemployed. This means that about 99% of the families in the data have at least one employed member, which makes the weekend a more convenient time for activities such as shopping. The data indicates the store will be busiest on Fridays and should therefore schedule extra employees for stocking shelves and staffing checkout lines to keep customers satisfied.
  • 45. 45 Figure 6
3.) This query counts the number of families of size 4, 5 and 6 with cable TV.
SQL code:
    SELECT Count(Demographics.[Cable TV]) AS [CountOfCable TV], Demographics.[Family Size]
    FROM Demographics
    GROUP BY Demographics.[Family Size], Demographics.[Cable TV]
    HAVING (((Demographics.[Cable TV])=1) AND ((Demographics.[Family Size])>=4));
Figure 7
Description: This query provides a count of families of size 4, 5 and 6 that have cable TV. The results show that 16 families of 6 or more people, 24 families of 5 people and 59 families of 4 people have cable TV. Because these totals cover only the larger families, the households most likely include people of a variety of ages. This information could be useful to television marketers who want to reach a specific age range within these larger families.
4.) This query displays the days of the week in descending order of how much part-time women (working less than 35 hours a week) shop.
SQL code:
  • 46. 46
    SELECT Transaction.Day, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.[Female Work Hours]
    FROM CouponID INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON CouponID.CouponID = Transaction.CouponID
    GROUP BY Transaction.Day, Demographics.[Female Work Hours]
    HAVING (((Demographics.[Female Work Hours])=3))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: The fourth query, shown below, gives the days of the week on which women who work part time shop the most. The results show that most of them shop towards the end of the week, with the most shopping done on Saturdays. This is similar to the results shown above in Figure 6, where the most items are sold on Friday and Saturday. These results again point to the importance of having extra staff on the weekends to keep customers satisfied and to get them in and out of the store in a timely fashion.
Figure 8
5.) This query displays the days of the week in descending order of how much part-time men (working less than 35 hours a week) shop.
SQL code:
    SELECT Transaction.Day, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.[Male Work Hours]
    FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]
    GROUP BY Transaction.Day, Demographics.[Male Work Hours]
  • 47. 47
    HAVING (((Demographics.[Male Work Hours])=3))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: The fifth query, shown below, gives the days of the week on which men who work part time shop the most. The male work hours column that is equal to 2 refers back to the data dictionary, which shows a value of 2 equaling work hours of less than 35 hours a week. The results show that most part-time working men shop towards the end of the week, on days 5 and 6. In this regard, the data shows the same pattern as for part-time working women. The analysis also showed a major difference between part-time working men and women. As shown below in Figure X, men bought the most on day 6, with 366 units, compared with 1,750 units bought by women on day 6. This information is important to the store owner because it describes the shopping crowd to expect.
Figure 9
6.) This query displays the top 5 customer IDs of the people who purchased the most ice cream.
SQL code:
    SELECT TOP 5 Transaction.[Custome ID], Transaction.[Item Type], Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM [Transaction]
    GROUP BY Transaction.[Custome ID], Transaction.[Item Type]
    HAVING (((Transaction.[Item Type])=13))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
  • 48. 48 Description: The sixth query, shown below, lists the top 5 customers who bought the most of item 13, which is ice cream. The most ice cream bought by a single customer was 90 units. This information can be used to analyze the demographic information of these customers and to look for relationships among them that could increase sales in the future. The same evaluation can be run for any item, and it is also useful for distributing coupons to increase the chance that the best customers return to the store. The customers are in turn rewarded for their top purchases.
Figure 10
7.) This query displays the items in order from most to least purchased for customers with an income of $15,000-$20,000.
SQL code:
    SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Income
    HAVING (((Demographics.Income)=4))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
  • 49. 49 Figure 11 below shows the most purchased items by customers within the income range of $15,000-$20,000.
Description: Query 7 shows that the most purchased item by customers who make between $15,000 and $20,000 is eggs, followed by snacks and butter. Furthermore, since the total number of transactions for each item is fairly close to the total number of units purchased, these customers are likely to buy only one unit at a time. Encouraging “buying in bulk” by offering discounts when customers purchase more than one unit could increase the number of items bought per transaction.
8.) This query displays the number of items sold in order from highest to lowest for customers with an income greater than $75,000.
SQL Code:
    SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Income
    HAVING (((Demographics.Income)=11))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description:
  • 50. 50 Query number 8 shows that the most purchased item for customers with an income greater than $75,000 is cat food. This may be a reason to analyze other demographic information about the customers in this income range, such as number of pets. The next top selling items are snacks, cereal and crackers, which are common to all of the income ranges. This suggests that these common items are most likely immune to income differences and are purchased by a majority of the customers.
Figure 12 below shows the most purchased items by customers within the income range of greater than $75,000.
9.) This query displays the number of items sold in order from highest to lowest for customers with an income range of $55,000-$65,000.
SQL Code:
    SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
  • 51. 51
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Income
    HAVING (((Demographics.Income)=9))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: As shown in Figure 13 below, for customers with an income between $55,000 and $65,000, eggs were the top selling item, followed by snacks and butter. These are common, everyday grocery items, the same ones that were purchased very frequently by many other customer demographics, including lower income groups. This suggests that these items may be immune to income differences and are widely consumed by households in a wide range of financial situations.
Figure 14
10.) This query displays the day of the week that retired women shop the most
  • 52. 52 SQL Code:
    SELECT Transaction.Day, Count(Demographics.[Female Work Hours]) AS [CountOfFemale Work Hours], Demographics.[Female Work Hours]
    FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]
    GROUP BY Transaction.Day, Demographics.[Female Work Hours]
    HAVING (((Demographics.[Female Work Hours])=4))
    ORDER BY Count(Demographics.[Female Work Hours]) DESC;
Description: The results below show that the majority of retired women shop on Fridays and Saturdays. This is similar to the overall shopping pattern, in which the most items are sold at the store on Fridays and Saturdays.
11.) This query displays the days of the week that retired men shop the most.
SQL Code:
    SELECT Transaction.Day, Count(Demographics.[Male Work Hours]) AS [CountOfMale Work Hours], Demographics.[Male Work Hours]
    FROM Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]
    GROUP BY Transaction.Day, Demographics.[Male Work Hours]
    HAVING (((Demographics.[Male Work Hours])=4))
    ORDER BY Count(Demographics.[Male Work Hours]) DESC;
Description:
  • 53. 53 The table below shows the days on which retired men go shopping the most, in order from highest to lowest. The results show that the majority of retired men go shopping on Fridays, the same as retired women.
12.) This query shows how the sales for one particular item, pizza, vary from week to week. (Appendix 1)
SQL code:
    SELECT Count(Transaction.[Item Type]) AS [CountOfItem Type], Transaction.Week, Transaction.[Item Type]
    FROM [Transaction]
    GROUP BY Transaction.Week, Transaction.[Item Type]
    HAVING (((Transaction.[Item Type])=16))
    ORDER BY Count(Transaction.[Item Type]) DESC;
Explanation: This query shows how the sales of one particular item, in this case pizza, vary from week to week. The sales do not fluctuate much across the given weeks. Pizza sales are fairly steady, which means the store should keep a constant inventory of it.
13.) This query shows the average units bought per item type for families with children. An average is given for Children values 1-8, which the data dictionary indicates are all families with children. As expected, the numbers increase as the number of children in the family increases. The results are shown in Appendix I.
SQL Code:
    SELECT Demographics.Children, Transaction.[Item Type], Avg(Transaction.[Units Bought]) AS [AvgOfUnits Bought]
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON
  • 54. 54
    Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY Demographics.Children, Transaction.[Item Type]
    HAVING ((Not (Demographics.Children)=0));
14. This query shows the number of coupons, and of what origin, that each ethnicity used.
SQL Code:
    SELECT Demographics.Ethnicity, Count(CouponID.CouponOrigin) AS CountOfCouponOrigin, CouponID.CouponOrigin
    FROM ItemType INNER JOIN (Demographics INNER JOIN (CouponID INNER JOIN [Transaction] ON CouponID.CouponID = Transaction.CouponID) ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY Demographics.Ethnicity, CouponID.CouponOrigin;
Description:
  • 55. 55 This query shows the number of coupons that each ethnicity used. According to the data dictionary, an ethnicity value of 1 represents white customers and coupon origin 19 represents ActNow. For example, as the figure above shows, one person from the white ethnic group used a coupon from ActNow. This information could be very important to the store owner for targeting specific demographic groups with specific coupons. It also shows that Black customers (ethnicity value 2) are the largest ethnic group that used zero coupons, with 4271 people using none. This could be useful for targeting these customers with more coupons to try to increase the store’s sales.
15. Top items that people of ethnicity number 3 (Hispanic) bought
SQL Code:
    SELECT ItemType.Description, Demographics.Ethnicity, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Ethnicity
    HAVING (((Demographics.Ethnicity)=3))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: The results below show that the top purchased item by people of ethnicity number 3, which according to the data dictionary is Hispanic, was cereal, followed by snacks and detergents.
  • 56. 56 16. Top purchased items by ethnicity number 2 (Black)
SQL Code:
    SELECT ItemType.Description, Demographics.Ethnicity, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Ethnicity
    HAVING (((Demographics.Ethnicity)=2))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: The results below show that the top purchased item by people of ethnicity number 2, which according to the data dictionary is Black, was eggs, followed by snacks and butter.
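Queries 7, 8, 9, 15 and 16 share one shape and differ only in the demographic column and the code being filtered. As a hedged sketch of how that repetition could be folded into a single parameterized query, the Python example below uses the built-in `sqlite3` module. The table `Txn`, the simplified column names, the helper `top_items` and all of the rows are hypothetical and made up for illustration, so the toy output says nothing about the real data. The filter is placed in `WHERE` here, an alternative to the `HAVING` clause that the Access query designer generated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented rows; codes follow the report's convention (e.g. Income 4, Ethnicity 3).
conn.executescript("""
CREATE TABLE Demographics (CustomerID INTEGER PRIMARY KEY, Income INTEGER, Ethnicity INTEGER);
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Txn (CustomerID INTEGER, ItemType INTEGER, UnitsBought INTEGER);
INSERT INTO Demographics VALUES (1, 4, 2), (2, 11, 3), (3, 4, 3);
INSERT INTO ItemType VALUES (1, 'Eggs'), (2, 'Snacks'), (3, 'Cereal');
INSERT INTO Txn VALUES (1, 1, 5), (1, 2, 3), (2, 3, 4), (2, 2, 1),
                       (3, 3, 2), (3, 1, 1), (3, 2, 2);
""")

# Column names cannot be bound as parameters, so they are checked against a whitelist.
ALLOWED_COLUMNS = {"Income", "Ethnicity"}


def top_items(conn, column, code, limit=3):
    """Return the most purchased items for customers whose `column` equals `code`."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported filter column: {column}")
    sql = f"""
        SELECT i.Description, SUM(t.UnitsBought) AS TotalUnits
        FROM Txn AS t
        INNER JOIN Demographics AS d ON d.CustomerID = t.CustomerID
        INNER JOIN ItemType AS i ON i.ItemType = t.ItemType
        WHERE d.{column} = ?
        GROUP BY i.Description
        ORDER BY TotalUnits DESC, i.Description
        LIMIT ?
    """
    return conn.execute(sql, (code, limit)).fetchall()


by_ethnicity = top_items(conn, "Ethnicity", 3)
by_income = top_items(conn, "Income", 4)
```

With one helper, a new income range or ethnicity code becomes a change of argument rather than a new copy of the query.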
  • 57. 57 17.) This query displays the total number of items sold in the weeks of the summer.
SQL Code:
    SELECT ItemType.Description, Transaction.Week, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN [Transaction] ON ItemType.[Item Type] = Transaction.[Item Type]
    WHERE (((Transaction.Week)>=26 And (Transaction.Week)<=38)) OR (((Transaction.Week)>=78 And (Transaction.Week)<=90))
    GROUP BY ItemType.Description, Transaction.Week
    ORDER BY Transaction.Week, Sum(Transaction.[Units Bought]) DESC;
Description: This query shows that customers’ buying behaviors in the summer are very similar to the behaviors they exhibit throughout the rest of the year. The top selling item, as shown below, is snacks, which is common among all of the evaluations. The following top selling items are eggs,
  • 58. 58 crackers and ice cream. This information shows that it may be beneficial to the company to offer sales on these items throughout the weeks of summer. (The rest of the results are displayed in the Access file.)
18.) This query displays the number of items sold in order from highest to lowest for customers with an income range of $45,000-$55,000.
SQL Code:
    SELECT Count(ItemType.[Item Type]) AS [CountOfItem Type], ItemType.Description,
  • 59. 59
    Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought], Demographics.Income
    FROM ItemType INNER JOIN (Demographics INNER JOIN [Transaction] ON Demographics.[Customer ID] = Transaction.[Custome ID]) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY ItemType.Description, Demographics.Income
    HAVING (((Demographics.Income)=8))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: Query 18 shows that the top selling item for customers with an income between $45,000 and $55,000 was cat food, followed by snacks, butter, eggs and cereal.
19.) This query shows the number of each item that was sold per day for days 1-7.
  • 60. 60 SQL Code:
    SELECT Transaction.Day, ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN (CouponID INNER JOIN [Transaction] ON CouponID.CouponID = Transaction.CouponID) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY Transaction.Day, ItemType.Description;
Description: The results from query 19, shown below, give the sum of each item purchased on days 1-7. For example, as shown in Figure X below, 181 units of bacon were purchased on Monday compared with 193 units on Tuesday. The sum of units of ice cream bought increases significantly on days 5 and 6, which are Friday and Saturday. This makes sense because it is the weekend, when people are more likely to want ice cream for activities such as parties. This variation in purchases can be important to store owners for predicting sales and inventory.
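The day-by-item totals from a query like number 19 are easier to compare once they are pivoted into one row per item. The sketch below shows one way this could be done with Python’s built-in `sqlite3` module. It is an illustration only: the table `Txn`, the simplified columns, the helper `busiest_day` and the rows are assumptions, and the numbers are invented rather than taken from the store’s results.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Invented rows: (Day, ItemType, UnitsBought); days are coded 1-7 as in the report.
conn.executescript("""
CREATE TABLE ItemType (ItemType INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Txn (Day INTEGER, ItemType INTEGER, UnitsBought INTEGER);
INSERT INTO ItemType VALUES (1, 'Bacon'), (13, 'Ice cream');
INSERT INTO Txn VALUES
    (1, 1, 2), (2, 1, 3), (5, 1, 2),
    (1, 13, 1), (5, 13, 4), (6, 13, 5), (6, 13, 1);
""")

# Units per day and item (same grouping as query 19).
rows = conn.execute("""
    SELECT t.Day, i.Description, SUM(t.UnitsBought)
    FROM Txn AS t
    INNER JOIN ItemType AS i ON i.ItemType = t.ItemType
    GROUP BY t.Day, i.Description
""").fetchall()

# Pivot the long result into {item: {day: units}}.
crosstab = defaultdict(dict)
for day, item, units in rows:
    crosstab[item][day] = units


def busiest_day(item):
    """Return the day code with the highest units for one item."""
    by_day = crosstab[item]
    return max(by_day, key=by_day.get)
```

In this layout, a statement such as "ice cream peaks on days 5 and 6" can be checked by reading a single row of the crosstab.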
  • 62. 62 20. This query shows the top 5 items sold on day 6 (Saturday).
SQL code:
    SELECT Top 5 Transaction.Day, ItemType.Description, Sum(Transaction.[Units Bought]) AS [SumOfUnits Bought]
    FROM ItemType INNER JOIN (CouponID INNER JOIN [Transaction] ON CouponID.CouponID = Transaction.CouponID) ON ItemType.[Item Type] = Transaction.[Item Type]
    GROUP BY Transaction.Day, ItemType.Description
    HAVING (((Transaction.Day)=6))
    ORDER BY Sum(Transaction.[Units Bought]) DESC;
Description: As shown below in Figure X, the top 5 items bought on Saturdays are snacks, eggs, butter, cereal and cook. This information can be very important to store owners for predicting inventory and sales. It can also help in planning item sales that would draw additional customers to the store.
2.5 Discussion:
Once the database, queries and graphs were created, important trends were observed in the data. A summary of the data analysis is provided in 4 main categories:
∙ Gender
∙ Income
∙ Week of the year
∙ Work hours
Work hours:
The different work hours among the customers in the database were examined for potential buying patterns or trends. It was found that the most popular day to shop for part time men
  • 63. 63 and women was on Saturdays. Although part-time women bought far more than part-time men on that day (1,750 units compared with 366), both genders preferred to do their shopping on Saturdays. As shown in queries 10 and 11, the same calculation was done for retired males and females, and the results showed that the most popular day to shop for both genders was Friday. In this case, the number of Friday shoppers for both males and females was nearly double the number of Saturday shoppers, but Saturday was still the second busiest day. All of these results show that the stores are busiest on the weekends. It is very important for store managers to recognize this in order to have the proper staff on hand to keep customers happy and to have a fully stocked inventory to avoid running out of items.
Gender
Specific trends related to gender were discovered when analyzing the queries our group created, which led us to many important conclusions. The female customers in the data bought far more items than the men, making them the bigger overall shoppers in the database. From the query results we can compare male and female part-time workers and their shopping habits (units bought vs. hours of work). In this example, on Saturdays the women bought 1,750 items in total while the men bought just 366. This difference between genders is surprisingly large. The same pattern was not present in the shopping patterns of retired males and females. From the queries we can observe that retired men shopped about as much as retired women, and their total items bought were also very similar. In other words, retired men shop as much as retired women. We also found that the most frequent day for retired individuals to go shopping is Friday, followed by Saturday. This may be the result of sales that occur in the shop on these days.
The total number of items retired males bought on Fridays was 4232, compared with 4297 for retired females.
Income
After finalizing the queries and the results, the group came to the conclusion that customers’ income did in fact have a large influence on their buying behavior. First, we can see from query #7 that for customers with a low income, between $15,000 and $20,000, the most commonly bought item was eggs, followed by snacks and butter. The count of the items purchased was also very close to the sum of units bought, indicating that the majority of these customers buy each item one at a time. One possible way to increase these numbers is to distribute more coupons for these items to these specific demographic groups. Across most of the data, item number 17 (snacks) was a common favorite. However, the most purchased item within each income range differed slightly. From the lowest income group to the highest income group, the most frequently purchased items were as follows: snacks, cat food, eggs, cereal, and cat food. Although the
  • 64. 64 distinct reason for these observations is not clear, more research may be beneficial to the store. Knowing that these are the most frequently purchased items, management at the retail store could make some rearrangements to the store layout. For example, they could place these items towards the front of the store so that customers can find them easily, or they could place them in the back of the store so that customers have to walk past many different items on their way and potentially buy some items that they weren’t looking for. Overall, customer income does have an effect on customers’ overall buying behavior at the store, both in the coupons they use and the items they purchase.
Weeks of the year:
Another factor that was analyzed was how the sales of certain items changed from week to week. One of our queries (number 12) showed the sales from one week to another for item number 16, which was pizza. As the results in Appendix I show, regardless of season or week, pizza sales did not differ much from one week to another. There were one or two fluctuating weeks in the results, but this fluctuation could have been the result of a birthday party or another special occasion that happens once or twice a year. There were no major fluctuations in pizza sales from one week to another. The group also analyzed the top selling items in the weeks of the summer and found that snacks were the highest selling item, followed by eggs and crackers. These are common items that appeared as top sellers in many of our analyses. From one week to another there were slight changes in the units of these items purchased, but nothing drastic. This means that the store does not need to make any special preparations for the summer season and can spend the money it saves elsewhere.
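The statement that weekly sales "did not differ much" can be made more concrete with a simple spread measure. The sketch below is one possible approach, not the calculation used in this report: the weekly figures are invented, the 1.5 standard deviation threshold is an arbitrary assumption, and the summer weeks follow the ranges used in query 17 (weeks 26-38 and 78-90).

```python
from statistics import mean, pstdev

# Invented weekly unit counts for one item, keyed by week number.
weekly_units = {25: 10, 26: 12, 27: 11, 38: 9, 39: 10, 78: 14}


def is_summer_week(week):
    """Summer weeks as defined by the ranges in query 17."""
    return 26 <= week <= 38 or 78 <= week <= 90


def coefficient_of_variation(values):
    """Standard deviation divided by the mean; smaller means steadier sales."""
    values = list(values)
    return pstdev(values) / mean(values)


def unusual_weeks(units_by_week, threshold=1.5):
    """Weeks whose units differ from the mean by more than `threshold` standard deviations."""
    avg = mean(units_by_week.values())
    spread = pstdev(units_by_week.values())
    return sorted(w for w, u in units_by_week.items() if abs(u - avg) > threshold * spread)


summer_total = sum(u for w, u in weekly_units.items() if is_summer_week(w))
cv = coefficient_of_variation(weekly_units.values())
flagged = unusual_weeks(weekly_units)
```

A low coefficient of variation supports keeping a constant inventory, while any flagged weeks are candidates for a closer look, such as a holiday or a promotion.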
Figure A-1: Sum of Units bought by day of the week
This pie chart shows the percentage of items bought by day. From this chart we can clearly tell that the most sales occur on Fridays and Thursdays. This is very useful in determining the best day to restock the store. This information was obtained from a query created in Access showing the number of items bought on each day.
Figure A-2: Graphs showing Ice Cream and Soft Drinks during the summer
The two graphs shown here were created by finding the relation of items sold during the summer. I specifically selected Ice Cream and Soft Drinks as my two choices for comparison because they both seem to follow a similar pattern over the summer. For the most part they have similar spikes around the same time periods of the 1st and 2nd summers. This information can help stores prepare and plan their inventory for these time periods to optimize sales.
Figure A-3: Pie Chart of percentage of items sold by item type
This pie chart can also help the stores prepare their stock. They can use it to rank how popular an item type is. The more popular the item type, the more you should consider carrying a wider range of that type, for example by stocking different flavors or different brands.
Also, for the items that are not so popular, you can consider either advertising them better so that they sell more, or slowly decreasing the amount of these items you stock on the shelves.
Figure A-4: Graph of the Items most bought by people with Income 10
Stores should design their layout and inventory around the customers of the area. If a specific demographic makes up a large share of the customers, as in this figure, looking at the items they mostly buy can give insight into how to design your store around your customers. This idea can help increase revenue and gain loyal customers.
Figure A-5
These two pie charts show the percentage of items bought by males and females, respectively, who work over 35 hours. Using this information you can better understand your customers' needs and cater to them. For example, since most males who work over 35 hours shop on a Wednesday, you can make small changes that would make the shop a more suitable environment for those customers without completely losing the other customers' interest.
Conclusion:
This project served as great practice in taking a large quantity of meaningless data, analyzing it, and transforming it into meaningful information. The first part of the project was used to see how reliable this data is. The MAD and MSE calculations show that this data is in fact reliable. The preferred method would be exponential smoothing with alpha=0.2, because its MAD values are lower than those of the 3-point moving average for both chosen items. The TS was also reasonably low, so there seemed to be very little bias. Many relationships were established among the various categories of data in order to observe basic patterns and trends. The queries created in Access proved to be one of the most essential parts of this project. They allowed the analyst to view only specific and desirable attributes of the data and draw conclusions from these results. They also allowed the analyst to obtain very specific information about the data and to observe how specific attributes interact with one another. Another crucial takeaway from this project is the importance of data visualization. The use of various charts and graphics to display data makes patterns and trends a lot easier to interpret. In conclusion, data analysis as a whole can be complicated, but it is absolutely vital in order to draw valuable and correct conclusions from large volumes of data.
Appendix I:
Table A-1: 3 Point Moving average for cat food
Table A-2: Exponential smoothing for cat food (alpha=0.2)
Figure A-6: Graph Exponential Smoothing for cat food (alpha=0.2)
Table A-3: 3 Point Moving average for cereal
Table A-4: Exponential smoothing for cereal (alpha=0.2)
Figure A-7: Graph Exponential Smoothing for cereal (alpha=0.2)
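The tables and graphs listed above are not reproduced in this transcript. As a minimal sketch of how 3-point moving average and exponential smoothing forecasts, and the MAD, MSE and tracking signal (TS) cited in the conclusion, could be computed, the Python below uses the common textbook definitions. Seeding the first smoothed forecast with the first actual value, the function names, and the demand numbers are assumptions for illustration; they are not the project's data or necessarily the exact spreadsheet setup the group used.

```python
def moving_average_forecasts(actuals, window=3):
    """Forecast for period t is the mean of the previous `window` actuals.
    Periods without enough history get None."""
    return [
        sum(actuals[t - window:t]) / window if t >= window else None
        for t in range(len(actuals))
    ]


def exponential_smoothing_forecasts(actuals, alpha=0.2):
    """F[t+1] = alpha * A[t] + (1 - alpha) * F[t].
    Assumption: the first forecast is seeded with the first actual value."""
    forecasts = [actuals[0]]
    for actual in actuals[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


def error_metrics(actuals, forecasts, start):
    """Return (MAD, MSE, tracking signal) over periods start..end (0-based).
    Error is actual minus forecast; TS is the sum of errors divided by MAD."""
    errors = [a - f for a, f in zip(actuals[start:], forecasts[start:])]
    mad = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    ts = sum(errors) / mad if mad else 0.0
    return mad, mse, ts


# Made-up weekly demand, purely illustrative.
demand = [20, 24, 19, 23, 25, 21, 22, 26]
ma = moving_average_forecasts(demand)
es = exponential_smoothing_forecasts(demand, alpha=0.2)
# Compare both methods over the same periods (the first 3 have no MA forecast).
print("MA  (MAD, MSE, TS):", error_metrics(demand, ma, start=3))
print("ES  (MAD, MSE, TS):", error_metrics(demand, es, start=3))
```

Evaluating both methods over the same window keeps the MAD comparison fair, and a tracking signal close to zero suggests the forecast is not consistently too high or too low.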
Full query results are located in the Access file.
Roles:
Wilson: Moving Average and Exponential Smoothing, query graphs
Jessica: K-means graphs and question 6 (open-ended question), queries and descriptions
Aaron: Regression, queries and descriptions