Math 381 Project Two
Group 9
Alex Forney
Keren Lai
Gerard Trimberger
Xinyu Zhou
December 7, 2016
1 Introduction
When we buy products in a grocery store, we find that the things we want to buy are usually not
located near each other, and it is common to find that one part of the store is crowded while others
have few customers. This may be because store managers or other higher-ups plan the store layout
while taking into consideration the similarities of products’ sales. They may place items often
purchased together in locations farther apart in the store, so customers need to stay in
the store longer, resulting in these customers seeing more items and potentially purchasing them.
Another added benefit may be the reduction of congestion in departments with popular items. In
our project, we seek to find the relationships between different departments of a grocery store using
multidimensional scaling (MDS). We will plot the activity of 10 different departments (Packaged
Produce, Deli, Bakery, Dairy, Meat, Dry goods, Fresh Produce, Coffee shop, Seafood, and Sushi)
in order to show the similarities and differences between them. The result of our study may provide
insight into the planning of grocery stores and/or customer habits.
2 Background
2.1 Idea
We began the brainstorming process by each formulating a list of topics that we were interested in,
both mathematically and socially. We also created a list of our individual skill sets and experience
that we felt was relevant to the project. We then spent time reading through each of our responses
to get an idea of what type of project we could all find interesting. We all agreed that we wanted
to do something related to a common situation that most people experience on a daily basis.
It is always more interesting if people can directly relate to the project rather than working on
something that they do not have personal experience with. Our second criterion was that we each
wanted to do something related to probabilities or Monte Carlo simulation. Keren and Xinyu
are ACMS/Economics double majors so they were both interested in the processes involved in
economic development.
Our first formulation of the proposal involved comparing the total sales and overall market
share of different car manufacturers. We wanted to build a Markov chain of different manufacturer
states and how they relate, in order to predict how the current market share distribution would
change over time. Ultimately, we felt that we would be unable to obtain the necessary data for
an interesting Markov chain, i.e. the number or probability of car owners moving from one
manufacturer to another. Other outside factors, such as owning multiple cars, created additional
problems that we eventually felt would hinder our progress.
At this point, we decided to switch gears. While keeping the original overarching goals in
mind, specifically a publicly relatable problem and something probability/simulation based, we
formulated a new proposal that involved simulating a grocery store checkout process. We planned
on contacting a local grocery store for real-life customer and item distribution data. Gerard went
into his local QFC on Friday, November 18th. He asked to speak with the manager of the store,
and presented the situation to her, asking specifically if we could obtain some data for customer
checkout times, their number of items, and what types of register (Normal, Express, or Self-
Checkout) that they utilized to make their purchase. The manager suggested that he call back
on Saturday (11/19) when the bookkeeper was present, because the bookkeeper is the one with
access to that type of information. When Gerard called back on 11/19 he was informed that the
bookkeeper had called in sick, and that he would either have to call back on Monday or to try a
different store. The manager provided a phone number to another store in the region that had
their bookkeepers present on 11/19. Gerard followed through with this lead and presented the
situation to the other store manager. This new store manager did not seem to comprehend the
issue and advised Gerard to contact QFC Corporate for more information. Gerard then called the
Corporate phone number provided and left a message on their answering machine informing them
that we would like to talk as soon as possible. Gerard waited until Monday morning (11/21), and
when he had not heard back from corporate, decided to contact the manager at the local QFC
once again. This time he was able to speak directly to the bookkeeper of the store, and confirmed
that there was customer data available in the computer system but that it may not be exactly
what we were looking for. He provided his name and number and was told that if he did not hear
back from the store later that day, to come in on Tuesday (11/22). Gerard did not receive a call
during this time, so on Tuesday morning around 10 am he went in to the local QFC in person to
observe the situation firsthand.
Upon speaking to the manager, she led Gerard into the backroom of the store and introduced
him to her bookkeeper. From this point, Gerard worked directly with the bookkeeper to obtain
data that he felt could be useful to our project. Gerard was able to obtain an hour by hour
breakdown of the activity (i.e. item count, sales amount, and customer count) of each of the 10
departments of the store (packaged produce, deli, bakery, dairy, meat, dry goods, fresh produce,
seafood, coffee, and sushi). Unfortunately, this was not the data that we had originally intended
on receiving for our grocery store checkout simulation, but that did not mean that it wasn’t useful.
We met up as a team and discussed how we wanted to move forward with this new information.
We brainstormed a proposal for a new project that we could formulate, based on the data that we
were provided. We settled on creating an MDS model comparing the different departments on an
hour by hour basis, based on their normalized distributions for each indicator. The details of the
model are explained below.
2.2 Similar Models
Multidimensional scaling (MDS) is a set of data analysis techniques that display the structure of
distance-like data as a geometrical picture. Evolving from the work of Richardson [1], Torgerson
proposed the first MDS method and coined the term [2]. MDS is now a general analysis technique
used in a wide variety of fields, such as marketing, sociology, and economics. In 1984, Young and
Hamer published a book on the theory and applications of MDS, in which they presented applications
of MDS in marketing [3].
J.A. Tenreiro Machado and Maria Eugenia Mata from Portugal analyzed world economic
variables using multidimensional scaling [4] in a manner similar to ours. Tenreiro and Mata analyzed
the evolution of GDP per capita [5], international trade openness, life expectancy, and tertiary
education enrollment in 14 countries from 1977 to 2012 [6] using MDS. In their study,
the objects are country economies characterized by a given set of variables evaluated
over a given time period. They calculated the distance between the i-th and j-th objects by taking
the differences of the economic variables over a period of several years. They plotted the countries
on a graph and distinguished them by multiple aspects, such as human welfare, quality of life, and
growth rate. Tenreiro and Mata concluded from the graphs that their MDS analysis of 14 countries
over the last 36 years shows that a large gap separates the Asian partners from converging to
the developed North American and Western European countries, in terms of potential welfare,
economic development, and social welfare.
The model Tenreiro and Mata use is similar to ours. In our project, the objects are
departments in a grocery store. They studied the differences and similarities between country
economies across years, while we study the differences and similarities between departments across
the hours of a day. In Tenreiro and Mata’s research, countries that developed at the same time are
close together on the graphs; in our study, the store departments that are busy at the same time are
close together on the graphs. However, our dataset is much smaller than theirs. We compare
departments using the number of items sold, the number of customers, and the total sales amount
in a given time period, whereas Tenreiro and Mata’s data has more dimensions: GDP per capita,
economic openness, life expectancy, tertiary education, etc. Our project also studies the
similarity of busyness from another angle: the percentage of each department’s sales in a given hour.
2.3 Similar Problems
The objective of our project is to help a grocery store owner plan the layout of the different
sections of the store and increase the store’s sales by finding the interrelationships in busyness
between products from different departments.
The problem of how to lay out a grocery store to maximize the purchases of the average customer
is discussed in many works, from the perspectives of both merchandising and mathematics. As mentioned
in one article, grab-and-go items such as bottled water and snacks should be placed near the
entrance; the deli and coffee bar should be placed in one of the front corners to attract hungry
customers; and cooking ingredients and canned goods should be placed in the center aisles to draw
customers deeper into the store and past nonessential items [8]. There are also many economists
and mathematicians working on similar problems. In the paper written by Boros, P., Fehér, O.,
and Lakner, Z., the traveling salesman problem (TSP) was used to maximize the shortest walking distance
for each customer under different arrangements of the departments in the store [9]. The results
showed that the total walking distances of customers increased in the proposed new layout [9]. Chen
Li from the University of Pittsburgh modeled the department allocation design problem as a multiple
knapsack problem and optimized the adjacency preferences of departments to achieve the maximum
possible exposure of items in the store and produce an effective layout [10]. A similar optimization was
used in the paper by Elif Ozgormus from Auburn University [11]. To assess the revenue of a store
layout, she used stochastic simulation and classified departments into groups from which customers
often purchase items concurrently [11]. Subject to constraints on space, unit revenue production, and
department adjacency in the store, she optimized impulse purchases and customer satisfaction
to arrive at a desired layout [11].
All three papers have basic objectives similar to ours. The paper by Boros et al. aimed to
maximize the total walking distance of each customer and thus promote the store’s sales [9]. Li’s
paper also focused on profit maximization, but with consideration of the exposure of items and the
adjacencies between departments [10]. He was the first to incorporate aisle structure, department
allocation, and departmental layout together into a comprehensive study [10]. The paper by
Ozgormus took revenue and adjacency into consideration and developed a model specifically for
grocery stores with the objectives of maximizing revenue and adjacency satisfaction [11]. In our
paper, we simply focus on the busyness of the different departments and use multidimensional scaling
to model the similarities between departments, thereby providing evidence for designing
an efficient and profitable layout. Instead of having data on comprehensive customer behavior in
the store, we have sales data from the register point of view.
3 The Model
As a result of the data acquisition process described in the Background section, we were able to
obtain an hourly breakdown of the number of items, total sales, and number of customers that
purchased items at the local QFC from which we collected our data. The data presents a 24-hour
snapshot of a standard day in the grocery store. The data was provided as individual printouts of each
department’s activity for the day, so the first step was to transcribe all of the information
from physical paper form into an Excel spreadsheet. The results are presented in the Appendix.
The next step was to separate and normalize each of the different activity indicators based
on their departmental, as well as hourly, totals. In this way, we transformed the raw data into
standardized distributions whose area under the curve summed to one. Specifically, we separated
the data into three different 24 × 10 matrices (i.e. items, sales, and customers), where the rows
of the matrix represent the hourly data for a 24-hour time period and the columns represent
each of the 10 departments. For each of these matrices, we normalized each entry by its daily
departmental totals, i.e. for each department (or column) we divided each entry in the column by
the summed total of the column:
MATLAB Code:
for i = 1:10
items_normD(:,i) = items_raw(:,i)/sum(items_raw(:,i));
sales_normD(:,i) = sales_raw(:,i)/sum(sales_raw(:,i));
cust_normD(:,i) = cust_raw(:,i)/sum(cust_raw(:,i));
end
Additionally, we normalized each of the 24 rows (hourly data) by the row sum of the activity for
that particular hour throughout all departments:
MATLAB Code:
for i = 1:24
items_normH(i,:) = items_raw(i,:)/sum(items_raw(i,:));
sales_normH(i,:) = sales_raw(i,:)/sum(sales_raw(i,:));
cust_normH(i,:) = cust_raw(i,:)/sum(cust_raw(i,:));
end
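The two MATLAB loops above can be mirrored and sanity-checked in a vectorized sketch (Python/NumPy here, with random stand-in data in place of the actual QFC spreadsheet; the variable names follow the MATLAB code):

```python
import numpy as np

# Random stand-in for the raw activity data: 24 hours x 10 departments.
rng = np.random.default_rng(0)
items_raw = rng.integers(1, 100, size=(24, 10)).astype(float)

# Normalize each column (department) by its daily total,
# as in the first MATLAB loop.
items_normD = items_raw / items_raw.sum(axis=0, keepdims=True)

# Normalize each row (hour) by its hourly store total,
# as in the second MATLAB loop.
items_normH = items_raw / items_raw.sum(axis=1, keepdims=True)

# Every departmental distribution now sums to one, as does every hourly one.
assert np.allclose(items_normD.sum(axis=0), 1.0)
assert np.allclose(items_normH.sum(axis=1), 1.0)
```

The assertions confirm the property described above: each normalized distribution has area (sum) one.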
These calculations were performed on a mid-2010 Macbook Pro, running Windows 7 - SP1, in
MATLAB R2016b Student edition. The calculations were instantaneous. This normalization
process resulted in 6 different datasets of customer activity, i.e. the number of items,
sales, and the number of customers, each normalized by their daily departmental totals and
additionally by their hourly store totals. We ran each of these data sets through the distance calculations,
described below, in order to generate different variations of the information, ultimately in search
of the best “goodness of fit.”
In order to create an MDS model of the above-mentioned data sets, our next step was to run
each data set through our distance algorithm in order to calculate a single scalar distance
between different departments. In other words, we iterated through each of the departments, a,
and compared them to each of the other departments’, b, hourly customer activity. We utilized
the Minkowski distance formula for our distance calculations [7]:
distance = (∑_{i=1}^{24} |r_{a,i} − r_{b,i}|^p)^{1/p}
where, i represents the hourly time period (e.g. i = 1 represents 12 o’clock AM to 1 o’clock AM),
a and b represent each of the different departments, and p represents the power of the Minkowski
algorithm. The most common powers, p, that are considered are 1, 2, and ∞. A power
of 1 is commonly referred to as the Manhattan distance, a power of 2 is commonly referred to as
the Euclidean distance, and a power of ∞ is commonly referred to as the supremum distance. We used R
version 3.3.2 on a Late 2013 MacBook Pro running macOS 10.12.1 to carry out our calculations,
which ran instantly. Specifically, we ran the following commands in R:
library(readr)
library(wordcloud)
items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "euclidean")
ll <- cmdscale(d, k = 2)
textplot(ll[, 1], ll[, 2], items[, 1], ann = FALSE)
Step-by-step, here is what the commands do:
library(readr)
library(wordcloud)
These commands import libraries that allow us to read the CSV file and create the plot.
items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")
This command reads in the formatted 24-dimensional vectors corresponding to each department
from the file “ItemsHourLabel.csv” into a table called “items”. The file “ItemsHourLabel.csv” con-
sists of rows that look like this:
Department,00:00 - 01:00,01:00 - 02:00,02:00 - 03:00,03:00 - 04:00,...
Packaged Produce,0,0,0,0.011299,0,0,0.022599,0.00565,0.022599,...
Deli,0.006135,0,0,0,0,0.02454,0.006135,0.02454,0.018405,0.02454,...
Bakery,0.001661,0,0,0,0,0.021595,0.019934,0.059801,0.043189,...
.
.
.
In this case, each row represents the number of items sold in each department in a given hour
divided by the total number of items sold in the department over the course of the day. The
department names at the beginning of each row are used for the graphic output.
d <- dist(items, method = "euclidean")
This command takes the table “items” and creates a matrix of distances between every row of the
table. Here, the distance method is specified as “euclidean”, which means that the distance between
row i and row j will be calculated as

d_{ij} = (∑_{k=1}^{24} |r_{i,k} − r_{j,k}|^2)^{1/2}.
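For intuition, the three Minkowski variants discussed in the Model section can be checked by hand on a pair of short, made-up profiles (a Python sketch; the vectors are illustrative only, not QFC data):

```python
import numpy as np

# Two made-up activity profiles, shortened to 4 entries for illustration.
a = np.array([0.0, 0.1, 0.2, 0.3])
b = np.array([0.1, 0.1, 0.4, 0.2])

manhattan = np.sum(np.abs(a - b))          # p = 1
euclidean = np.sqrt(np.sum((a - b) ** 2))  # p = 2
supremum = np.max(np.abs(a - b))           # p = infinity

print(round(manhattan, 4))  # 0.4
print(round(euclidean, 4))  # 0.2449
print(round(supremum, 4))   # 0.2
```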
ll <- cmdscale(d, k = 2)
Here, the k = 2 specifies a two-dimensional model. The output is a list of two-dimensional
coordinates, one for each object in the original set:
> head(ll, 10)
[,1] [,2]
[1,] -0.032088329 0.01770756
[2,] -0.027631806 0.02097795
[3,] -0.028511119 0.05441644
[4,] -0.013549396 -0.01713736
[5,] -0.086806729 -0.06648990
[6,] -0.007476898 -0.01173682
[7,] -0.010818238 -0.02144684
[8,] -0.001610913 0.18130208
[9,] -0.045186100 -0.12261632
[10,] 0.253679528 -0.03497679
textplot(ll[, 1], ll[, 2], items[, 1], ann = FALSE)
This command plots the result with the names of the departments. ll[,1], ll[,2] specifies
that the first column of ll gives the x-coordinates and the second column gives the y-coordinates.
items[,1] specifies that the first column of the table “items” gives the labels for the data points.
ann = FALSE removes the x and y labels from the plot. The results of these commands are
presented in the following section.
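For readers curious what cmdscale does internally, classical (Torgerson) MDS double-centers the squared distance matrix and embeds the objects using its top eigenvectors. The following Python sketch of that standard construction (not R's implementation itself) recovers a configuration whose pairwise distances match the input:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed objects in k dimensions from a distance matrix D
    (the standard Torgerson construction)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k], 0))

# Three hypothetical points; their distances are recovered exactly
# (up to rotation and reflection) by the 2-D embedding.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, k=2)
D2 = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
print(np.allclose(D, D2))  # True
```

Because the input distances are exactly Euclidean, the embedding reproduces them; with real data and a small k, the fit is only approximate, which is what the goodness of fit in Section 5 measures.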
4 Results
4.1 Hourly
In order to draw conclusions about the two-dimensional representations of our data, we can compare
them to the original data after it has been normalized by the hourly store totals. The first of
these plots shows the items per hour:
We immediately see that the dairy department and fresh produce department differ from the rest
of the data. Similarly, the coffee shop and bakery differ significantly. We then wish to find two
features of the data that may be causing these differences and that can be used as the dimensions of
our plot. A plot of the items sold over the course of the day in each department follows:
We can see that the dairy department and fresh produce department both sell more than double
any other department at their respective peaks, which occur at approximately the same time in
the day. So, the horizontal dimension of our 2D representation of the data corresponds to this
large peak between the hours of 12 p.m. and 8 p.m. This is further supported by the fact that
the dry goods department and the bakery follow this trend to a lesser degree (less than dairy
and fresh produce but more than the other departments), so they are closer to the right side
of our plot. Nothing immediately stands out from the raw data to indicate that the coffee
shop and the bakery differ from the rest of the departments in any meaningful way. We can
instead look at the normalized data to see what may be the cause of this vertical distance in the plot:
Here, we see that the coffee shop and the bakery sell the majority of the total items sold in the store
between about 6 a.m. and 9 a.m. This does seem to make sense, as many people may be purchasing
coffee and/or baked goods in the morning for breakfast. However, this second dimension tells us
that the departments differ in the times at which they are the most active, which we already knew
from our first dimension and the fact that our data is separated by departments and time intervals.
Consequently, this second dimension is not very useful.
Examining the other two 2D plots of the data normalized by hourly totals, i.e. sales and
customer count, leads to similar conclusions. That is, the axes of the plots are dependent on the times
at which business activity spikes in each department. If we now consider the example of the sales,
we see that the 2D representation is essentially the same as with the previous dataset:
While the distances are altered slightly, the plot is otherwise simply inverted. The results for the
customer data are very similar and are included in the Appendix.
4.2 Daily
Similar to when the data was normalized by the hourly totals, the 2D representations of our data
normalized by daily totals exhibit a relationship between departments that are busiest at the same
times:
For example, in the above plot of the items sold in each department, we see that the coffee shop
is far away from the seafood department. By looking at the raw data of the number of items sold
per department over the course of the day (included above in this section), there does not seem to
be anything contrasting the coffee shop and seafood in any meaningful way. Instead, we can look
directly at the normalized data:
We can see that the coffee shop is the busiest early in the day between 9 a.m. and 12 p.m. with
another spike around 2 p.m. Conversely, the seafood department does the most business between
3 p.m. and 6 p.m. The rest of the departments, other than the sushi department, seem to increase
their business steadily throughout the day and peak in the late afternoon. This leads us to the
conclusion that one axis in our plots corresponds to the time at which each department does most
of its business. However, there is also a second dimension that appears to depend only on sushi.
Looking at the 2D representation of the sales over the course of the day, we again see this strange
distance between the sushi department and the rest of the store:
When we look at the raw sales data for the sushi department, the only aspects that stand out are
the fact that the department does relatively little business and that the department only has three
time periods when there are any transactions at all. There are two spikes around lunch time and
again around dinner time, but there is another single sushi sale between midnight and 1 a.m.
The "sushi dimension" could be a result of either the two periods of activity or the fact that the
sushi department is one of the only departments to make a sale at the late hour. The former does
not seem to be the case because all of the departments go through a rise and fall of sales over
the course of a day. Alternatively, if the latter is true, the “sushi dimension” is not particularly
interesting since we are only analyzing one day’s worth of data and the single sale is more than
likely not indicative of a trend of late night sushi purchases. In either case, the second dimension
of our plot is not really helpful in determining the similarity of any two departments. So, we can
perform another dimension reduction in order to create a one-dimensional model for our data.
The plot of customer data was omitted from the discussion because of its similarity to the item
and sales data sets. The results are presented in the Appendix. Our next step was to consider
adjustments to the Minkowski powers and MDS dimensions in our model.
5 Adjustments and Extensions
5.1 Goodness of Fit
Our ultimate goal in generating different variations of the MDS model was to find a model with the
optimal “goodness of fit” (GoF) for each of the above-mentioned data sets. Goodness of fit is a
measure of how well the MDS model fits the original data based on a choice of MDS dimensions and
Minkowski powers. For each of the different customer activities (items, sales, and customers), and
for the two different normalization methods by hour and by department (or by day), we evaluated
how changing the MDS dimension and changing the Minkowski power affected the goodness of fit
of our model. We considered each of the MDS dimensions between 1 and 9 because our model
contained 10 departments. As the dimension of our MDS model is increased, we expected to see the
goodness of fit increase accordingly. We also considered the 3 most common Minkowski powers,
p = 1 which corresponds to the Manhattan distance or 1-norm, p = 2 which corresponds to the
Euclidean distance or 2-norm, and p = ∞ which corresponds to the maximum distance or infinity
norm.
We can use R to find the GoF data in a similar fashion to how we obtained our original model.
The entire code is included below:
library(wordcloud)
items <- read.csv(file = "ItemsDayLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "euclidean")  # 2-norm
# d <- dist(items, method = "manhattan")  # 1-norm
# d <- dist(items, method = "maximum")    # sup norm
cmdscale(d, k = 1, eig = TRUE)$GOF  # k is the dimension
We can choose between one of the three distance measures depending on which norm we are testing.
Similarly, we can use the following command to change dimensions:
cmdscale(d, k = 1, eig = TRUE)$GOF
This command returns a goodness of fit value between 0 and 1, where a value of 1 indicates a
perfect fit, or direct correlation, and a value of 0 indicates uniform randomness. k = 1 corresponds
to the MDS dimension, which we let range from 1 to n − 1 = 9, where n = 10 is the number of
departments. The results are presented in the graphs below:
For the customer data, we can see that a Minkowski power of 1, the Manhattan distance, seems
to produce models with the best goodness of fit over most MDS dimensions. In other words, the
red line is consistently higher than the rest. Next, we are interested in finding the lowest MDS
dimension that sufficiently models the data. For the customers by department (or day) data, we
see that a dimension of 1 leads to a GoF of about 0.46. While this is acceptable in some situations,
we also noticed that raising the dimension of our MDS model to 2 increases our GoF to 0.78.
Therefore, to optimize our MDS model for this particular data set, we chose a Minkowski power
of 1 and an MDS dimension of 2. On the other hand, if we examine the customers by hour plot,
we can see that the Manhattan distance in a 1-D MDS model produces a goodness of fit of 0.72.
Therefore, this particular set of choices is sufficient to capture the inherent trends present within
our original data set.
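As we understand it, the GOF value returned by cmdscale(..., eig = TRUE) is the fraction of (absolute) eigenvalue mass captured by the first k coordinates of the classical-MDS embedding. A hedged Python sketch of that ratio, assuming the standard double-centering construction:

```python
import numpy as np

def goodness_of_fit(D, k):
    """Share of absolute eigenvalue mass captured by a k-dimensional
    classical-MDS embedding of the distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    eigvals = np.sort(np.linalg.eigvalsh(B))[::-1]
    return eigvals[:k].sum() / np.abs(eigvals).sum()

# Three collinear points embed perfectly in one dimension, so GoF is ~1.
pts = np.array([0.0, 1.0, 3.0])
D = np.abs(pts[:, None] - pts[None, :])
print(round(goodness_of_fit(D, 1), 6))  # 1.0
```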
We noticed a similar trend in the items and sales data. The Manhattan Minkowski distance
calculation, i.e. p = 1, seems to produce the best GoF over most of the MDS dimensions between 1
and 9. Examining the plots of items and sales by department, we see that 1-D MDS models do not
sufficiently encapsulate the multi-dimensional interactions present in these data sets, producing a
GoF of 0.45 and 0.51, respectively. However, if we examine the GoF for these data sets in a
2-D MDS model, 0.77 and 0.76 respectively, we can see that there is a significant increase,
indicating that a 2-D MDS model is a significantly better fit for these data sets. Additionally,
if we examine the items and sales per hour, we notice that the 1-D Manhattan MDS models seem
to be sufficient for modeling the original data set, producing GoFs of 0.76 and 0.81, respectively.
Goodness of fit tables for each of these data sets are presented in the Appendix.
5.2 Changing the Dimension
In formulating our problem, we made the assumption that our one day of data is meaningful in the
larger scheme of business at QFC. Although no single day can be indicative of the general patterns
at the store, we are working under the assumption that there are some trends present in our data
that may provide insight into the store in general. We could improve our model by obtaining more
data from QFC, at which point we may be able to have more evidence that any relationships we
find between departments are accurate. However, we would need a lot of data over a long period
of time in order to proceed in this manner. Seeing as how this data is probably very valuable to
the company and how difficult it was for us to obtain a single day’s worth of data, this is not a
practical way forward.
As we saw in Section 5.1, calculating our plots using one dimension and the Manhattan distance
seemed to produce a high enough goodness of fit. So, we can perform our scaling again in 1D rather
than 2D in an attempt to remove the excess dimension we saw in our original results. As has been
the case so far, we expect this relationship to depend on the time of day at which each department
does the most business. In any case, we can alter our R code slightly to reflect this change in our
model:
library(readr)
library(wordcloud)
items <- read.csv(file = "SalesHourLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "manhattan")
ll <- cmdscale(d, k = 1)
# Column of zeros used to plot a line in one dimension
textplot(ll, c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), items[, 1], yaxt = 'n', ann = FALSE)
If we now compare our raw data to our 1D representations, we see a stronger relationship between
the dimension and the data itself. Consider first the number of items sold over the course of each
hour:
The fresh produce department and the dairy department sell the most items at their peaks. This
is reflected in the plot as those two departments are the furthest away from the rest. In fact, if we
go through the lines from top to bottom in the plot on the left, we will see that this is exactly the
order in which the departments appear from right to left in the second plot. We can see the same
relationship reflected in the plots for sales per hour and customers per hour:
To verify this trend, also notice that the fresh produce department has the highest peak in the
sales data and is further to the right than the dairy department. Similarly, in the customer data,
the dairy department is further to the right of the fresh produce department because the number
of customers served between the hours of 4 p.m. and 5 p.m. is greater. So, the distances in our
scaled plots seem to correspond to the height of each peak between 4 p.m. and 5 p.m., which
provides insight into the maximum activity at what is the busiest hour at QFC.
As was the case in the original 2D MDS plots of the data normalized by daily totals, there
seems to be something unique about the sushi department in the 1D representations. In particular,
this relationship is not immediately obvious from the raw data itself. We can first compare the
normalized plot of customers served over the course of the day to the 1D representation of the
departments:
What stands out in the plot on the right is the fact that the coffee shop and the sushi department
are the furthest apart. When we look at the plot on the left, we notice that the coffee shop serves
the highest percentage of its total customers early in the day. In particular, it serves the highest
percentage of any department between 10 a.m. and 11 a.m., while the sushi department serves
none. We know that this particular hour, rather than any of the other morning hours, accounts for
the distances in the 1D plot because of the seafood department. That is, the seafood department
does not serve its first customer until this hour, and it is closer to the rest of the departments than
to the sushi department. If the plot were reflecting the differences at an earlier time, then the
seafood department would presumably be right next to the sushi department since neither serves
a customer. This relationship is again apparent in the other two datasets:
5.3 Takeaways
We have seen that a 1D representation of our data is the most fitting when it has been normalized
by hourly totals. The GoF values for these three datasets are reasonably high, and the resulting
plots accurately reflect the peak activity in each department at the busiest hour. This information
can be useful in planning how to organize a store when the most business is being done.
Conversely, 2D representations of the data when normalized by daily totals seem to be more
useful than the 1D plots. While the 2D plots have one dimension relating to activity at certain
time periods throughout the day (e.g. breakfast time, lunch time, and dinner time) and another
relating to the business of departments at one particular hour, the 1D plots only give us insight
into the latter. This information is ultimately not enough to support meaningful conclusions
about the activity patterns in each department because we only have data from
one day. Despite the superfluous second dimension, the 2D plots still have one useful dimension,
whereas the 1D plots do not have any.
Hence, we can best utilize our data to evaluate peak traffic between 4 p.m. and 5 p.m. by
normalizing by hourly totals and comparing one-dimensional representations of the departments.
Additionally, we can see broad trends in business by normalizing our data by daily totals and
representing it in two dimensions. In order to verify the apparent trends, though, we would still
need to obtain a larger dataset.
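The pipeline summarized above — normalize by hourly totals, compute pairwise Minkowski distances between departments, then reduce to one dimension — can also be sketched outside of our MATLAB/R workflow. The following Python sketch is illustrative only: it uses a random stand-in for the 24 × 10 count matrix (hypothetical values, not the QFC data) and performs classical MDS by hand rather than calling R's cmdscale:

```python
import numpy as np

# Hypothetical stand-in for the raw 24 x 10 count matrix (hours x departments);
# the real values are the QFC printouts transcribed in the Appendix.
rng = np.random.default_rng(0)
raw = rng.integers(1, 50, size=(24, 10)).astype(float)

# Normalize each hour (row) by that hour's store-wide total,
# mirroring the by-hour MATLAB loop in Section 3.
row_sums = raw.sum(axis=1, keepdims=True)
norm_hour = raw / row_sums

# Pairwise Manhattan (p = 1 Minkowski) distances between department columns.
profiles = norm_hour.T                                    # 10 departments x 24 hours
D = np.abs(profiles[:, None, :] - profiles[None, :, :]).sum(axis=2)

# Classical MDS: double-center the squared distances and eigendecompose.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 1D representation: coordinate along the leading eigenvector.
coords_1d = eigvecs[:, 0] * np.sqrt(max(eigvals[0], 0.0))
print(coords_1d)
```

With the real data, plotting coords_1d on a number line would give the kind of 1D department plot shown in the Appendix, with departments that are busy at the same hours landing near one another.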
6 Conclusion
6.1 Object of study
From the results above, we can see that several departments, such as meat, fresh produce, and
seafood, are similarly busy at the same times. To avoid congestion in parts of the grocery store
and to maximize the amount of money customers are likely to spend, the store owner would do
well to separate these departments.
6.2 Limitations
First of all, we only have data for one particular day in one store. This introduces bias into our
data and makes our model less credible. Also, our data are statistics from the register point of
view: what we have are the actual purchases in each department, which captures only part of
customer behavior. Further, in reality the arrangement of departments may not be flexible; it can
be restricted by the locations of warehouses or workbenches. For instance, the sushi department
needs a workbench to make fresh sushi every day, a department containing heavy items would
prefer to be close to its warehouse, and a coffee shop would likely be close to the entrance or exit.
For these departments, the location and size are predetermined when the store is constructed.
6.3 Future work
In terms of modeling the busyness of departments, we currently rely on the number of customers,
the number of items sold, and the revenue in each department. These are basic observations from
the registers. What happens before people check out would also be worth considering. If possible,
16
we could collect data on the time an average customer spends in each department, regardless of
whether he/she buys something there. Similarly, the number of customers who physically appear
in each department also measures its busyness.
In terms of generating a layout that maximizes the sales in the store, there are many aspects
worth deeper discussion. In addition to the locations of different departments, we could take the
sizes of departments into consideration. The detailed placement and size of aisles, shelves, and
items on each shelf would also have a significant impact on sales. This would be more realistic,
since it is easier to make changes to them than to the predetermined locations of departments.
Completely different and more complicated modeling methods would be required to identify the
interrelationships of locations and sizes between different aisles and shelves.
17
References
[1] Richardson, M. W. (1938). Psychological Bulletin, 35, 659-660.
[2] Torgerson, W. S. (1952). Psychometrika, 17, 401-419. (The first major MDS breakthrough.)
[3] Young, F. W. (1984). Research Methods for Multimode Data Analysis in the Behavioral
Sciences. H. G. Law, C. W. Snyder, J. Hattie, and R. P. MacDonald, eds. (An advanced
treatment of the most general models in MDS. Geometrically oriented. Interesting political
science example of a wide range of MDS models applied to one set of data.)
[4] Machado JT, Mata ME (2015) Analysis of World Economic Variables Using Multidimensional
Scaling. PLOS ONE 10(3): e0121277
http://dx.doi.org/10.1371/journal.pone.0121277
[5] Anand S, Sen A. The Income Component of the Human Development Index. Journal of
Human Development. 2000;1
[6] World Development Indicators, The World Bank, Time series, 17-Nov-2016
http://data.worldbank.org/data-catalog/world-development-indicators
[7] Wikipedia contributors. "Minkowski distance." Wikipedia, The Free Encyclopedia. Wikipedia,
The Free Encyclopedia, 1 Nov. 2016. Web. 1 Nov. 2016.
https://en.wikipedia.org/w/index.php?title=Minkowski_distance&oldid=747257101
[8] Editor of Real Simple. "The Secrets Behind Your Grocery Store’s Layout." Real Simple. N.p.,
2012. Web. 29 Nov. 2016.
http://www.realsimple.com/food-recipes/shopping-storing/
more-shopping-storing/grocery-store-layout
[9] Boros, P., Fehér, O., Lakner, Z. et al. Ann Oper Res (2016) 238: 27. doi:10.1007/s10479-015-1986-2.
http://link.springer.com/article/10.1007/s10479-015-1986-2
[10] Li, Chen. "A Facility Layout Design Methodology for Retail Environments." D-Scholarship.
N.p., 3 May 2010. Web. 29 Nov. 2016.
http://d-scholarship.pitt.edu/9670/1/Dissertation_ChenLi_2010.pdf
[11] Ozgormus, Elif. "Optimization of Block Layout for Grocery Stores." Auburn University. N.p.,
9 May 2015. Web. 29 Nov. 2016.
https://etd.auburn.edu/bitstream/handle/10415/4494/Eozgormusphd.pdf;sequence=2
18
Appendix
Link to Google Drive:
https://drive.google.com/drive/folders/0B-8II7_BkXIbTmZ0aEREQ2RzSzA?usp=sharing
A.1 Raw Data
The printout of data we got from QFC. There are 10 pages in total, one for each department.
19
20
21
22
23
24
25
26
27
28
29
We converted the data into an Excel table:
Plot of items sold over the course of the day:
30
Plot of sales made over the course of the day:
Plot of customers served over the course of the day:
31
32
A.2 Normalized Data
Normalized-by-hour spreadsheet:
Items normalized by hour:
33
Sales normalized by hour:
Customers normalized by hour:
Normalized-by-day spreadsheet:
34
35
Items normalized by day:
36
Sales normalized by day:
Customers normalized by day:
37
38
A.3 2D MDS Results
2D plot of items sold, normalized by day:
2D plot of sales made, normalized by day:
2D plot of customers served, normalized by day:
39
2D plot of items sold, normalized by hour:
2D plot of sales made, normalized by hour:
40
2D plot of customers served, normalized by hour:
41
A.4 1D MDS Results
1D plot of items sold, normalized by day:
1D plot of sales made, normalized by day:
1D plot of customers served, normalized by day:
42
1D plot of items sold, normalized by hour:
1D plot of sales made, normalized by hour:
43
1D plot of customers served, normalized by hour:
44
A.5 Goodness of Fit Tables
Table 1: Customers by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.3963199 0.4584881 0.3978179
2 0.7600668 0.7807252 0.719283
3 0.8587553 0.8875351 0.866924
4 0.9253756 0.93352 0.9277296
5 0.9536219 0.9620621 0.9548165
6 0.9788168 0.9811779 0.9647046
7 0.9909951 0.99114 0.9672805
8 0.9963022 0.9968744 0.9672805
9 1 0.9968744 0.9672805
Table 2: Customers by Hour
Dim. Euclidean Manhattan Supremum
1 0.5850668 0.7160555 0.4549306
2 0.776274 0.8507424 0.6912069
3 0.8732725 0.9249754 0.8543061
4 0.9398181 0.9695998 0.920125
5 0.9792807 0.9774312 0.9595707
6 0.9920085 0.9774312 0.9745372
7 0.9962814 0.9774312 0.9773711
8 0.9983624 0.9774312 0.9773711
9 1 0.9774312 0.9773711
Table 3: Items by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.4293083 0.4496683 0.4001874
2 0.7539912 0.7742156 0.6793937
3 0.8888276 0.9034419 0.8202917
4 0.933347 0.9557358 0.8875166
5 0.9597362 0.9777854 0.9168588
6 0.9839133 0.9903894 0.9333318
7 0.9932868 0.9957496 0.9356707
8 0.9976301 0.9989588 0.9356707
9 1 0.9989588 0.9356707
45
Table 4: Items by Hour
Dim. Euclidean Manhattan Supremum
1 0.6427679 0.757986 0.4270805
2 0.8166962 0.8722931 0.6788556
3 0.9056558 0.9410536 0.8985873
4 0.9580441 0.9824214 0.9485446
5 0.9869886 0.9876273 0.9670813
6 0.9944574 0.9888287 0.9815083
7 0.9974807 0.9892658 0.9870639
8 0.9991767 0.9892658 0.9870639
9 1 0.9892658 0.9870639
Table 5: Sales by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.4846022 0.5108597 0.4146601
2 0.7679321 0.7631536 0.6745012
3 0.8678356 0.8797308 0.8164288
4 0.9385633 0.9405769 0.9091716
5 0.9595338 0.971421 0.9468887
6 0.9772032 0.9925368 0.9649131
7 0.990547 0.9976865 0.9729114
8 0.9965854 0.9997359 0.9729114
9 1 0.9997359 0.9729114
Table 6: Sales by Hour
Dim. Euclidean Manhattan Supremum
1 0.5269157 0.6498416 0.3607062
2 0.7274713 0.8119829 0.5933631
3 0.8401425 0.8910053 0.7961126
4 0.9072517 0.9563971 0.8790723
5 0.953552 0.9757764 0.9359311
6 0.9754582 0.9857487 0.9747406
7 0.9928391 0.9885393 0.9834409
8 0.9971796 0.9885393 0.9850726
9 1 0.9885393 0.9850726
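The plateaus in the Manhattan and Supremum columns of the tables above are consistent with cmdscale using only the positive eigenvalues of the double-centered matrix when fewer than k are positive. A minimal numpy sketch of the goodness-of-fit computation, run here on hypothetical random distances rather than our store data, reproduces that behavior:

```python
import numpy as np

def classical_mds_gof(D, k):
    # Double-center the squared distance matrix, as classical MDS does.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eig = np.sort(np.linalg.eigvalsh(B))[::-1]
    # Only positive eigenvalues contribute coordinates; clamping the
    # top-k sum at zero mirrors the plateaus seen in the tables.
    top = np.maximum(eig[:k], 0.0).sum()
    # Two normalizations, as in cmdscale's GOF: by sum(|eig|)
    # and by the sum of the positive eigenvalues.
    return top / np.abs(eig).sum(), top / np.maximum(eig, 0.0).sum()

# Hypothetical 10-department distance matrix (Manhattan on random profiles).
rng = np.random.default_rng(1)
X = rng.random((10, 24))
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

gofs = [classical_mds_gof(D, k)[0] for k in range(1, 10)]
print([round(g, 4) for g in gofs])
```

Because the clamped top-k sum can only grow with k, the resulting GoF values are nondecreasing in the dimension and level off once the remaining eigenvalues are non-positive, exactly the pattern visible in Tables 1-6.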
46

Grocery Store Classification Model

  • 1.
    Math 381 ProjectTwo Group 9 Alex Forney Keren Lai Gerard Trimberger Xinyu Zhou December 7, 2016 1
  • 2.
    1 Introduction When webuy products in grocery store, we find the things we want to buy are usually not located near each another, and it is common to find that one part of the store is crowded while others have few customers. This may be because store managers or other higher-ups plan the store layout while taking into consideration the similarities of products’ sales. He/she may place items often purchased together in locations farther apart in the store. So, customers may need to stay in the store longer, resulting in these customers seeing more items and potentially purchasing them. Another added benefit may be the reduction of congestion is departments with popular items. In our project, we seek to find the relationships between different departments of a grocery store using multidimensional scaling (MDS). We will plot the activity of 10 different departments (Packaged Produce, Deli, Bakery, Dairy, Meat, Dry goods, Fresh Produce, Coffee shop, Seafood, and Sushi) in order to show the similarities and differences between them. The result of our study may provide insight into the planning of grocery stores and/or customer habits. 2 Background 2.1 Idea We began the brainstorming process by each formulating a list of topics that we were interested in, both mathematically and socially. We also created a list of our individual skill sets and experience that we felt was relevant to the project. We then spent time reading through each of our responses to get an idea of what type of project we could all find interesting. We all agreed that we wanted to do something related to a common situation that most people experience on a daily basis. It is always more interesting if people can directly relate to the project rather than working on something that they do not have personal experience with. Our second criteria, was that we each wanted to do something related to probabilities or Monte Carlo simulation. 
Keren and Xinyu are ACMS/Economics double majors so they were both interested in the processes involved in economic development. Our first formulation of the proposal involved comparing the total sales and overall market share of different car manufacturers. We wanted to build a Markov chain of different manufacturer states and how they relate, in order to predict how the current market share distribution would change over time. Ultimately, we felt that we would be unable to obtain the necessary data for an interesting Markov chain, i.e. the number or probability of a car owner moving from one manufacturer or another. Other outside factors, such as owning multiple cars, created additional problems that we eventually felt would hinder our progress. At this point, we decided to switch gears. While keeping the original overarching goals in mind, specifically a publicly relatable problem and something probability/simulation based, we formulated a new proposal that involved simulating a grocery store checkout process. We planned on contacting a local grocery store for real-life customer and item distribution data. Gerard went into his local QFC on Friday, November 18th. He asked to speak with the manager of the store, and presented the situation to her, asking specifically if we could obtain some data for customer checkout times, their number of items, and what types of register (Normal, Express, or Self- Checkout) that they utilized to make their purchase. The manager suggested that he call back on Saturday (11/19) when the bookkeeper was present, because the bookkeeper is the one with access to that type of information. When Gerard called back on 11/19 he was informed that the bookkeeper had called in sick, and that he would either have to call back on Monday or to try a different store. The manager provided a phone number to another store in the region that had their bookkeepers present on 11/19. 
Gerard followed through with this lead and presented the situation to the other store manager. This new store manager did not seem to comprehend the issue and advised Gerard to contact QFC Corporate for more information. Gerard then called the Corporate phone number provided and left a message on their answering machine informing them that we would like to talk as soon as possible. Gerard waited until Monday morning (11/21), and when he had not heard back from corporate, decided to contact the manager at the local QFC once again. This time he was able to speak directly to the bookkeeper of the store, and confirmed that there was customer data available in the computer system but that it may not be exactly what we were looking for. He provided his name and number and was told that if he did not hear back from the store later that day, to come in on Tuesday (11/22). Gerard did not receive a call 2
  • 3.
    during this time,so on Tuesday morning around 10 am he went in to the local QFC in person to observe the situation firsthand. Upon speaking to the manager, she led Gerard into the backroom of the store and introduced him to her bookkeeper. From this point, Gerard worked directly with the bookkeeper to obtain data that he felt could be useful to our project. Gerard was able to obtain an hour by hour breakdown of the activity (i.e. item count, sales amount, and customer count) of each of the 10 departments of the store (packaged produce, deli, bakery, dairy, meat, dry goods, fresh produce, seafood, coffee, and sushi). Unfortunately, this was not the data that we had originally intended on receiving for our grocery store checkout simulation, but that did not mean that it wasn’t useful. We met up as a team and discussed how we wanted to move forward with this new information. We brainstormed a proposal for a new project that we could formulate, based on the data that we were provided. We settled on creating an MDS model comparing the different departments on an hour by hour basis, based on their normalized distributions for each indicator. The details of the model are explained below. 2.2 Similar Modelings Multidimensional scaling (MDS) is a set of data analysis techniques that display the structure of distance-like data as a geometrical picture. Evolving from the work of Richardson, [1] Torgerson proposed the first MDS method and coined the term.[2]. MDS is now a general analysis technique used in a wide variety of fields, such as marketing, sociology, economies etc. In 1984, Young and Hamer published a book on the theory and applications of MDS, and they presented applications of MDS in marketing. [3] J.A. Tenreiro Machado and Maria Eugenia Mata from Portugal analyzed the world economic variables using multidimensional scaling[4] that is similar as we do. 
Tenreiro and Mata analyze the evolution of GDP per capita,[5] international trade openness, life expectancy and education tertiary enrollment in 14 countries from 1977 up to 2012[6] using MDS method. In their study, the objects are country economies characterized by means of a given set of variables evaluated during a given time period. They calculated the distance between i-th and j-th objects by taking difference of economic variables for them in several years period. They plot countries on the graph and distinguish countries by multiple aspects like human welfare, quality of life and growth rate. Tenreiro and Mata concluded from the graphs that the analysis on 14 countries over the last 36 years under MDS techniques proves that a large gap separates Asian partners from converging to the North-American and Western-European developed countries, in terms of potential warfare, economic development, and social welfare. The modeling Tenreiro and Mata use is similar as we do. In our projects, the objects are departments in grocery store. They studied the difference/similarity between country economies through years, while we study the difference/similarity between different departments through hours in a day. In Tenreiro and Mata’s research, the countries developed at the same time are close on the graphs; in our study, the store departments that are busy at the same time are close on the graphs. However, the database of our project is much smaller than theirs. We compared departments from the data of the number of items sale, customers’ number and the total amount sale at a given time period. Tenreiro and Mata’ s data is more dimensional, from GDP per capita, economic openness, life expectancy, and tertiary education etc. And also our project studies similarity of busyness from another side: percentage of each department sale at the given hour. 
2.3 Similar Problems The objective of our project is to help the grocery store owner to plan the layout of different blocks of store and increase store’s sale by finding the interrelationships of busyness between products from different departments. The problem of how to layout a grocery store to maximize the purchases of the average customer is discussed in many works, through both aspects of merchandising and mathematics. As mentioned by one article, grab-and-go items such as bottled water and snacks should be placed near the entrance; Deli and Coffee Bar should be placed in one of the front corners to attract hungry customers; Cooking Ingredients, and Canned Goods should be placed in the center aisles to draw customers to walk deeper and shop through nonessential items.[8] There are also many economists and mathematicians working on similar problems. In the paper written by Boros, P., Fehér, O., Lakner, Z., traveling salesman problem (TSP) was used to maximize the shortest walking distance 3
  • 4.
    for each customeraccording to different arrangements of the departments in the store.[9] The results showed that the total walking distances of customers increased in the proposed new layout.[9] Chen Li from University of Pittsburgh modeled the department allocation design problem as a multiple knapsack problem and optimized the adjacency preference of departments to get possible maximum exposure of items in the store, and try to give out an effective layout.[10] Similar optimization was used in the paper by Elif Ozgormus from Auburn University.[11] To access the revenue of the store layout, she used stochastic simulation and classified departments in to groups where customers often purchase items from them concurrently.[11] By limiting space, unit revenue production and department adjacency in the store, she optimized the impulse purchase and customer satisfaction to get a desired layout.[11] All three papers have similar basic objectives to ours. The paper by Boros et al. was aiming to maximize the total walking distance of each customer and thus promote sales of the store.[9] Li’s paper also focused on profit maximization but with considerations of the exposure of the items and adjacencies between departments.[10] He is the first person to incorporate aisle structure, depart- ment allocation, and departmental layout together into a comprehensive research.[10] The paper by Ozgormus took revenue and adjacency into consideration and worked on the model specifically for grocery stores towards the objectives of maximizing revenue and adjacency satisfaction.[11] In our paper, we simply focus on the busyness of different departments and use multidimensional scaling to model the similarities between each department and thus provide solid evidence for designing an efficient and profitable layout. Instead of having data on comprehensive customer behavior in the store, we have data of sales from the register point of view. 
3 The Model As a result of the data acquisition process described in the Background section, we were able to obtain an hourly breakdown of the number of items, total sales, and number of customers that purchase items from the local QFC that we collected from. The data presents a 24-hour snapshot of a standard day within the grocery store. The data was presented in individual printouts of each department’s activity for the day, therefore the first step was to transcribe all of the information from physical paper form onto an Excel spreadsheet. The results are presented in the Appendix. The next step was to separate and normalize each of the different activity indicators based on their departmental, as well as hourly, totals. In this way, we transformed the raw data into standardized distributions whose area under the curve summed to one. Specifically, we separated the data into three different 24 × 10 matrices (i.e. items, sales, and customers), where the rows of the matrix represent the hourly data for a 24-hour time period and the columns represent the each of the 10 departments. For each of these matrices we normalized each entry by their daily departmental totals, i.e. for each department (or column) we divided each entry in the column by the summed total of the column: MATLAB Code: for i = 1:10 items_normD(:,i) = items_raw(:,i)/sum(items_raw(:,i)); sales_normD(:,i) = sales_raw(:,i)/sum(sales_raw(:,i)); cust_normD(:,i) = cust_raw(:,i)/sum(cust_raw(:,i)); end Additionally, we normalized each of the 24 rows (hourly data) by the row sum of the activity for that particular hour throughout all departments: MATLAB Code: for i = 1:24 items_normH(i,:) = items_raw(i,:)/sum(items_raw(i,:)); sales_normH(i,:) = sales_raw(i,:)/sum(sales_raw(i,:)); cust_normH(i,:) = cust_raw(i,:)/sum(cust_raw(i,:)); end These calculations were performed on a mid-2010 Macbook Pro, running Windows 7 - SP1, in MATLAB R2016b Student edition. The calculations were instantaneous. 
The result of this nor- malization process resulted in 6 different datasets of customer activity, i.e. the number of items, 4
  • 5.
    sales, and thenumber of customers each normalized by their daily departmental totals and addition- ally by their hourly store totals. We ran each of these data sets through the distance calculations, described below, in order to generate different variations of the information, ultimately in search of the best “goodness of fit.” In order to create an MDS model of the above mentioned data sets, our next step was to run each data sets through our distance algorithm in order to calculate a single dimensional distance between different departments. In other words, we iterated through each of the departments, a, and compared them to each of the other department’s, b, hourly customer activity. We utilized the Minkowski distance formula for our distance calculations [7]: distance = 24 i=1 |ra,i − rb,i|p 1 p where, i represents the hourly time period (e.g. i = 1 represents 12 o’clock AM to 1 o’clock AM), a and b represent each of the different departments, and p represents the power of the Minkowski algorithm. The most common powers, p, that are considered are powers of 1, 2, and ∞. A power of 1 is commonly referred to as the Manhattan distance, a power of 2 is commonly referred to as the Euclidean distance, and power ∞ is commonly referred to as Supremum distance. We used R version 3.3.2 on a Late 2013 MacBook Pro running macOS 10.12.1 to carry out our calculations, which ran instantly. Specifically, we ran the following commands in R: library ( readr ) library ( wordcloud ) items <− read . csv ( f i l e = "ItemsHourLabel . csv " , head = TRUE, sep = " , " ) d <− d i s t ( items , method = " e u c l i d i a n " ) l l <− cmdscale (d , k = 2) textplot ( l l [ , 1 ] , l l [ , 2 ] , items [ , 1 ] , ann = FALSE) Step-by-step, here is what the commands do: library ( readr ) library ( wordcloud ) These commands import libraries that allow us to read the CSV file and create the plot. items <− read . csv ( f i l e = "ItemsHourLabel . 
csv " , head = TRUE, sep = " , " ) This command reads in the formatted 24-dimensional vectors corresponding to each department from the file “ItemsHourLabel.csv” into a table called “items”. The file “ItemsHourLabel.csv” con- sists of rows that look like this: Department,00:00 - 01:00,01:00 - 02:00,02:00 - 03:00,03:00 - 04:00,... Packaged Produce,0,0,0,0.011299,0,0,0.022599,0.00565,0.022599,... Deli,0.006135,0,0,0,0,0.02454,0.006135,0.02454,0.018405,0.02454,... Bakery,0.001661,0,0,0,0,0.021595,0.019934,0.059801,0.043189,... . . . In this case, each row represents the number of items sold in each department in a given hour divided by the total number of items sold in the department over the course of the day. The department names at the beginning of each row are used for the graphic output. d <− d i s t ( items , method = " e u c l i d i a n " ) This command takes the table “items” and creates a matrix of distances between every row of the table. Here, the distance method is specified as “euclidian”, which means that the distance between 5
  • 6.
    row i androw j will be calculated as dij = 24 i=1 |ra,i − rb,i| 2 . l l <− cmdscale (d , k = 2) Here, the k = 2 specifies a two-dimensional model. The output is a list of two-dimensional coor- dinates, one for each object in the original set: > head(ll, 10) [,1] [,2] [1,] -0.032088329 0.01770756 [2,] -0.027631806 0.02097795 [3,] -0.028511119 0.05441644 [4,] -0.013549396 -0.01713736 [5,] -0.086806729 -0.06648990 [6,] -0.007476898 -0.01173682 [7,] -0.010818238 -0.02144684 [8,] -0.001610913 0.18130208 [9,] -0.045186100 -0.12261632 [10,] 0.253679528 -0.03497679 textplot ( l l [ , 1 ] , l l [ , 2 ] , items [ , 1 ] , ann = FALSE) This command plots the result with the names of the departments. ll[,1], ll[,2] specifies that the first column of ll gives the x-coordinates and the second column gives the y-coordinates. items[,1] specifies that the first column of the table “items” gives the labels for the data points. ann = FALSE removes the x and y labels from the plot. The results of these commands are presented in the following section. 4 Results 4.1 Hourly In order to draw conclusions about the two-dimensional representation of our data, we can compare them to the original data after it has been normalized by the hourly store totals. The result of these datasets is the 2D plot of the items per hour: 6
  • 7.
    We immediately seethat the dairy department and fresh produce department differ from the rest of the data. Similarly, the coffee shop and bakery differ significantly. We then wish to find two differences in the data that may be causing the differences and can be used as the dimensions of our plot. A plot of the items sold over the course of the day in each department follows: We can see that the dairy department and fresh produce department both sell more than double any other department at their respective peaks, which occur at approximately the same time in the day. So, the horizontal dimension of our 2D representation of the data corresponds to this large peak between the hours of 12 p.m. and 8 p.m. This is further supported by the fact that the dry goods department and the bakery follow this trend to a lesser degree (less than dairy 7
  • 8.
    and fresh producebut more than the other departments), so they are closer to the right side of our plot. Nothing immediately stands out from the raw data that indicates that the coffee shop and the bakery differ from the rest of the departments in any meaningful way. We can in- stead look at the normalized data to see what may be the cause of this vertical distance in the plot: Here, we see that the coffee shop and the bakery sell the majority of the total items sold in the store between about 6 a.m. and 9 a.m. This does seem to make sense, as many people may be purchasing coffee and/or baked goods in the morning for breakfast. However, this second dimension tells us that the departments differ in the times at which they are the most active, which we already knew from our first dimension and the fact that our data is separated by departments and time intervals. Consequently, this second dimension is not very useful. Examining the other two 2D plots of the data normalized by hourly totals, i.e. sales and cus- tomer count, leads to similar conclusions. That is, the axes of the plots are dependent on the times at which business activity spikes in each department. If we now consider the example of the sales, we see that the 2D representation is essentially the same as with the previous dataset: 8
  • 9.
    While the distancesare altered slightly, the plot is otherwise simply inverted. The results for the customer data are very similar and are included in the Appendix. 4.2 Daily Similar to when the data was normalized by the hourly totals, the 2D representations of our data normalized by daily totals exhibits a relationship between departments that are busiest at the same times: For example, in the above plot of the items sold in each department, we see that the coffee shop is far away from the seafood department. By looking at the raw data of the number of items sold per department over the course of the day (included above in this section), there does not seem to be anything contrasting the coffee shop and seafood in any meaningful way. Instead, we can look directly at the normalized data: 9
  • 10.
    We can seethat the coffee shop is the busiest early in the day between 9 a.m. and 12 p.m. with another spike around 2 p.m. Conversely, the seafood department does the most business between 3 p.m. and 6 p.m. The rest of the departments, other than the sushi department, seem to increase their business steadily throughout the day and peak in the late afternoon. This leads us to the conclusion that one axis in our plots corresponds to the time at which each department does most of its business. However, there is also a second dimension that appears to depend only on sushi. Looking at the 2D representation of the sales over the course of the day, we again see this strange distance between the sushi department and the rest of the store: When we look at the raw sales data for the sushi department, the only aspects that stand out are the fact that the department does relatively little business and that the department only has three time periods when there are any transactions at all. There are two spikes around lunch time and again around dinner time, but there is another single sushi sale between midnight and 1 a.m. 10
  • 11.
    The "sushi dimension"could be a result of either the two periods of activity or the fact that the sushi department is one of the only departments to make a sale at the late hour. The former does not seem to be the case because all of the departments go through a rise and fall of sales over the course of a day. Alternatively, if the latter is true, the “sushi dimension” is not particularly interesting since we are only analyzing one day’s worth of data and the single sale is more than likely not indicative of a trend of late night sushi purchases. In either case, the second dimension of our plot is not really helpful in determining the similarity of any two departments. So, we can perform another dimension reduction in order to create a one-dimensional model for our data. The plot of customer data was omitted from the discussion because of its similarity to the item and sales data sets. The results are presented in the Appendix. Our next step was to consider adjustments to the Minkowski powers and MDS dimensions in our model. 5 Adjustments and Extensions 5.1 Goodness of Fit Our ultimate goal in generating different variations of the MDS model was to find a model with the optimal "goodness of fit," (GoF) for each of the above-mentioned data sets. Goodness of fit is a measure of how well the MDS model fits the original data based on a choice of MDS dimensions and Minkowski powers. For each of the different customer activities (items, sales, and customers), and for the two different normalization methods by hour and by department (or by day), we evaluated how changing the MDS dimension and changing the Minkowski power affected the goodness of fit of our model. We considered each of the MDS dimensions between 1 and 9 because our model contained 10 departments. As the dimension of our MDS model is increased we expected to see the goodness of fit increase accordingly. 
We also considered the 3 most common Minkowski powers: p = 1, which corresponds to the Manhattan distance or 1-norm; p = 2, which corresponds to the Euclidean distance or 2-norm; and p = ∞, which corresponds to the maximum distance or infinity norm. We can use R to find the GoF data in a similar fashion to how we obtained our original model. The entire code is included below:
library(wordcloud)
items <- read.csv(file = "ItemsDayLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "euclidean")    # 2-norm
# d <- dist(items, method = "manhattan")  # 1-norm
# d <- dist(items, method = "maximum")    # sup norm
cmdscale(d, k = 1, eig = TRUE)$GOF        # k is the dimension

We can choose between one of the three distance measures depending on which norm we are testing. Similarly, we can use the following command to change dimensions:

cmdscale(d, k = 1, eig = TRUE)$GOF

This command returns a goodness of fit value between 0 and 1, where a value of 1 indicates a perfect fit, or direct correlation, and a value of 0 indicates uniform randomness. Here k corresponds to the dimension of the MDS model, which we let range from 1 to n − 1 = 9, where n = 10 is the number of departments. The results are presented in the graphs below. For the customer data, we can see that a Minkowski power of 1 (Manhattan) seems to produce models with the best goodness of fit over most MDS dimensions; in other words, the red line is consistently higher than the rest. Next, we are interested in finding the lowest MDS dimension that sufficiently models the data. For the customers by department (or day) data, we see that a dimension of 1 leads to a GoF of about 0.46. While this is acceptable in some situations, we also noticed that by raising the dimension of our MDS model to 2, our GoF increased to 0.78. Therefore, to optimize our MDS model for this particular data set, we chose a Minkowski power of 1 and an MDS dimension of 2. On the other hand, if we examine the customers by hour plot, we can see that the Manhattan distance in a 1-D MDS model produces a goodness of fit of 0.72. Therefore, this particular set of choices is sufficient to capture the inherent trends present within our original data set.
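The three Minkowski powers behave quite differently on the same pair of activity vectors. A small illustrative comparison (hypothetical made-up numbers, not the QFC data; Python used only for the illustration):

```python
import numpy as np

# Hypothetical hourly activity vectors for two departments (made-up numbers).
a = np.array([3.0, 7.0, 1.0])
b = np.array([0.0, 3.0, 1.0])

manhattan = np.sum(np.abs(a - b))          # p = 1: sum of per-hour gaps
euclidean = np.sqrt(np.sum((a - b) ** 2))  # p = 2: straight-line distance
chebyshev = np.max(np.abs(a - b))          # p = inf: largest single gap

print(manhattan, euclidean, chebyshev)  # 7.0 5.0 4.0
```

The Manhattan distance weights every hour's difference equally, which may help explain why it fits hourly activity profiles better than the supremum norm, which only sees the single largest gap.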
We noticed a similar trend in the items and sales data. The Manhattan Minkowski distance calculation, i.e. p = 1, seems to produce the best GoF over most of the MDS dimensions between 1 and 9. Examining the plots of items and sales by department, we see that 1-D MDS models do not sufficiently encapsulate the multi-dimensional interactions present in these data sets, producing GoFs of 0.45 and 0.51, respectively. However, if we examine the GoF for these data sets in a 2-D MDS model, 0.77 and 0.76 respectively, we see a significant increase, indicating that a 2-D MDS model is a substantially better fit. Additionally, if we examine the items and sales per hour, we notice that the 1-D Manhattan MDS models seem to be sufficient for modeling the original data, producing GoFs of 0.76 and 0.81, respectively. Goodness of fit tables for each of these data sets are presented in the Appendix.

5.2 Changing the Dimension

In formulating our problem, we made the assumption that our one day of data is meaningful in the larger scheme of business at QFC. Although no single day can be indicative of the general patterns at the store, we are working under the assumption that there are some trends present in our data that may provide insight into the store in general. We could improve our model by obtaining more data from QFC, at which point we might have more evidence that any relationships we find between departments are accurate. However, we would need a lot of data over a long period of time to proceed in this manner. Seeing how valuable this data probably is to the company, and how difficult it was for us to obtain a single day’s worth of data, this is not a practical way forward. As we saw in Section 5.1, calculating our plots using one dimension and the Manhattan distance produced a high enough goodness of fit.
So, we can perform our scaling again in 1D rather than 2D in an attempt to remove the excess dimension we saw in our original results. As has been the case so far, we expect this relationship to depend on the time of day at which each department does the most business. In any case, we can alter our R code slightly to reflect this change in our model:
library(wordcloud)
items <- read.csv(file = "SalesHourLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "manhattan")
ll <- cmdscale(d, k = 1)
# Column of zeros used to plot a line in one dimension
textplot(ll, c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), items[, 1], yaxt = "n", ann = FALSE)

If we now compare our raw data to our 1D representations, we see a stronger relationship between the dimension and the data itself. Consider first the number of items sold over the course of each hour. The fresh produce department and the dairy department sell the most items at their peaks. This is reflected in the plot, as those two departments are the furthest away from the rest. In fact, if we go through the lines from top to bottom in the plot on the left, we see that this is exactly the order in which the departments appear from right to left in the second plot. We can see the same relationship reflected in the plots for sales per hour and customers per hour:
To verify this trend, notice also that the fresh produce department has the highest peak in the sales data and is further to the right than the dairy department. Similarly, in the customer data, the dairy department is further to the right of the fresh produce department because it serves more customers between 4 p.m. and 5 p.m. So, the distances in our scaled plots seem to correspond to the height of each peak between 4 p.m. and 5 p.m., which provides insight into the maximum activity at what is the busiest hour at QFC. As was the case in the original 2D MDS plots of the data normalized by daily totals, there seems to be something unique about the sushi department in the 1D representations. In particular, this relationship is not immediately obvious from the raw data itself. We can first compare the normalized plot of customers served over the course of the day to the 1D representation of the departments. What stands out in the plot on the right is that the coffee shop and the sushi department are the furthest apart. When we look at the plot on the left, we notice that the coffee shop serves the highest percentage of its total customers early in the day. In particular, it serves the highest percentage of any department between 10 a.m. and 11 a.m., while the sushi department serves none. We know that this particular hour, rather than any of the other morning hours, accounts for the distances in the 1D plot because of the seafood department. That is, the seafood department does not serve its first customer until this hour, and it is closer to the rest of the departments than to the sushi department. If the plot were reflecting the differences at an earlier time, then the seafood department would presumably be right next to the sushi department, since neither serves a customer. This relationship is again apparent in the other two datasets:
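The two normalizations compared throughout this section, by hourly totals and by each department's daily total, can be sketched as follows (toy counts, not the QFC data; Python used only for illustration):

```python
import numpy as np

# Toy table: rows = departments, columns = hours (made-up counts).
counts = np.array([[2.0, 8.0],
                   [6.0, 4.0]])

# Normalize by hourly totals: each column sums to 1, so departments are
# compared by their share of the store's activity within each hour.
by_hour = counts / counts.sum(axis=0, keepdims=True)

# Normalize by each department's daily total: each row sums to 1, so
# departments are compared by how their own activity spreads over the day.
by_day = counts / counts.sum(axis=1, keepdims=True)

print(by_hour)
print(by_day)
```

The first normalization highlights which department dominates a given hour (the peak-hour comparisons above), while the second highlights each department's daily rhythm regardless of its overall volume.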
5.3 Takeaways

We have seen that a 1D representation of our data is the most fitting when the data has been normalized by hourly totals. The GoF values for these three datasets are reasonably high, and the resulting plots accurately reflect the peak activity in each department at the busiest hour. This information can be useful in planning how to organize a store for when the most business is being done. Conversely, 2D representations of the data normalized by daily totals seem to be more useful than the corresponding 1D plots. While the 2D plots have one dimension relating to activity at certain time periods throughout the day (e.g. breakfast time, lunch time, and dinner time) and another relating to the business of departments at one particular hour, the 1D plots only give us insight into the latter. This information is ultimately not helpful in coming to any meaningful conclusions about the activity patterns in each department because we only have data from one day. Despite the superfluous second dimension, the 2D plots still have one useful dimension, whereas the 1D plots do not have any. Hence, we can best utilize our data to evaluate peak traffic between 4 p.m. and 5 p.m. by normalizing by hourly totals and comparing one-dimensional representations of the departments. Additionally, we can see broad trends in business by normalizing our data by daily totals and representing it in two dimensions. In order to verify the apparent trends, though, we would still need to obtain a larger dataset.

6 Conclusion

6.1 Object of study

From the results above, we can see that several departments are similarly busy at the same time, such as the meat, fresh produce, and seafood departments. To avoid congestion in some parts of the grocery store and to maximize the amount of money customers spend in the store, the store owner would do well to separate these departments.

6.2 Limitations

First of all, we only have data for one particular day in that store.
This likely introduces bias into our data and thus makes our model less credible. Also, our data are statistics from the register's point of view: what we have are the actual purchases in each department, which reflect only part of customer behavior. Further, in reality the arrangement of departments may not be flexible. Departments can be restricted by the locations of warehouses or workbenches. For instance, the sushi department needs a workbench to make fresh sushi every day; a department containing heavy items would prefer to be somewhere close to its warehouse; a coffee shop would almost certainly be close to the entrance or exit. For these departments, the location and size are predetermined at the time the store is constructed.

6.3 Future work

In modeling the busyness of departments, we currently rely on the number of customers, the number of items sold, and the revenue in each department. These are basic observations from the registers. What happens before people check out would also be worth considering. If possible,
we could collect data on the time an average customer spends in each department, regardless of whether he/she buys something there. Similarly, the number of customers who physically appear in each department also measures the busyness of that department. In terms of generating a layout that maximizes sales in the store, there are many aspects worth deeper discussion. In addition to the locations of different departments, we could take the sizes of departments into consideration. The detailed placement and sizes of aisles, shelves, and items on each shelf would also have a significant impact on sales in the store. This would be more realistic, since it is easier to make changes to them than to the predetermined locations of departments. Completely different models and more complicated modeling methods would be required to identify the interrelationships of locations and sizes between different aisles and shelves.
Appendix

Link to Google Drive: https://drive.google.com/drive/folders/0B-8II7_BkXIbTmZ0aEREQ2RzSzA?usp=sharing

A.1 Raw data

The printout of the data we got from QFC. There are 10 pages in total, one page for each department.
We converted the data into an Excel table.

Plot of items sold over the course of the day:
Plot of sales made over the course of the day:

Plot of customers served over the course of the day:
A.2 Normalized Data

Normalized-by-hour spreadsheet:

Items normalized by hour:
Sales normalized by hour:

Customers normalized by hour:

Normalized-by-day spreadsheet:
Sales normalized by day:

Customers normalized by day:
A.3 2D MDS Results

2D plot of items sold, normalized by day:

2D plot of sales made, normalized by day:

2D plot of customers served, normalized by day:
2D plot of items sold, normalized by hour:

2D plot of sales made, normalized by hour:
2D plot of customers served, normalized by hour:
A.4 1D MDS Results

1D plot of items sold, normalized by day:

1D plot of sales made, normalized by day:

1D plot of customers served, normalized by day:
1D plot of items sold, normalized by hour:

1D plot of sales made, normalized by hour:
1D plot of customers served, normalized by hour:
A.5 Goodness of Fit Tables

Table 1: Customers by Dept. (or Day)

Dim  Euclidean  Manhattan  Supremum
1    0.3963199  0.4584881  0.3978179
2    0.7600668  0.7807252  0.719283
3    0.8587553  0.8875351  0.866924
4    0.9253756  0.93352    0.9277296
5    0.9536219  0.9620621  0.9548165
6    0.9788168  0.9811779  0.9647046
7    0.9909951  0.99114    0.9672805
8    0.9963022  0.9968744  0.9672805
9    1          0.9968744  0.9672805

Table 2: Customers by Hour

Dim  Euclidean  Manhattan  Supremum
1    0.5850668  0.7160555  0.4549306
2    0.776274   0.8507424  0.6912069
3    0.8732725  0.9249754  0.8543061
4    0.9398181  0.9695998  0.920125
5    0.9792807  0.9774312  0.9595707
6    0.9920085  0.9774312  0.9745372
7    0.9962814  0.9774312  0.9773711
8    0.9983624  0.9774312  0.9773711
9    1          0.9774312  0.9773711

Table 3: Items by Dept. (or Day)

Dim  Euclidean  Manhattan  Supremum
1    0.4293083  0.4496683  0.4001874
2    0.7539912  0.7742156  0.6793937
3    0.8888276  0.9034419  0.8202917
4    0.933347   0.9557358  0.8875166
5    0.9597362  0.9777854  0.9168588
6    0.9839133  0.9903894  0.9333318
7    0.9932868  0.9957496  0.9356707
8    0.9976301  0.9989588  0.9356707
9    1          0.9989588  0.9356707
Table 4: Items by Hour

Dim  Euclidean  Manhattan  Supremum
1    0.6427679  0.757986   0.4270805
2    0.8166962  0.8722931  0.6788556
3    0.9056558  0.9410536  0.8985873
4    0.9580441  0.9824214  0.9485446
5    0.9869886  0.9876273  0.9670813
6    0.9944574  0.9888287  0.9815083
7    0.9974807  0.9892658  0.9870639
8    0.9991767  0.9892658  0.9870639
9    1          0.9892658  0.9870639

Table 5: Sales by Dept. (or Day)

Dim  Euclidean  Manhattan  Supremum
1    0.4846022  0.5108597  0.4146601
2    0.7679321  0.7631536  0.6745012
3    0.8678356  0.8797308  0.8164288
4    0.9385633  0.9405769  0.9091716
5    0.9595338  0.971421   0.9468887
6    0.9772032  0.9925368  0.9649131
7    0.990547   0.9976865  0.9729114
8    0.9965854  0.9997359  0.9729114
9    1          0.9997359  0.9729114

Table 6: Sales by Hour

Dim  Euclidean  Manhattan  Supremum
1    0.5269157  0.6498416  0.3607062
2    0.7274713  0.8119829  0.5933631
3    0.8401425  0.8910053  0.7961126
4    0.9072517  0.9563971  0.8790723
5    0.953552   0.9757764  0.9359311
6    0.9754582  0.9857487  0.9747406
7    0.9928391  0.9885393  0.9834409
8    0.9971796  0.9885393  0.9850726
9    1          0.9885393  0.9850726