Math 381 Project Two
Group 9
Alex Forney
Keren Lai
Gerard Trimberger
Xinyu Zhou
December 7, 2016
1 Introduction
When we buy products at a grocery store, we find that the items we want are usually not located
near each other, and it is common to find one part of the store crowded while other parts have
few customers. This may be because store managers or other higher-ups plan the store layout
with the similarities of products' sales in mind: they may place items that are often purchased
together in locations farther apart in the store. Customers then need to stay in the store longer,
see more items, and potentially purchase more of them. Another benefit may be the reduction
of congestion in departments with popular items. In our project, we seek to find the relationships
between different departments of a grocery store using multidimensional scaling (MDS). We will
plot the activity of 10 different departments (Packaged Produce, Deli, Bakery, Dairy, Meat, Dry
Goods, Fresh Produce, Coffee Shop, Seafood, and Sushi) in order to show the similarities and
differences between them. The results of our study may provide insight into the planning of
grocery stores and/or customer habits.
2 Background
2.1 Idea
We began the brainstorming process by each formulating a list of topics that we were interested in,
both mathematically and socially. We also created a list of our individual skill sets and experience
that we felt were relevant to the project. We then spent time reading through each of our responses
to get an idea of what type of project we could all find interesting. We all agreed that we wanted
to do something related to a common situation that most people experience on a daily basis.
It is always more interesting if people can directly relate to the project rather than working on
something that they do not have personal experience with. Our second criterion was that we each
wanted to do something related to probabilities or Monte Carlo simulation. Keren and Xinyu
are ACMS/Economics double majors, so they were both interested in the processes involved in
economic development.
Our ļ¬rst formulation of the proposal involved comparing the total sales and overall market
share of diļ¬€erent car manufacturers. We wanted to build a Markov chain of diļ¬€erent manufacturer
states and how they relate, in order to predict how the current market share distribution would
change over time. Ultimately, we felt that we would be unable to obtain the necessary data for
an interesting Markov chain, i.e. the number or probability of a car owner moving from one
manufacturer or another. Other outside factors, such as owning multiple cars, created additional
problems that we eventually felt would hinder our progress.
At this point, we decided to switch gears. While keeping the original overarching goals in
mind, specifically a publicly relatable problem and something probability/simulation based, we
formulated a new proposal that involved simulating a grocery store checkout process. We planned
on contacting a local grocery store for real-life customer and item distribution data. Gerard went
into his local QFC on Friday, November 18th. He asked to speak with the manager of the store,
and presented the situation to her, asking specifically if we could obtain some data on customer
checkout times, their number of items, and what type of register (Normal, Express, or Self-
Checkout) they utilized to make their purchase. The manager suggested that he call back
on Saturday (11/19) when the bookkeeper was present, because the bookkeeper is the one with
access to that type of information. When Gerard called back on 11/19 he was informed that the
bookkeeper had called in sick, and that he would either have to call back on Monday or try a
different store. The manager provided a phone number for another store in the region that had
its bookkeepers present on 11/19. Gerard followed through with this lead and presented the
situation to the other store manager. This new store manager did not seem to comprehend the
issue and advised Gerard to contact QFC Corporate for more information. Gerard then called the
Corporate phone number provided and left a message on their answering machine informing them
that we would like to talk as soon as possible. Gerard waited until Monday morning (11/21), and
when he had not heard back from corporate, decided to contact the manager at the local QFC
once again. This time he was able to speak directly to the bookkeeper of the store, and confirmed
that there was customer data available in the computer system, but that it might not be exactly
what we were looking for. He provided his name and number and was told that if he did not hear
back from the store later that day, to come in on Tuesday (11/22). Gerard did not receive a call
during this time, so on Tuesday morning around 10 am he went to the local QFC in person to
observe the situation firsthand.
Upon speaking to the manager, she led Gerard into the backroom of the store and introduced
him to her bookkeeper. From this point, Gerard worked directly with the bookkeeper to obtain
data that he felt could be useful to our project. Gerard was able to obtain an hour-by-hour
breakdown of the activity (i.e. item count, sales amount, and customer count) of each of the 10
departments of the store (packaged produce, deli, bakery, dairy, meat, dry goods, fresh produce,
seafood, coffee, and sushi). Unfortunately, this was not the data that we had originally intended
to receive for our grocery store checkout simulation, but that did not mean it wasn't useful.
We met up as a team and discussed how we wanted to move forward with this new information.
We brainstormed a proposal for a new project that we could formulate, based on the data that we
were provided. We settled on creating an MDS model comparing the different departments on an
hour-by-hour basis, based on their normalized distributions for each indicator. The details of the
model are explained below.
2.2 Similar Models
Multidimensional scaling (MDS) is a set of data analysis techniques that display the structure of
distance-like data as a geometrical picture. Evolving from the work of Richardson [1], Torgerson
proposed the first MDS method and coined the term [2]. MDS is now a general analysis technique
used in a wide variety of fields, such as marketing, sociology, and economics. In 1984, Young and
Hamer published a book on the theory and applications of MDS, in which they presented applications
of MDS in marketing [3].
J.A. Tenreiro Machado and Maria Eugenia Mata from Portugal analyzed world economic
variables using multidimensional scaling [4] in a way that is similar to ours. Tenreiro Machado and
Mata analyzed the evolution of GDP per capita [5], international trade openness, life expectancy,
and tertiary education enrollment in 14 countries from 1977 up to 2012 [6] using the MDS method.
In their study, the objects are country economies characterized by means of a given set of variables
evaluated during a given time period. They calculated the distance between the i-th and j-th objects
by taking the difference of their economic variables over a period of several years. They plotted the
countries on a graph and distinguished them by multiple aspects such as human welfare, quality of
life, and growth rate. Tenreiro Machado and Mata concluded from the graphs that the analysis of
14 countries over the last 36 years under MDS techniques shows that a large gap separates the Asian
partners from converging to the North-American and Western-European developed countries, in
terms of potential welfare, economic development, and social welfare.
The model Tenreiro Machado and Mata use is similar to ours. In our project, the objects are
departments in a grocery store. They studied the differences and similarities between country
economies over years, while we study the differences and similarities between departments over
the hours of a day. In Tenreiro Machado and Mata's research, countries that developed at the same
time are close on the graphs; in our study, the store departments that are busy at the same time
are close on the graphs. However, the dataset of our project is much smaller than theirs. We
compared departments using data on the number of items sold, the number of customers, and
the total sales in a given time period. Tenreiro Machado and Mata's data is higher-dimensional,
covering GDP per capita, economic openness, life expectancy, tertiary education, etc. Our project
also studies the similarity of busyness from another angle: the percentage of each department's
sales in a given hour.
2.3 Similar Problems
The objective of our project is to help grocery store owners plan the layout of the different blocks
of the store and increase the store's sales by finding the interrelationships in busyness between
products from different departments.
The problem of how to lay out a grocery store to maximize the purchases of the average customer
is discussed in many works, from the perspectives of both merchandising and mathematics. As mentioned
in one article, grab-and-go items such as bottled water and snacks should be placed near the
entrance; the deli and coffee bar should be placed in one of the front corners to attract hungry
customers; and cooking ingredients and canned goods should be placed in the center aisles to draw
customers deeper into the store and past nonessential items [8]. There are also many economists
and mathematicians working on similar problems. In the paper written by Boros, P., Fehér, O., and
Lakner, Z., the traveling salesman problem (TSP) was used to maximize the shortest walking distance
for each customer under different arrangements of the departments in the store [9]. The results
showed that the total walking distances of customers increased in the proposed new layout [9]. Chen
Li from the University of Pittsburgh modeled the department allocation design problem as a multiple
knapsack problem and optimized the adjacency preferences of departments to get the maximum
possible exposure of items in the store, attempting to produce an effective layout [10]. A similar
optimization was used in the paper by Elif Ozgormus from Auburn University [11]. To assess the
revenue of a store layout, she used stochastic simulation and classified departments into groups
from which customers often purchase items concurrently [11]. Subject to constraints on space, unit
revenue production, and department adjacency in the store, she optimized impulse purchases and
customer satisfaction to arrive at a desired layout [11].
All three papers have basic objectives similar to ours. The paper by Boros et al. aimed to
maximize the total walking distance of each customer and thus promote the store's sales [9]. Li's
paper also focused on profit maximization, but with consideration of the exposure of items and the
adjacencies between departments [10]. He was the first to incorporate aisle structure, department
allocation, and departmental layout together into one comprehensive study [10]. The paper by
Ozgormus took revenue and adjacency into consideration and developed the model specifically for
grocery stores, with the objectives of maximizing revenue and adjacency satisfaction [11]. In our
paper, we simply focus on the busyness of different departments and use multidimensional scaling
to model the similarities between departments, and thus provide evidence useful for designing
an efficient and profitable layout. Instead of having data on comprehensive customer behavior in
the store, we have sales data from the register's point of view.
3 The Model
As a result of the data acquisition process described in the Background section, we were able to
obtain an hourly breakdown of the number of items, total sales, and number of customers purchasing
items at the local QFC from which we collected. The data presents a 24-hour snapshot
of a standard day within the grocery store. The data was presented in individual printouts of each
department's activity for the day, so the first step was to transcribe all of the information
from physical paper form into an Excel spreadsheet. The results are presented in the Appendix.
The next step was to separate and normalize each of the different activity indicators based
on their departmental, as well as hourly, totals. In this way, we transformed the raw data into
standardized distributions whose entries summed to one. Specifically, we separated
the data into three different 24 × 10 matrices (i.e. items, sales, and customers), where the rows
of each matrix represent the hourly data for a 24-hour time period and the columns represent
each of the 10 departments. For each of these matrices we normalized each entry by its daily
departmental total, i.e. for each department (or column) we divided each entry in the column by
the summed total of the column:
MATLAB Code:
for i = 1:10
items_normD(:,i) = items_raw(:,i)/sum(items_raw(:,i));
sales_normD(:,i) = sales_raw(:,i)/sum(sales_raw(:,i));
cust_normD(:,i) = cust_raw(:,i)/sum(cust_raw(:,i));
end
Additionally, we normalized each of the 24 rows (hourly data) by the row sum of the activity for
that particular hour throughout all departments:
MATLAB Code:
for i = 1:24
items_normH(i,:) = items_raw(i,:)/sum(items_raw(i,:));
sales_normH(i,:) = sales_raw(i,:)/sum(sales_raw(i,:));
cust_normH(i,:) = cust_raw(i,:)/sum(cust_raw(i,:));
end
These calculations were performed on a mid-2010 MacBook Pro, running Windows 7 SP1, in
MATLAB R2016b Student edition, and were instantaneous. This normalization process resulted
in 6 different datasets of customer activity, i.e. the number of items, sales, and the number of
customers, each normalized by its daily departmental totals and additionally by its hourly store
totals. We ran each of these data sets through the distance calculations, described below, in order
to generate different variations of the information, ultimately in search of the best "goodness of fit."
In order to create an MDS model of the above-mentioned data sets, our next step was to run
each data set through our distance algorithm in order to calculate a one-dimensional distance
between different departments. In other words, we iterated through each of the departments, a,
and compared them to each of the other departments', b, hourly customer activity. We utilized
the Minkowski distance formula for our distance calculations [7]:

distance = ( Σ_{i=1}^{24} |r_{a,i} − r_{b,i}|^p )^{1/p}

where i represents the hourly time period (e.g. i = 1 represents 12 o'clock AM to 1 o'clock AM),
a and b represent each of the different departments, and p represents the power of the Minkowski
algorithm. The most commonly considered powers are p = 1, 2, and ∞. A power of 1 is commonly
referred to as the Manhattan distance, a power of 2 as the Euclidean distance, and a power of ∞
as the supremum distance. We used R version 3.3.2 on a Late 2013 MacBook Pro running macOS
10.12.1 to carry out our calculations, which ran instantly. Specifically, we ran the following
commands in R:
library(readr)
library(wordcloud)
items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "euclidean")
ll <- cmdscale(d, k = 2)
textplot(ll[,1], ll[,2], items[,1], ann = FALSE)
Step-by-step, here is what the commands do:
library(readr)
library(wordcloud)
These commands import libraries that allow us to read the CSV file and create the plot.
items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")
This command reads the formatted 24-dimensional vectors corresponding to each department
from the file "ItemsHourLabel.csv" into a table called "items". The file "ItemsHourLabel.csv"
consists of rows that look like this:
Department,00:00 - 01:00,01:00 - 02:00,02:00 - 03:00,03:00 - 04:00,...
Packaged Produce,0,0,0,0.011299,0,0,0.022599,0.00565,0.022599,...
Deli,0.006135,0,0,0,0,0.02454,0.006135,0.02454,0.018405,0.02454,...
Bakery,0.001661,0,0,0,0,0.021595,0.019934,0.059801,0.043189,...
...
In this case, each row represents the number of items sold in each department in a given hour
divided by the total number of items sold in the department over the course of the day. The
department names at the beginning of each row are used for the graphic output.
d <- dist(items, method = "euclidean")
This command takes the table "items" and creates a matrix of distances between every row of the
table. Here, the distance method is specified as "euclidean", which means that the distance between
rows a and b will be calculated as

d_{ab} = ( Σ_{i=1}^{24} |r_{a,i} − r_{b,i}|^2 )^{1/2}.
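The values dist produces can be checked by hand against the Minkowski formula. As a minimal Python sketch (with two short illustrative profiles standing in for the 24-hour department vectors), covering all three powers discussed above:

```python
def minkowski(r_a, r_b, p):
    """Minkowski distance between two equal-length activity profiles.
    p = 1 is the Manhattan distance, p = 2 the Euclidean distance, and
    p = float('inf') the supremum (maximum) distance."""
    diffs = [abs(x - y) for x, y in zip(r_a, r_b)]
    if p == float('inf'):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Two illustrative hourly profiles (each sums to 1, like the normalized data).
a = [0.1, 0.4, 0.5]
b = [0.3, 0.3, 0.4]

print(minkowski(a, b, 1))             # sum of absolute differences
print(minkowski(a, b, 2))             # square root of the sum of squares
print(minkowski(a, b, float('inf')))  # largest single hourly difference
```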
ll <- cmdscale(d, k = 2)
Here, k = 2 specifies a two-dimensional model. The output is a list of two-dimensional coordinates,
one for each object in the original set:
> head(ll, 10)
[,1] [,2]
[1,] -0.032088329 0.01770756
[2,] -0.027631806 0.02097795
[3,] -0.028511119 0.05441644
[4,] -0.013549396 -0.01713736
[5,] -0.086806729 -0.06648990
[6,] -0.007476898 -0.01173682
[7,] -0.010818238 -0.02144684
[8,] -0.001610913 0.18130208
[9,] -0.045186100 -0.12261632
[10,] 0.253679528 -0.03497679
textplot(ll[,1], ll[,2], items[,1], ann = FALSE)
This command plots the result with the names of the departments. ll[,1], ll[,2] specifies
that the first column of ll gives the x-coordinates and the second column gives the y-coordinates.
items[,1] specifies that the first column of the table "items" gives the labels for the data points.
ann = FALSE removes the x and y labels from the plot. The results of these commands are
presented in the following section.
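Under the hood, cmdscale performs classical (Torgerson) scaling: it double-centers the squared distance matrix, eigendecomposes it, and keeps the leading coordinates. A minimal Python sketch of that procedure, applied to a made-up distance matrix (four collinear points) rather than our department data:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS, the algorithm behind R's cmdscale."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered matrix
    vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:k]      # indices of the top-k eigenvalues
    # Coordinates: eigenvectors scaled by square roots of eigenvalues
    # (negative eigenvalues clamped to zero).
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

# Made-up example: four points on a line with pairwise distances |i - j|.
pts = np.arange(4.0)
D = np.abs(pts[:, None] - pts[None, :])

coords = classical_mds(D, k=1)
# A 1-D embedding of collinear points reproduces the original distances.
D_hat = np.abs(coords[:, 0][:, None] - coords[:, 0][None, :])
print(np.allclose(D_hat, D))  # prints True
```

For our 10 × 10 department distance matrix, cmdscale(d, k = 2) returns the analogous 10 × 2 coordinate matrix.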
4 Results
4.1 Hourly
In order to draw conclusions about the two-dimensional representation of our data, we can compare
it to the original data after it has been normalized by the hourly store totals. The first result
from these datasets is the 2D plot of the items per hour:
We immediately see that the dairy department and the fresh produce department differ from the rest
of the data. Similarly, the coffee shop and the bakery differ significantly. We then wish to find two
differences in the data that may be causing these separations and can be used as the dimensions of
our plot. A plot of the items sold over the course of the day in each department follows:
We can see that the dairy department and the fresh produce department both sell more than double
any other department at their respective peaks, which occur at approximately the same time of
day. So, the horizontal dimension of our 2D representation of the data corresponds to this
large peak between the hours of 12 p.m. and 8 p.m. This is further supported by the fact that
the dry goods department and the bakery follow this trend to a lesser degree (less than dairy
and fresh produce but more than the other departments), so they are closer to the right side
of our plot. Nothing immediately stands out from the raw data indicating that the coffee
shop and the bakery differ from the rest of the departments in any meaningful way. We can
instead look at the normalized data to see what may be the cause of this vertical distance in the plot:
Here, we see that the coffee shop and the bakery sell the majority of the total items sold in the store
between about 6 a.m. and 9 a.m. This does seem to make sense, as many people may be purchasing
coffee and/or baked goods in the morning for breakfast. However, this second dimension tells us
that the departments differ in the times at which they are most active, which we already knew
from our first dimension and the fact that our data is separated by departments and time intervals.
Consequently, this second dimension is not very useful.
Examining the other two 2D plots of the data normalized by hourly totals, i.e. sales and customer
count, leads to similar conclusions. That is, the axes of the plots depend on the times
at which business activity spikes in each department. If we now consider the example of the sales,
we see that the 2D representation is essentially the same as with the previous dataset:
While the distances are altered slightly, the plot is otherwise simply inverted. The results for the
customer data are very similar and are included in the Appendix.
4.2 Daily
Similar to when the data was normalized by the hourly totals, the 2D representations of our data
normalized by daily totals exhibit a relationship between departments that are busiest at the same
times:
For example, in the above plot of the items sold in each department, we see that the coffee shop
is far away from the seafood department. Looking at the raw data of the number of items sold
per department over the course of the day (included above in this section), there does not seem to
be anything contrasting the coffee shop and seafood in any meaningful way. Instead, we can look
directly at the normalized data:
We can see that the coffee shop is busiest early in the day, between 9 a.m. and 12 p.m., with
another spike around 2 p.m. Conversely, the seafood department does the most business between
3 p.m. and 6 p.m. The rest of the departments, other than the sushi department, seem to increase
their business steadily throughout the day and peak in the late afternoon. This leads us to the
conclusion that one axis in our plots corresponds to the time at which each department does most
of its business. However, there is also a second dimension that appears to depend only on sushi.
Looking at the 2D representation of the sales over the course of the day, we again see this strange
distance between the sushi department and the rest of the store:
When we look at the raw sales data for the sushi department, the only aspects that stand out are
the fact that the department does relatively little business and that it has only three time periods
with any transactions at all. There are two spikes around lunch time and again around dinner
time, but there is also a single sushi sale between midnight and 1 a.m.
The "sushi dimension" could be a result of either the two periods of activity or the fact that the
sushi department is one of the only departments to make a sale at the late hour. The former does
not seem to be the case because all of the departments go through a rise and fall of sales over
the course of a day. Alternatively, if the latter is true, the ā€œsushi dimensionā€ is not particularly
interesting since we are only analyzing one dayā€™s worth of data and the single sale is more than
likely not indicative of a trend of late night sushi purchases. In either case, the second dimension
of our plot is not really helpful in determining the similarity of any two departments. So, we can
perform another dimension reduction in order to create a one-dimensional model for our data.
The plot of customer data was omitted from the discussion because of its similarity to the item
and sales data sets. The results are presented in the Appendix. Our next step was to consider
adjustments to the Minkowski powers and MDS dimensions in our model.
5 Adjustments and Extensions
5.1 Goodness of Fit
Our ultimate goal in generating different variations of the MDS model was to find a model with the
optimal "goodness of fit" (GoF) for each of the above-mentioned data sets. Goodness of fit is a
measure of how well the MDS model fits the original data given a choice of MDS dimension and
Minkowski power. For each of the different customer activities (items, sales, and customers), and
for the two different normalization methods, by hour and by department (or by day), we evaluated
how changing the MDS dimension and the Minkowski power affected the goodness of fit
of our model. We considered each of the MDS dimensions between 1 and 9 because our model
contained 10 departments. As the dimension of our MDS model increases, we expected to see the
goodness of fit increase accordingly. We also considered the 3 most common Minkowski powers:
p = 1, which corresponds to the Manhattan distance or 1-norm; p = 2, which corresponds to the
Euclidean distance or 2-norm; and p = ∞, which corresponds to the maximum distance or infinity
norm.
We can use R to find the GoF data in a similar fashion to how we obtained our original model.
The entire code is included below:
library(wordcloud)
items <- read.csv(file = "ItemsDayLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "euclidean")    # 2-norm
# d <- dist(items, method = "manhattan")  # 1-norm
# d <- dist(items, method = "maximum")    # sup norm
cmdscale(d, k = 1, eig = TRUE)$GOF        # k is the dimension
We can choose one of the three distance measures depending on which norm we are testing.
Similarly, we can use the following command to change dimensions:
cmdscale(d, k = 1, eig = TRUE)$GOF
This command returns a goodness of fit value between 0 and 1, where a value of 1 indicates a
perfect fit, or direct correlation, and a value of 0 indicates uniform randomness. k = 1 corresponds
to the dimension of the MDS model, which we let range from 1 to n − 1 = 9, where n = 10 is the
number of departments. The results are presented in the graphs below:
For the customer data, we can see that a Minkowski power of 1 (Manhattan) seems to produce
models with the best goodness of fit over most MDS dimensions. In other words, the red line
is consistently higher than the rest. Next, we are interested in finding the lowest MDS dimension
that sufficiently models the data. For the customers by department (or day) data, we see that
a dimension of 1 leads to a GoF of about 0.46. While this is acceptable in some situations, we
also noticed that raising the dimension of our MDS model to 2 increases the GoF to 0.78.
Therefore, to optimize our MDS model for this particular data set, we chose a Minkowski power
of 1 and an MDS dimension of 2. On the other hand, if we examine the customers by hour plot,
we can see that the Manhattan distance in a 1-D MDS model produces a goodness of fit of 0.72.
Therefore, this particular set of choices is sufficient to capture the inherent trends present within
our original data set.
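For reference, the GoF value that cmdscale(..., eig = TRUE)$GOF reports is the share of the eigenvalue mass of the double-centered matrix captured by the first k axes (R actually reports two closely related variants). A Python sketch of that ratio, using an illustrative non-Euclidean distance matrix rather than our store data and clamping negative eigenvalues to zero:

```python
import numpy as np

# Illustrative distance matrix: shortest-path distances on a 4-cycle.
# These are not Euclidean distances, so some eigenvalues come out
# negative; this is a stand-in, not the QFC data.
D = np.array([
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 2.0, 1.0, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # double-centered matrix
vals = np.sort(np.linalg.eigvalsh(B))[::-1]  # eigenvalues, descending

def gof(k):
    # Share of the positive eigenvalue mass captured by the top-k axes,
    # with negative eigenvalues clamped to zero.
    return np.maximum(vals[:k], 0.0).sum() / np.maximum(vals, 0.0).sum()

print(gof(1), gof(2))  # GoF rises as the MDS dimension k grows
```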
We noticed a similar trend in the items and sales data. The Manhattan distance calculation,
i.e. p = 1, seems to produce the best GoF over most of the MDS dimensions between 1 and 9.
Examining the plots of items and sales by department, we see that 1-D MDS models do not
sufficiently encapsulate the multi-dimensional interactions present in these data sets, producing
GoF values of 0.45 and 0.51, respectively. However, if we examine the GoF for these data sets in a
2-D MDS model, 0.77 and 0.76 respectively, we can see that there is a significant increase,
indicating that a 2-D MDS model is a significantly better fit for these data sets. Additionally,
if we examine the items and sales per hour, we notice that the 1-D Manhattan MDS models seem
to be sufficient for modeling the original data sets, producing GoF values of 0.76 and 0.81,
respectively. Goodness of fit tables for each of these data sets are presented in the Appendix.
5.2 Changing the Dimension
In formulating our problem, we made the assumption that our one day of data is meaningful in the
larger scheme of business at QFC. Although no single day can be indicative of the general patterns
at the store, we are working under the assumption that there are some trends present in our data
that may provide insight into the store in general. We could improve our model by obtaining more
data from QFC, at which point we might have more evidence that any relationships we find
between departments are accurate. However, we would need a lot of data over a long period
of time in order to proceed in this manner. Seeing how valuable this data probably is to the
company and how difficult it was for us to obtain even a single day's worth, this is not a
practical way forward.
As we saw in Section 5.1, calculating our plots using one dimension and the Manhattan distance
seemed to produce a high enough goodness of fit. So, we can perform our scaling again in 1D rather
than 2D in an attempt to remove the excess dimension we saw in our original results. As has been
the case so far, we expect the remaining dimension to depend on the time of day at which each
department does the most business. In any case, we can alter our R code slightly to reflect this
change in our model:
library(readr)
library(wordcloud)
items <- read.csv(file = "SalesHourLabel.csv", head = TRUE, sep = ",")
d <- dist(items, method = "manhattan")
ll <- cmdscale(d, k = 1)
# Column of zeros used to plot a line in one dimension
textplot(ll, c(0,0,0,0,0,0,0,0,0,0), items[,1], yaxt = 'n', ann = FALSE)
If we now compare our raw data to our 1D representations, we see a stronger relationship between
the dimension and the data itself. Consider first the number of items sold over the course of each
hour:
The fresh produce department and the dairy department sell the most items at their peaks. This
is reflected in the plot, as those two departments are the furthest away from the rest. In fact, if we
go through the lines from top to bottom in the plot on the left, we will see that this is exactly the
order in which the departments appear from right to left in the second plot. We can see the same
relationship reflected in the plots for sales per hour and customers per hour:
To verify this trend, also notice that the fresh produce department has the highest peak in the sales
data and is further to the right than the dairy department. Similarly, in the customer data, the
dairy department is further to the right of the fresh produce department because the number of
customers it serves between the hours of 4 p.m. and 5 p.m. is greater. So, the distances in our
scaled plots seem to correspond to the height of each peak between 4 p.m. and 5 p.m., which
provides insight into the maximum activity at what is the busiest hour at QFC.
As was the case in the original 2D MDS plots of the data normalized by daily totals, there
seems to be something unique about the sushi department in the 1D representations. In particular,
this relationship is not immediately obvious from the raw data itself. We can first compare the
normalized plot of customers served over the course of the day to the 1D representation of the
departments:
What stands out in the plot on the right is the fact that the coffee shop and the sushi department
are the furthest apart. When we look at the plot on the left, we notice that the coffee shop serves
the highest percentage of its total customers early in the day. In particular, it serves the highest
percentage of any department between 10 a.m. and 11 a.m., while the sushi department serves
none. We know that this particular hour, rather than any of the other morning hours, accounts for
the distances in the 1D plot because of the seafood department. That is, the seafood department
does not serve its first customer until this hour, and it is closer to the rest of the departments than
to the sushi department. If the plot were reflecting the differences at an earlier time, then the
seafood department would presumably be right next to the sushi department, since neither serves
a customer. This relationship is again apparent in the other two datasets:
5.3 Takeaways
We have seen that a 1D representation of our data fits best when the data have been normalized
by hourly totals. The GoF values for these three datasets are reasonably high, and the resulting
plots accurately reflect each department's peak activity at the busiest hour. This information
can be useful in planning how to organize a store for the hours when the most business is being done.
For the data normalized by daily totals, on the other hand, 2D representations seem to be more
useful than the 1D plots. While the 2D plots have one dimension relating to activity at certain
periods of the day (e.g., breakfast time, lunch time, and dinner time) and another relating to the
busyness of departments at one particular hour, the 1D plots only give us insight into the latter.
That information alone is not enough to draw meaningful conclusions about the activity patterns
in each department, because we only have data from one day. Despite the superfluous second
dimension, the 2D plots still have one useful dimension, whereas the 1D plots have none.
Hence, we can best use our data to evaluate peak traffic between 4 p.m. and 5 p.m. by
normalizing by hourly totals and comparing one-dimensional representations of the departments.
Additionally, we can see broad trends in busyness by normalizing our data by daily totals and
representing it in two dimensions. In order to verify the apparent trends, though, we would still
need to obtain a larger dataset.
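For reference, GoF values like those reported in the appendix tables are commonly computed for classical MDS as the share of the first k eigenvalues of the double-centered squared-distance matrix (this is, for example, what R's cmdscale reports). We are assuming the report used an equivalent definition; the following is an illustrative sketch, not the group's actual code.

```python
import numpy as np

def mds_gof(D, k):
    """Goodness of fit of a k-dimensional classical MDS embedding.

    One common convention: sum of the k largest (non-negative)
    eigenvalues of the double-centered matrix, divided by the sum
    of the absolute values of all eigenvalues.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals = np.sort(np.linalg.eigvalsh(B))[::-1]   # descending order
    return np.clip(vals[:k], 0, None).sum() / np.abs(vals).sum()

# Toy distance matrix for three points on a line: a 1D embedding is
# already exact, so the GoF reaches 1 at k = 1.
D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
print(mds_gof(D, 1))
```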
6 Conclusion
6.1 Object of study
From the results above, we can see that several departments, such as the meat, fresh produce,
and seafood departments, are busy at the same times. To relieve congestion in parts of the
grocery store and to maximize the amount of money customers spend in the store, the store
owner would do well to separate these departments.
6.2 Limitations
First of all, we only have data for one particular day in one store. This introduces bias into
our data and makes our model less credible. Also, our data are statistics from the register's
point of view: what we have are the actual purchases in each department, which capture only
part of customer behavior. Further, in reality the arrangement of departments may not be
flexible. Departments can be restricted by the locations of warehouses or workbenches. For
instance, the sushi department needs a workbench to make fresh sushi every day; a department
containing heavy items would prefer to be close to its warehouse; a coffee shop would almost
certainly be close to the entrance or exit. For these departments, the location and size are
predetermined when the store is built.
6.3 Future work
In terms of modeling the busyness of departments, our model is currently based on the number
of customers, the number of items sold, and the revenue in each department. These are basic
observations from the registers. What happens before people check out would also be worth
considering. If possible, we could collect data on the time an average customer spends in each
department, regardless of whether he/she buys something there. Similarly, the number of
customers who physically appear in each department would also measure the busyness of that
department.
In terms of generating a layout that maximizes sales in the store, there are many aspects
worth deeper discussion. In addition to the locations of different departments, we could take the
sizes of departments into consideration. The detailed placement and sizing of aisles, shelves, and
items on each shelf would also have a significant impact on sales. This would be more realistic,
since it is easier to make changes to these than to the predetermined locations of departments.
Completely different models and more complicated modeling methods would be required to identify
the interrelationships of locations and sizes between different aisles and shelves.
References
[1] Richardson, M. W. (1938). Psychological Bulletin, 35, 659-660.
[2] Torgerson, W. S. (1952). Psychometrika, 17, 401-419. (The first major MDS breakthrough.)
[3] Young, F. W. (1984). Research Methods for Multimode Data Analysis in the Behavioral
Sciences. H. G. Law, C. W. Snyder, J. Hattie, and R. P. MacDonald, eds. (An advanced
treatment of the most general models in MDS. Geometrically oriented. Interesting political
science example of a wide range of MDS models applied to one set of data.)
[4] Machado, J. T., Mata, M. E. (2015). Analysis of World Economic Variables Using Multidimensional
Scaling. PLOS ONE 10(3): e0121277.
http://dx.doi.org/10.1371/journal.pone.0121277
[5] Anand, S., Sen, A. (2000). The Income Component of the Human Development Index. Journal of
Human Development, 1.
[6] World Development Indicators, The World Bank, Time series, 17-Nov-2016.
http://data.worldbank.org/data-catalog/world-development-indicators
[7] Wikipedia contributors. "Minkowski distance." Wikipedia, The Free Encyclopedia, 1 Nov. 2016.
Web. 1 Nov. 2016.
https://en.wikipedia.org/w/index.php?title=Minkowski_distance&oldid=747257101
[8] Editors of Real Simple. "The Secrets Behind Your Grocery Store's Layout." Real Simple, 2012.
Web. 29 Nov. 2016.
http://www.realsimple.com/food-recipes/shopping-storing/more-shopping-storing/grocery-store-layout
[9] Boros, P., Fehér, O., Lakner, Z., et al. (2016). Ann Oper Res 238: 27. doi:10.1007/s10479-015-1986-2.
http://link.springer.com/article/10.1007/s10479-015-1986-2
[10] Li, Chen. "A Facility Layout Design Methodology for Retail Environments." D-Scholarship,
3 May 2010. Web. 29 Nov. 2016.
http://d-scholarship.pitt.edu/9670/1/Dissertation_ChenLi_2010.pdf
[11] Ozgormus, Elif. "Optimization of Block Layout for Grocery Stores." Auburn University,
9 May 2015. Web. 29 Nov. 2016.
https://etd.auburn.edu/bitstream/handle/10415/4494/Eozgormusphd.pdf;sequence=2
Appendix
Link to Google Drive:
https://drive.google.com/drive/folders/0B-8II7_BkXIbTmZ0aEREQ2RzSzA?usp=sharing
A.1 Raw data
The printout of the data we obtained from QFC. There are 10 pages in total, one page per department.
We converted the data into an Excel table:
Plot of items sold over the course of the day:
Plot of sales made over the course of the day:
Plot of customers served over the course of the day:
A.2 Normalized Data
Normalized-by-hour spreadsheet:
Items normalized by hour:
Sales normalized by hour:
Customers normalized by hour:
Normalized-by-day spreadsheet:
Items normalized by day:
Sales normalized by day:
Customers normalized by day:
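The two normalizations above can be sketched in a few lines of NumPy (toy counts below, not the QFC data): normalizing by day divides each department's column by its daily total, while normalizing by hour divides each hour's row by that hour's total across departments.

```python
import numpy as np

# Toy raw counts: rows = hours, columns = departments.
raw = np.array([[2.0, 8.0],
                [6.0, 4.0]])

# Normalize by day: each column sums to 1, giving each department's
# share of its own daily total in every hour.
norm_day = raw / raw.sum(axis=0, keepdims=True)

# Normalize by hour: each row sums to 1, giving each department's
# share of all activity in that hour.
norm_hour = raw / raw.sum(axis=1, keepdims=True)

print(norm_day)
print(norm_hour)
```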
A.3 2D MDS Results
2D plot of items sold, normalized by day:
2D plot of sales made, normalized by day:
2D plot of customers served, normalized by day:
2D plot of items sold, normalized by hour:
2D plot of sales made, normalized by hour:
2D plot of customers served, normalized by hour:
A.4 1D MDS Results
1D plot of items sold, normalized by day:
1D plot of sales made, normalized by day:
1D plot of customers served, normalized by day:
1D plot of items sold, normalized by hour:
1D plot of sales made, normalized by hour:
1D plot of customers served, normalized by hour:
A.5 Goodness of Fit Tables
Table 1: Customers by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.3963199 0.4584881 0.3978179
2 0.7600668 0.7807252 0.719283
3 0.8587553 0.8875351 0.866924
4 0.9253756 0.93352 0.9277296
5 0.9536219 0.9620621 0.9548165
6 0.9788168 0.9811779 0.9647046
7 0.9909951 0.99114 0.9672805
8 0.9963022 0.9968744 0.9672805
9 1 0.9968744 0.9672805
Table 2: Customers by Hour
Dim. Euclidean Manhattan Supremum
1 0.5850668 0.7160555 0.4549306
2 0.776274 0.8507424 0.6912069
3 0.8732725 0.9249754 0.8543061
4 0.9398181 0.9695998 0.920125
5 0.9792807 0.9774312 0.9595707
6 0.9920085 0.9774312 0.9745372
7 0.9962814 0.9774312 0.9773711
8 0.9983624 0.9774312 0.9773711
9 1 0.9774312 0.9773711
Table 3: Items by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.4293083 0.4496683 0.4001874
2 0.7539912 0.7742156 0.6793937
3 0.8888276 0.9034419 0.8202917
4 0.933347 0.9557358 0.8875166
5 0.9597362 0.9777854 0.9168588
6 0.9839133 0.9903894 0.9333318
7 0.9932868 0.9957496 0.9356707
8 0.9976301 0.9989588 0.9356707
9 1 0.9989588 0.9356707
Table 4: Items by Hour
Dim. Euclidean Manhattan Supremum
1 0.6427679 0.757986 0.4270805
2 0.8166962 0.8722931 0.6788556
3 0.9056558 0.9410536 0.8985873
4 0.9580441 0.9824214 0.9485446
5 0.9869886 0.9876273 0.9670813
6 0.9944574 0.9888287 0.9815083
7 0.9974807 0.9892658 0.9870639
8 0.9991767 0.9892658 0.9870639
9 1 0.9892658 0.9870639
Table 5: Sales by Dept. (or Day)
Dim. Euclidean Manhattan Supremum
1 0.4846022 0.5108597 0.4146601
2 0.7679321 0.7631536 0.6745012
3 0.8678356 0.8797308 0.8164288
4 0.9385633 0.9405769 0.9091716
5 0.9595338 0.971421 0.9468887
6 0.9772032 0.9925368 0.9649131
7 0.990547 0.9976865 0.9729114
8 0.9965854 0.9997359 0.9729114
9 1 0.9997359 0.9729114
Table 6: Sales by Hour
Dim. Euclidean Manhattan Supremum
1 0.5269157 0.6498416 0.3607062
2 0.7274713 0.8119829 0.5933631
3 0.8401425 0.8910053 0.7961126
4 0.9072517 0.9563971 0.8790723
5 0.953552 0.9757764 0.9359311
6 0.9754582 0.9857487 0.9747406
7 0.9928391 0.9885393 0.9834409
8 0.9971796 0.9885393 0.9850726
9 1 0.9885393 0.9850726
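The three metrics in these tables are Minkowski distances with p = 2 (Euclidean), p = 1 (Manhattan), and p = infinity (supremum). A minimal sketch of a pairwise distance computation under each, on made-up vectors rather than the report's data:

```python
import numpy as np

def pairwise(X, p):
    """Pairwise Minkowski distances between the rows of X.

    p = 1 gives Manhattan, p = 2 Euclidean, p = np.inf supremum.
    """
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(axis=2)
    return (diff ** p).sum(axis=2) ** (1.0 / p)

X = np.array([[0.0, 0.0],
              [3.0, 4.0]])
print(pairwise(X, 2)[0, 1])       # Euclidean: 5.0
print(pairwise(X, 1)[0, 1])       # Manhattan: 7.0
print(pairwise(X, np.inf)[0, 1])  # supremum: 4.0
```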

  • 1. Math 381 Project Two Group 9 Alex Forney Keren Lai Gerard Trimberger Xinyu Zhou December 7, 2016 1
  • 2. 1 Introduction When we buy products in grocery store, we ļ¬nd the things we want to buy are usually not located near each another, and it is common to ļ¬nd that one part of the store is crowded while others have few customers. This may be because store managers or other higher-ups plan the store layout while taking into consideration the similarities of productsā€™ sales. He/she may place items often purchased together in locations farther apart in the store. So, customers may need to stay in the store longer, resulting in these customers seeing more items and potentially purchasing them. Another added beneļ¬t may be the reduction of congestion is departments with popular items. In our project, we seek to ļ¬nd the relationships between diļ¬€erent departments of a grocery store using multidimensional scaling (MDS). We will plot the activity of 10 diļ¬€erent departments (Packaged Produce, Deli, Bakery, Dairy, Meat, Dry goods, Fresh Produce, Coļ¬€ee shop, Seafood, and Sushi) in order to show the similarities and diļ¬€erences between them. The result of our study may provide insight into the planning of grocery stores and/or customer habits. 2 Background 2.1 Idea We began the brainstorming process by each formulating a list of topics that we were interested in, both mathematically and socially. We also created a list of our individual skill sets and experience that we felt was relevant to the project. We then spent time reading through each of our responses to get an idea of what type of project we could all ļ¬nd interesting. We all agreed that we wanted to do something related to a common situation that most people experience on a daily basis. It is always more interesting if people can directly relate to the project rather than working on something that they do not have personal experience with. Our second criteria, was that we each wanted to do something related to probabilities or Monte Carlo simulation. 
Keren and Xinyu are ACMS/Economics double majors so they were both interested in the processes involved in economic development. Our ļ¬rst formulation of the proposal involved comparing the total sales and overall market share of diļ¬€erent car manufacturers. We wanted to build a Markov chain of diļ¬€erent manufacturer states and how they relate, in order to predict how the current market share distribution would change over time. Ultimately, we felt that we would be unable to obtain the necessary data for an interesting Markov chain, i.e. the number or probability of a car owner moving from one manufacturer or another. Other outside factors, such as owning multiple cars, created additional problems that we eventually felt would hinder our progress. At this point, we decided to switch gears. While keeping the original overarching goals in mind, speciļ¬cally a publicly relatable problem and something probability/simulation based, we formulated a new proposal that involved simulating a grocery store checkout process. We planned on contacting a local grocery store for real-life customer and item distribution data. Gerard went into his local QFC on Friday, November 18th. He asked to speak with the manager of the store, and presented the situation to her, asking speciļ¬cally if we could obtain some data for customer checkout times, their number of items, and what types of register (Normal, Express, or Self- Checkout) that they utilized to make their purchase. The manager suggested that he call back on Saturday (11/19) when the bookkeeper was present, because the bookkeeper is the one with access to that type of information. When Gerard called back on 11/19 he was informed that the bookkeeper had called in sick, and that he would either have to call back on Monday or to try a diļ¬€erent store. The manager provided a phone number to another store in the region that had their bookkeepers present on 11/19. 
Gerard followed through with this lead and presented the situation to the other store manager. This new store manager did not seem to comprehend the issue and advised Gerard to contact QFC Corporate for more information. Gerard then called the Corporate phone number provided and left a message on their answering machine informing them that we would like to talk as soon as possible. Gerard waited until Monday morning (11/21), and when he had not heard back from corporate, decided to contact the manager at the local QFC once again. This time he was able to speak directly to the bookkeeper of the store, and conļ¬rmed that there was customer data available in the computer system but that it may not be exactly what we were looking for. He provided his name and number and was told that if he did not hear back from the store later that day, to come in on Tuesday (11/22). Gerard did not receive a call 2
  • 3. during this time, so on Tuesday morning around 10 am he went in to the local QFC in person to observe the situation ļ¬rsthand. Upon speaking to the manager, she led Gerard into the backroom of the store and introduced him to her bookkeeper. From this point, Gerard worked directly with the bookkeeper to obtain data that he felt could be useful to our project. Gerard was able to obtain an hour by hour breakdown of the activity (i.e. item count, sales amount, and customer count) of each of the 10 departments of the store (packaged produce, deli, bakery, dairy, meat, dry goods, fresh produce, seafood, coļ¬€ee, and sushi). Unfortunately, this was not the data that we had originally intended on receiving for our grocery store checkout simulation, but that did not mean that it wasnā€™t useful. We met up as a team and discussed how we wanted to move forward with this new information. We brainstormed a proposal for a new project that we could formulate, based on the data that we were provided. We settled on creating an MDS model comparing the diļ¬€erent departments on an hour by hour basis, based on their normalized distributions for each indicator. The details of the model are explained below. 2.2 Similar Modelings Multidimensional scaling (MDS) is a set of data analysis techniques that display the structure of distance-like data as a geometrical picture. Evolving from the work of Richardson, [1] Torgerson proposed the ļ¬rst MDS method and coined the term.[2]. MDS is now a general analysis technique used in a wide variety of ļ¬elds, such as marketing, sociology, economies etc. In 1984, Young and Hamer published a book on the theory and applications of MDS, and they presented applications of MDS in marketing. [3] J.A. Tenreiro Machado and Maria Eugenia Mata from Portugal analyzed the world economic variables using multidimensional scaling[4] that is similar as we do. 
Tenreiro and Mata analyze the evolution of GDP per capita,[5] international trade openness, life expectancy and education tertiary enrollment in 14 countries from 1977 up to 2012[6] using MDS method. In their study, the objects are country economies characterized by means of a given set of variables evaluated during a given time period. They calculated the distance between i-th and j-th objects by taking diļ¬€erence of economic variables for them in several years period. They plot countries on the graph and distinguish countries by multiple aspects like human welfare, quality of life and growth rate. Tenreiro and Mata concluded from the graphs that the analysis on 14 countries over the last 36 years under MDS techniques proves that a large gap separates Asian partners from converging to the North-American and Western-European developed countries, in terms of potential warfare, economic development, and social welfare. The modeling Tenreiro and Mata use is similar as we do. In our projects, the objects are departments in grocery store. They studied the diļ¬€erence/similarity between country economies through years, while we study the diļ¬€erence/similarity between diļ¬€erent departments through hours in a day. In Tenreiro and Mataā€™s research, the countries developed at the same time are close on the graphs; in our study, the store departments that are busy at the same time are close on the graphs. However, the database of our project is much smaller than theirs. We compared departments from the data of the number of items sale, customersā€™ number and the total amount sale at a given time period. Tenreiro and Mataā€™ s data is more dimensional, from GDP per capita, economic openness, life expectancy, and tertiary education etc. And also our project studies similarity of busyness from another side: percentage of each department sale at the given hour. 
2.3 Similar Problems The objective of our project is to help the grocery store owner to plan the layout of diļ¬€erent blocks of store and increase storeā€™s sale by ļ¬nding the interrelationships of busyness between products from diļ¬€erent departments. The problem of how to layout a grocery store to maximize the purchases of the average customer is discussed in many works, through both aspects of merchandising and mathematics. As mentioned by one article, grab-and-go items such as bottled water and snacks should be placed near the entrance; Deli and Coļ¬€ee Bar should be placed in one of the front corners to attract hungry customers; Cooking Ingredients, and Canned Goods should be placed in the center aisles to draw customers to walk deeper and shop through nonessential items.[8] There are also many economists and mathematicians working on similar problems. In the paper written by Boros, P., FehĆ©r, O., Lakner, Z., traveling salesman problem (TSP) was used to maximize the shortest walking distance 3
  • 4. for each customer according to diļ¬€erent arrangements of the departments in the store.[9] The results showed that the total walking distances of customers increased in the proposed new layout.[9] Chen Li from University of Pittsburgh modeled the department allocation design problem as a multiple knapsack problem and optimized the adjacency preference of departments to get possible maximum exposure of items in the store, and try to give out an eļ¬€ective layout.[10] Similar optimization was used in the paper by Elif Ozgormus from Auburn University.[11] To access the revenue of the store layout, she used stochastic simulation and classiļ¬ed departments in to groups where customers often purchase items from them concurrently.[11] By limiting space, unit revenue production and department adjacency in the store, she optimized the impulse purchase and customer satisfaction to get a desired layout.[11] All three papers have similar basic objectives to ours. The paper by Boros et al. was aiming to maximize the total walking distance of each customer and thus promote sales of the store.[9] Liā€™s paper also focused on proļ¬t maximization but with considerations of the exposure of the items and adjacencies between departments.[10] He is the ļ¬rst person to incorporate aisle structure, depart- ment allocation, and departmental layout together into a comprehensive research.[10] The paper by Ozgormus took revenue and adjacency into consideration and worked on the model speciļ¬cally for grocery stores towards the objectives of maximizing revenue and adjacency satisfaction.[11] In our paper, we simply focus on the busyness of diļ¬€erent departments and use multidimensional scaling to model the similarities between each department and thus provide solid evidence for designing an eļ¬ƒcient and proļ¬table layout. Instead of having data on comprehensive customer behavior in the store, we have data of sales from the register point of view. 
3 The Model

As a result of the data acquisition process described in the Background section, we obtained an hourly breakdown of the number of items sold, total sales, and number of customers for the local QFC from which we collected data. The data presents a 24-hour snapshot of a standard day in the grocery store. Because the data came as individual printouts of each department's activity for the day, the first step was to transcribe all of the information from paper onto an Excel spreadsheet. The results are presented in the Appendix.

The next step was to separate and normalize each of the different activity indicators by their departmental, as well as hourly, totals. In this way, we transformed the raw data into standardized distributions whose areas under the curve sum to one. Specifically, we separated the data into three different 24 × 10 matrices (items, sales, and customers), where the rows of each matrix represent the hourly data for a 24-hour time period and the columns represent each of the 10 departments. For each of these matrices, we normalized each entry by its daily departmental total, i.e. for each department (or column) we divided each entry in the column by the column sum:

MATLAB Code:

    for i = 1:10
        items_normD(:,i) = items_raw(:,i)/sum(items_raw(:,i));
        sales_normD(:,i) = sales_raw(:,i)/sum(sales_raw(:,i));
        cust_normD(:,i)  = cust_raw(:,i)/sum(cust_raw(:,i));
    end

Additionally, we normalized each of the 24 rows (hourly data) by the row sum of the activity for that particular hour across all departments:

MATLAB Code:

    for i = 1:24
        items_normH(i,:) = items_raw(i,:)/sum(items_raw(i,:));
        sales_normH(i,:) = sales_raw(i,:)/sum(sales_raw(i,:));
        cust_normH(i,:)  = cust_raw(i,:)/sum(cust_raw(i,:));
    end

These calculations were performed on a mid-2010 MacBook Pro, running Windows 7 SP1, in MATLAB R2016b Student edition. The calculations were instantaneous.
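The two normalization loops above can also be expressed compactly with array broadcasting. The following Python/NumPy sketch mirrors the MATLAB loops on a randomly generated stand-in for the 24 × 10 raw items matrix (the real values come from the transcribed QFC spreadsheet; the random data here is purely illustrative):

```python
import numpy as np

# Random stand-in for the 24 x 10 raw items matrix (24 hours x 10 departments).
rng = np.random.default_rng(0)
items_raw = rng.integers(0, 50, size=(24, 10)).astype(float)

# Normalize each column (department) by its daily total,
# mirroring the first MATLAB loop.
items_normD = items_raw / items_raw.sum(axis=0, keepdims=True)

# Normalize each row (hour) by that hour's store-wide total,
# mirroring the second MATLAB loop.
items_normH = items_raw / items_raw.sum(axis=1, keepdims=True)

# Each department's distribution now sums to one over the day,
# and each hour's distribution sums to one across departments.
print(items_normD.sum(axis=0))
print(items_normH.sum(axis=1))
```

The same two lines apply unchanged to the sales and customer matrices.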
This normalization process resulted in six different datasets of customer activity: the number of items,
sales, and the number of customers, each normalized by their daily departmental totals and, separately, by their hourly store totals. We ran each of these datasets through the distance calculations described below in order to generate different variations of the information, ultimately in search of the best "goodness of fit."

In order to create an MDS model of the above-mentioned datasets, our next step was to run each dataset through our distance algorithm to calculate a one-dimensional distance between different departments. In other words, we iterated through each of the departments, a, and compared them to each of the other departments', b, hourly customer activity. We utilized the Minkowski distance formula for our distance calculations [7]:

    \mathrm{distance} = \left( \sum_{i=1}^{24} |r_{a,i} - r_{b,i}|^p \right)^{1/p}

where i represents the hourly time period (e.g. i = 1 represents 12:00 a.m. to 1:00 a.m.), a and b represent the two departments being compared, and p represents the power of the Minkowski formula. The most commonly considered powers are p = 1, 2, and ∞. A power of 1 is commonly referred to as the Manhattan distance, a power of 2 as the Euclidean distance, and a power of ∞ as the supremum distance.

We used R version 3.3.2 on a Late 2013 MacBook Pro running macOS 10.12.1 to carry out our calculations, which ran instantly. Specifically, we ran the following commands in R:

    library(readr)
    library(wordcloud)
    items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")
    d <- dist(items, method = "euclidean")
    ll <- cmdscale(d, k = 2)
    textplot(ll[,1], ll[,2], items[,1], ann = FALSE)

Step by step, here is what the commands do:

    library(readr)
    library(wordcloud)

These commands import the libraries that allow us to read the CSV file and create the plot.

    items <- read.csv(file = "ItemsHourLabel.csv", head = TRUE, sep = ",")

This command reads the formatted 24-dimensional vectors corresponding to each department from the file "ItemsHourLabel.csv" into a table called "items". The file "ItemsHourLabel.csv" consists of rows that look like this:

    Department,00:00 - 01:00,01:00 - 02:00,02:00 - 03:00,03:00 - 04:00,...
    Packaged Produce,0,0,0,0.011299,0,0,0.022599,0.00565,0.022599,...
    Deli,0.006135,0,0,0,0,0.02454,0.006135,0.02454,0.018405,0.02454,...
    Bakery,0.001661,0,0,0,0,0.021595,0.019934,0.059801,0.043189,...
    ...

In this case, each row represents the number of items sold in each department in a given hour divided by the total number of items sold in that department over the course of the day. The department names at the beginning of each row are used for the graphic output.

    d <- dist(items, method = "euclidean")

This command takes the table "items" and creates a matrix of distances between every pair of rows of the table. Here, the distance method is specified as "euclidean", which means that the distance between
rows a and b will be calculated as

    d_{ab} = \left( \sum_{i=1}^{24} |r_{a,i} - r_{b,i}|^2 \right)^{1/2}.

    ll <- cmdscale(d, k = 2)

Here, k = 2 specifies a two-dimensional model. The output is a list of two-dimensional coordinates, one for each object in the original set:

    > head(ll, 10)
                  [,1]        [,2]
     [1,] -0.032088329  0.01770756
     [2,] -0.027631806  0.02097795
     [3,] -0.028511119  0.05441644
     [4,] -0.013549396 -0.01713736
     [5,] -0.086806729 -0.06648990
     [6,] -0.007476898 -0.01173682
     [7,] -0.010818238 -0.02144684
     [8,] -0.001610913  0.18130208
     [9,] -0.045186100 -0.12261632
    [10,]  0.253679528 -0.03497679

    textplot(ll[,1], ll[,2], items[,1], ann = FALSE)

This command plots the result with the names of the departments. ll[,1] and ll[,2] specify that the first column of ll gives the x-coordinates and the second column gives the y-coordinates. items[,1] specifies that the first column of the table "items" gives the labels for the data points. ann = FALSE removes the x and y axis labels from the plot. The results of these commands are presented in the following section.

4 Results

4.1 Hourly

In order to draw conclusions about the two-dimensional representation of our data, we can compare it to the original data after it has been normalized by the hourly store totals. The first result is the 2D plot of the items per hour:
We immediately see that the dairy and fresh produce departments differ from the rest of the data. Similarly, the coffee shop and bakery differ significantly. We then wish to find two features of the data that may be causing these differences and can serve as the dimensions of our plot. A plot of the items sold over the course of the day in each department follows:

We can see that the dairy and fresh produce departments both sell more than double any other department at their respective peaks, which occur at approximately the same time of day. So, the horizontal dimension of our 2D representation of the data corresponds to this large peak between the hours of 12 p.m. and 8 p.m. This is further supported by the fact that the dry goods department and the bakery follow this trend to a lesser degree (less than dairy
and fresh produce but more than the other departments), so they are closer to the right side of our plot.

Nothing immediately stands out from the raw data to indicate that the coffee shop and the bakery differ from the rest of the departments in any meaningful way. We can instead look at the normalized data to see what may be the cause of this vertical distance in the plot:

Here, we see that the coffee shop and the bakery sell the majority of the total items sold in the store between about 6 a.m. and 9 a.m. This does make sense, as many people may be purchasing coffee and/or baked goods in the morning for breakfast. However, this second dimension tells us that the departments differ in the times at which they are most active, which we already knew from our first dimension and from the fact that our data is separated by departments and time intervals. Consequently, this second dimension is not very useful.

Examining the other two 2D plots of the data normalized by hourly totals, i.e. sales and customer count, leads to similar conclusions. That is, the axes of the plots depend on the times at which business activity spikes in each department. If we now consider the example of the sales, we see that the 2D representation is essentially the same as with the previous dataset:
While the distances are altered slightly, the plot is otherwise simply inverted. The results for the customer data are very similar and are included in the Appendix.

4.2 Daily

As with the data normalized by hourly totals, the 2D representations of our data normalized by daily totals exhibit a relationship between departments that are busiest at the same times:

For example, in the above plot of the items sold in each department, we see that the coffee shop is far away from the seafood department. Looking at the raw data of the number of items sold per department over the course of the day (included above in this section), there does not seem to be anything contrasting the coffee shop and seafood in any meaningful way. Instead, we can look directly at the normalized data:
We can see that the coffee shop is busiest early in the day, between 9 a.m. and 12 p.m., with another spike around 2 p.m. Conversely, the seafood department does the most business between 3 p.m. and 6 p.m. The rest of the departments, other than the sushi department, seem to increase their business steadily throughout the day and peak in the late afternoon. This leads us to the conclusion that one axis in our plots corresponds to the time at which each department does most of its business.

However, there is also a second dimension that appears to depend only on sushi. Looking at the 2D representation of the sales over the course of the day, we again see this strange distance between the sushi department and the rest of the store:

When we look at the raw sales data for the sushi department, the only aspects that stand out are that the department does relatively little business and that it has only three time periods with any transactions at all. There are two spikes around lunch time and dinner time, but there is also a single sushi sale between midnight and 1 a.m.
The "sushi dimension" could be a result of either the two periods of activity or the fact that the sushi department is one of the only departments to make a sale at that late hour. The former does not seem to be the case, because all of the departments go through a rise and fall of sales over the course of a day. Alternatively, if the latter is true, the "sushi dimension" is not particularly interesting, since we are only analyzing one day's worth of data and the single sale is more than likely not indicative of a trend of late-night sushi purchases. In either case, the second dimension of our plot is not really helpful in determining the similarity of any two departments. So, we can perform another dimension reduction in order to create a one-dimensional model for our data. The plot of the customer data was omitted from this discussion because of its similarity to the item and sales datasets; the results are presented in the Appendix. Our next step was to consider adjustments to the Minkowski powers and MDS dimensions in our model.

5 Adjustments and Extensions

5.1 Goodness of Fit

Our ultimate goal in generating different variations of the MDS model was to find a model with the optimal "goodness of fit" (GoF) for each of the above-mentioned datasets. Goodness of fit is a measure of how well the MDS model fits the original data based on a choice of MDS dimension and Minkowski power. For each of the different customer activities (items, sales, and customers), and for the two different normalization methods, by hour and by department (or by day), we evaluated how changing the MDS dimension and the Minkowski power affected the goodness of fit of our model. We considered each of the MDS dimensions between 1 and 9 because our model contained 10 departments. As the dimension of our MDS model increases, we expected to see the goodness of fit increase accordingly.
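Since we vary the Minkowski power p as well as the MDS dimension, it is worth seeing concretely how the choice of p changes a distance. Here is a small Python sketch of the Minkowski formula from Section 3, applied to two hypothetical normalized activity vectors (the numbers are illustrative, not our department data):

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two activity vectors.
    p=1 is Manhattan, p=2 is Euclidean; p=np.inf gives the supremum norm."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if np.isinf(p):
        return np.abs(a - b).max()
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

# Two illustrative hourly profiles (hypothetical, each summing to one).
dept_a = np.array([0.0, 0.1, 0.3, 0.4, 0.2])
dept_b = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

print(minkowski(dept_a, dept_b, 1))       # Manhattan:  ~0.6
print(minkowski(dept_a, dept_b, 2))       # Euclidean:  ~sqrt(0.10) ~ 0.316
print(minkowski(dept_a, dept_b, np.inf))  # Supremum:   ~0.2
```

Larger p weights the single biggest hourly gap more heavily, which is why the three norms can rank department pairs differently.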
We also considered the three most common Minkowski powers: p = 1, which corresponds to the Manhattan distance or 1-norm; p = 2, which corresponds to the Euclidean distance or 2-norm; and p = ∞, which corresponds to the maximum distance or infinity norm. We can use R to find the GoF data in a similar fashion to how we obtained our original model. The entire code is included below:
    library(wordcloud)
    items <- read.csv(file = "ItemsDayLabel.csv", head = TRUE, sep = ",")
    d <- dist(items, method = "euclidean")   # 2-norm
    # d <- dist(items, method = "manhattan") # 1-norm
    # d <- dist(items, method = "maximum")   # sup norm
    cmdscale(d, k = 1, eig = TRUE)$GOF       # k is the dimension

We can choose between the three distance measures depending on which norm we are testing. Similarly, we can use the following command to change dimensions:

    cmdscale(d, k = 1, eig = TRUE)$GOF

This command returns a goodness of fit value between 0 and 1, where a value of 1 indicates a perfect fit, or direct correlation, and a value of 0 indicates uniform randomness. k = 1 corresponds to the dimension of our MDS model, which we let range from 1 to n − 1 = 9, where n = 10 is the number of departments. The results are presented in the graphs below:

For the customer data, we can see that a Minkowski power of 1 (Manhattan) seems to produce models with the best goodness of fit over most MDS dimensions; in other words, the red line is consistently higher than the rest. Next, we are interested in finding the lowest MDS dimension that sufficiently models the data. For the customers by department (or day) data, we see that a dimension of 1 leads to a GoF of about 0.46. While this is acceptable in some situations, we also noticed that raising the dimension of our MDS model to 2 increases the GoF to 0.78. Therefore, to optimize our MDS model for this particular dataset, we chose a Minkowski power of 1 and an MDS dimension of 2. On the other hand, if we examine the customers by hour plot, we can see that a Manhattan Minkowski distance in a 1-D MDS model produces a goodness of fit of 0.72. Therefore, this particular set of choices is sufficient to capture the inherent trends present in our original dataset.
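As we understand it, the GOF value reported by cmdscale is an eigenvalue ratio: the share of the (absolute) eigenvalue mass of the double-centered squared-distance matrix that is captured by the first k eigenvalues. A Python sketch of that computation on a deliberately simple configuration (a unit square rather than our store data; the function name mds_gof is ours):

```python
import numpy as np

def mds_gof(D, k):
    """Approximate cmdscale-style goodness of fit: fraction of the
    eigenvalue mass of the double-centered matrix captured by the
    k largest eigenvalues."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered matrix
    lam = np.sort(np.linalg.eigvalsh(B))[::-1]   # descending eigenvalues
    return lam[:k].sum() / np.abs(lam).sum()

# Four points forming a unit square: sides 1, diagonals sqrt(2).
s = np.sqrt(2.0)
D = np.array([[0.0, 1.0, s,   1.0],
              [1.0, 0.0, 1.0, s  ],
              [s,   1.0, 0.0, 1.0],
              [1.0, s,   1.0, 0.0]])
print(mds_gof(D, 1))  # ~0.5: one axis explains half the square
print(mds_gof(D, 2))  # ~1.0: the square is exactly two-dimensional
```

This mirrors what we saw in our data: the jump in GoF from a 1-D to a 2-D model measures how much genuinely two-dimensional structure the distances contain.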
We noticed a similar trend in the items and sales data. The Manhattan distance, i.e. p = 1, seems to produce the best GoF over most of the MDS dimensions between 1 and 9. Examining the plots of items and sales by department, we see that 1-D MDS models do not sufficiently encapsulate the multi-dimensional interactions present in these datasets, producing GoF values of 0.45 and 0.51, respectively. However, if we examine the GoF for these datasets in a 2-D MDS model, 0.77 and 0.76 respectively, we see a significant increase, indicating that a 2-D MDS model is a significantly better fit. Additionally, if we examine the items and sales per hour, we notice that the 1-D Manhattan MDS models seem to be sufficient for modeling the original data, producing GoF values of 0.76 and 0.81, respectively. Goodness of fit tables for each of these datasets are presented in the Appendix.

5.2 Changing the Dimension

In formulating our problem, we made the assumption that our one day of data is meaningful in the larger scheme of business at QFC. Although no single day can be indicative of the general patterns at the store, we are working under the assumption that there are some trends present in our data that may provide insight into the store in general. We could improve our model by obtaining more data from QFC, at which point we might have more evidence that any relationships we find between departments are accurate. However, we would need a lot of data over a long period of time to proceed in this manner. Seeing how valuable this data probably is to the company, and how difficult it was for us to obtain a single day's worth, this is not a practical way forward.

As we saw in Section 5.1, calculating our plots using one dimension and the Manhattan distance seemed to produce a high enough goodness of fit.
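The combination chosen in Section 5.1 — Manhattan (p = 1) distances followed by classical scaling down to one dimension — can be sketched end-to-end in Python on three hypothetical hourly profiles that differ mainly in peak height (illustrative numbers, not our departments):

```python
import numpy as np

# Hypothetical normalized hourly profiles for three departments whose
# activity differs mainly in peak height (not the real QFC numbers).
profiles = {
    "low":  np.array([0.10, 0.20, 0.40, 0.20, 0.10]),
    "mid":  np.array([0.05, 0.15, 0.60, 0.15, 0.05]),
    "high": np.array([0.00, 0.10, 0.80, 0.10, 0.00]),
}
names = list(profiles)
X = np.stack([profiles[n] for n in names])

# Manhattan (p = 1) distance matrix between the profiles.
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

# Classical scaling to one dimension: double-center the squared
# distances and keep the leading eigenvector.
n = len(names)
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J
lam, vec = np.linalg.eigh(B)                   # eigenvalues in ascending order
coords = vec[:, -1] * np.sqrt(max(lam[-1], 0.0))

# The single axis orders the departments by peak height
# (the direction of the axis is arbitrary).
print(dict(zip(names, np.round(coords, 3))))
```

With these toy profiles the three points are exactly collinear under the Manhattan metric, so a single axis recovers the peak-height ordering perfectly — a simplified version of the right-to-left ordering we observe in our 1D plots below.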
So, we can perform our scaling again in 1D rather than 2D in an attempt to remove the excess dimension we saw in our original results. As has been the case so far, we expect this relationship to depend on the time of day at which each department does the most business. In any case, we can alter our R code slightly to reflect this change in our model:

    library(readr)
    library(wordcloud)
    items <- read.csv(file = "SalesHourLabel.csv", head = TRUE, sep = ",")
    d <- dist(items, method = "manhattan")
    ll <- cmdscale(d, k = 1)
    # Column of zeros used to plot a line in one dimension
    textplot(ll, c(0,0,0,0,0,0,0,0,0,0), items[,1], yaxt = 'n', ann = FALSE)

If we now compare our raw data to our 1D representations, we see a stronger relationship between the dimension and the data itself. Consider first the number of items sold over the course of each hour:

The fresh produce and dairy departments sell the most items at their peaks. This is reflected in the plot, as those two departments are the furthest away from the rest. In fact, if we go through the lines from top to bottom in the plot on the left, we will see that this is exactly the order in which the departments appear from right to left in the second plot. We can see the same relationship reflected in the plots for sales per hour and customers per hour:
To verify this trend, also notice that the fresh produce department has the highest peak in the sales data and is further to the right than the dairy department. Similarly, in the customer data, the dairy department is further to the right of the fresh produce department because it serves more customers between the hours of 4 p.m. and 5 p.m. So, the distances in our scaled plots seem to correspond to the height of each peak between 4 p.m. and 5 p.m., which provides insight into the maximum activity during what is the busiest hour at QFC.

As was the case in the original 2D MDS plots of the data normalized by daily totals, there seems to be something unique about the sushi department in the 1D representations. In particular, this relationship is not immediately obvious from the raw data itself. We can first compare the normalized plot of customers served over the course of the day to the 1D representation of the departments:

What stands out in the plot on the right is that the coffee shop and the sushi department are the furthest apart. When we look at the plot on the left, we notice that the coffee shop serves the highest percentage of its total customers early in the day. In particular, it serves the highest percentage of any department between 10 a.m. and 11 a.m., while the sushi department serves none. We know that this particular hour, rather than any of the other morning hours, accounts for the distances in the 1D plot because of the seafood department. That is, the seafood department does not serve its first customer until this hour, and it is closer to the rest of the departments than to the sushi department. If the plot were reflecting the differences at an earlier time, then the seafood department would presumably be right next to the sushi department, since neither serves a customer. This relationship is again apparent in the other two datasets:
5.3 Takeaways

We have seen that a 1D representation of our data is the most fitting when the data has been normalized by hourly totals. The GoF values for these three datasets are reasonably high, and the resulting plots accurately reflect the peak activity in each department at the busiest hour. This information can be useful in planning how to organize a store for the time when the most business is being done.

Conversely, 2D representations of the data normalized by daily totals seem to be more useful than the 1D plots. While the 2D plots have one dimension relating to activity at certain time periods throughout the day (e.g. breakfast time, lunch time, and dinner time) and another relating to the business of departments at one particular hour, the 1D plots only give us insight into the latter. That information is ultimately not helpful for drawing meaningful conclusions about the activity patterns in each department, because we only have data from one day. Despite the superfluous second dimension, the 2D plots still have one useful dimension, whereas the 1D plots do not have any.

Hence, we can best utilize our data to evaluate peak traffic between 4 p.m. and 5 p.m. by normalizing by hourly totals and comparing one-dimensional representations of the departments. Additionally, we can see broad trends in business by normalizing our data by daily totals and representing it in two dimensions. In order to verify the apparent trends, though, we would still need to obtain a larger dataset.

6 Conclusion

6.1 Object of study

From the results above, we can see that several departments, such as the meat, fresh produce, and seafood departments, are similarly busy at the same times. To avoid congestion in parts of the grocery store and to maximize the amount of money customers spend in the store, the store owner would do well to separate these departments.

6.2 Limitations

First of all, we only have data for one particular day at that store.
This certainly introduces bias into our data and thus makes our model less credible. Also, our data are statistics from the register point of view: what we have are the actual purchases in each department, which capture only part of customer behavior. Further, in reality the arrangement of departments may not be flexible. Departments can be restricted by the locations of warehouses or workbenches. For instance, the sushi department needs a workbench to make fresh sushi every day; a department containing heavy items would prefer to be close to its warehouse; a coffee shop would naturally be close to the entrance or exit. For these departments, the location and size are predetermined at the time of the store's construction.

6.3 Future work

In modeling the busyness of departments, we currently rely on the number of customers, the number of items sold, and the revenue in each department. These are basic observations from the registers. What happens before people check out would also be worth considering. If possible,
we could collect data on the time an average customer spends in each department, regardless of whether he/she buys something there. Similarly, the number of customers who physically appear in each department also measures the busyness of that department.

In terms of generating a layout that maximizes sales in the store, there are many aspects worth deeper discussion. In addition to the locations of different departments, we could take the sizes of departments into consideration. The detailed placement and sizes of aisles, shelves, and items on each shelf would also have a significant impact on sales. This would be more realistic, since it is easier to make changes to them than to the predetermined locations of departments. Completely different models and more complicated modeling methods would be required to identify the interrelationships of locations and sizes between different aisles and shelves.
References

[1] Richardson, M. W. (1938). Psychological Bulletin, 35, 659-660.

[2] Torgerson, W. S. (1952). Psychometrika, 17, 401-419. (The first major MDS breakthrough.)

[3] Young, F. W. (1984). Research Methods for Multimode Data Analysis in the Behavioral Sciences. H. G. Law, C. W. Snyder, J. Hattie, and R. P. MacDonald, eds. (An advanced treatment of the most general models in MDS. Geometrically oriented. Interesting political science example of a wide range of MDS models applied to one set of data.)

[4] Machado, J. T., Mata, M. E. (2015). Analysis of World Economic Variables Using Multidimensional Scaling. PLOS ONE 10(3): e0121277. http://dx.doi.org/10.1371/journal.pone.0121277

[5] Anand, S., Sen, A. (2000). The Income Component of the Human Development Index. Journal of Human Development, 1.

[6] World Development Indicators, The World Bank, Time series, 17 Nov. 2016. http://data.worldbank.org/data-catalog/world-development-indicators

[7] Wikipedia contributors. "Minkowski distance." Wikipedia, The Free Encyclopedia, 1 Nov. 2016. Web. 1 Nov. 2016. https://en.wikipedia.org/w/index.php?title=Minkowski_distance&oldid=747257101

[8] Editors of Real Simple. "The Secrets Behind Your Grocery Store's Layout." Real Simple, 2012. Web. 29 Nov. 2016. http://www.realsimple.com/food-recipes/shopping-storing/more-shopping-storing/grocery-store-layout

[9] Boros, P., Fehér, O., Lakner, Z., et al. Ann Oper Res (2016) 238: 27. doi:10.1007/s10479-015-1986-2. http://link.springer.com/article/10.1007/s10479-015-1986-2

[10] Li, Chen. "A Facility Layout Design Methodology for Retail Environments." D-Scholarship, University of Pittsburgh, 3 May 2010. Web. 29 Nov. 2016. http://d-scholarship.pitt.edu/9670/1/Dissertation_ChenLi_2010.pdf

[11] Ozgormus, Elif. "Optimization of Block Layout for Grocery Stores." Auburn University, 9 May 2015. Web. 29 Nov. 2016.
https://etd.auburn.edu/bitstream/handle/10415/4494/Eozgormusphd.pdf;sequence=2
Appendix

Link to Google Drive:
https://drive.google.com/drive/folders/0B-8II7_BkXIbTmZ0aEREQ2RzSzA?usp=sharing

A.1 Raw data

The printout of the data we received from QFC. There are 10 pages in total, one page for each department.
We converted the data into an Excel table:

Plot of items sold over the course of the day:
Plot of sales made over the course of the day:

Plot of customers served over the course of the day:
A.2 Normalized Data

Normalized-by-hour spreadsheet:

Items normalized by hour:
Sales normalized by hour:

Customers normalized by hour:

Normalized-by-day spreadsheet:
Sales normalized by day:

Customers normalized by day:
A.3 2D MDS Results

2D plot of items sold, normalized by day:

2D plot of sales made, normalized by day:

2D plot of customers served, normalized by day:
2D plot of items sold, normalized by hour:

2D plot of sales made, normalized by hour:
2D plot of customers served, normalized by hour:
A.4 1D MDS Results

1D plot of items sold, normalized by day:

1D plot of sales made, normalized by day:

1D plot of customers served, normalized by day:
1D plot of items sold, normalized by hour:

1D plot of sales made, normalized by hour:
1D plot of customers served, normalized by hour:
A.5 Goodness of Fit Tables

Table 1: Customers by Dept. (or Day) GOF
Dim   Euclidean    Manhattan    Supremum
1     0.3963199    0.4584881    0.3978179
2     0.7600668    0.7807252    0.719283
3     0.8587553    0.8875351    0.866924
4     0.9253756    0.93352      0.9277296
5     0.9536219    0.9620621    0.9548165
6     0.9788168    0.9811779    0.9647046
7     0.9909951    0.99114      0.9672805
8     0.9963022    0.9968744    0.9672805
9     1            0.9968744    0.9672805

Table 2: Customers by Hour GOF
Dim   Euclidean    Manhattan    Supremum
1     0.5850668    0.7160555    0.4549306
2     0.776274     0.8507424    0.6912069
3     0.8732725    0.9249754    0.8543061
4     0.9398181    0.9695998    0.920125
5     0.9792807    0.9774312    0.9595707
6     0.9920085    0.9774312    0.9745372
7     0.9962814    0.9774312    0.9773711
8     0.9983624    0.9774312    0.9773711
9     1            0.9774312    0.9773711

Table 3: Items by Dept. (or Day) GOF
Dim   Euclidean    Manhattan    Supremum
1     0.4293083    0.4496683    0.4001874
2     0.7539912    0.7742156    0.6793937
3     0.8888276    0.9034419    0.8202917
4     0.933347     0.9557358    0.8875166
5     0.9597362    0.9777854    0.9168588
6     0.9839133    0.9903894    0.9333318
7     0.9932868    0.9957496    0.9356707
8     0.9976301    0.9989588    0.9356707
9     1            0.9989588    0.9356707
Table 4: Items by Hour GOF
Dim   Euclidean    Manhattan    Supremum
1     0.6427679    0.757986     0.4270805
2     0.8166962    0.8722931    0.6788556
3     0.9056558    0.9410536    0.8985873
4     0.9580441    0.9824214    0.9485446
5     0.9869886    0.9876273    0.9670813
6     0.9944574    0.9888287    0.9815083
7     0.9974807    0.9892658    0.9870639
8     0.9991767    0.9892658    0.9870639
9     1            0.9892658    0.9870639

Table 5: Sales by Dept. (or Day) GOF
Dim   Euclidean    Manhattan    Supremum
1     0.4846022    0.5108597    0.4146601
2     0.7679321    0.7631536    0.6745012
3     0.8678356    0.8797308    0.8164288
4     0.9385633    0.9405769    0.9091716
5     0.9595338    0.971421     0.9468887
6     0.9772032    0.9925368    0.9649131
7     0.990547     0.9976865    0.9729114
8     0.9965854    0.9997359    0.9729114
9     1            0.9997359    0.9729114

Table 6: Sales by Hour GOF
Dim   Euclidean    Manhattan    Supremum
1     0.5269157    0.6498416    0.3607062
2     0.7274713    0.8119829    0.5933631
3     0.8401425    0.8910053    0.7961126
4     0.9072517    0.9563971    0.8790723
5     0.953552     0.9757764    0.9359311
6     0.9754582    0.9857487    0.9747406
7     0.9928391    0.9885393    0.9834409
8     0.9971796    0.9885393    0.9850726
9     1            0.9885393    0.9850726