Master’s Project Report
Sales Prediction of 111 Weather Sensitive Products in 45
Walmart Stores using Machine Learning Techniques and
Discussion on its Implications for Inventory Policy
by
Minchao Lin
December 10, 2015
Contents
1 Motivation
2 Objectives
3 Data Description
3.1 Training Data and Test Data
3.2 Data Features
3.3 Feature Engineering
3.4 Feature Correlation
4 Models and Techniques
4.1 Performance Metric
4.2 Models
4.2.1 Stepwise Linear Regression
4.2.2 K-Nearest Neighbors Search
4.2.3 Ensemble Learning
4.2.4 Combinations of Models
5 Implications
5.1 Cross Validation
5.2 Evaluating Forecasts
5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock
6 Conclusion
7 References
8 Appendices
1 Motivation
Demand forecasting and inventory control are two of the most important aspects in
supply chain management. An accurate demand forecast not only helps replenishment managers set the right inventory levels but also helps them avoid stockouts and overstocking. To forecast demand well, we need to take into account the various factors that may contribute significantly to demand variability. For a retail store, extreme weather events such as hurricanes and blizzards can have a huge impact on sales at the store and product level. Thus, accurately predicting the sales of potentially weather-sensitive products around the time of major weather events becomes essential for timely inventory adjustments. In addition, the difference between predicted and realized demand can provide further information for setting inventory policy parameters such as the level of safety stock.
2 Objectives
The objectives of this project are two-fold. The first objective is to fit an effective model to predict the sales of 111 potentially weather-sensitive products, affected by snow and rain, in 45 Walmart retail stores. Specifically, for each product the task is to predict the units sold within a window of ±3 days surrounding each storm. Model performance is evaluated with the Root Mean Squared Logarithmic Error (RMSLE) and compared with the results of the other 485 teams in the online Walmart recruiting competition. The training data used to generate the model comes with actual product demand and actual weather data, while the actual demand in the test data used to evaluate the predictions is not provided. The only way to assess the effectiveness of the model is to submit the predicted demand online and obtain its RMSLE. Because the actual demand in the test data is unknown, which limits further analysis of the inventory policy for these products, a second objective is introduced: to fully utilize the training data by applying the most effective model from the previous step via cross validation, comparing the predicted and actual demand for each product, and then analyzing the corresponding safety stocks.
3 Data Description
3.1 Training Data and Test Data
Sales data are provided for 111 products, such as milk, bread, and umbrellas, whose sales may be affected by the weather. These 111 products are sold at 45 different Walmart store locations. Each product id is provided, but not its name or description; the competition teams are reminded that some products are similar but carry different ids in different stores. The 45 store locations are covered by 20 weather stations, so some stores share a weather station. The full observed weather covering both the training and test periods is provided. The training data contains 4,617,600 observations and the test data contains 526,917 observations.
In the following graph, the green dots show the training set days, the red dots show the test set days, and days marked event=True are the days with storms. The graph covers the 20 weather stations.
Figure 1. Training set days and test set days for 20 weather stations.1
3.2 Data Features
The features in the training data provided include:
• date
• store id
• item id
• number of units sold
The features in the weather data provided include:
1 “Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle,” accessed December 9, 2015, https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
• date
• weather station id
• dew point temperature
• wet bulb temperature
• heating degree days
• cooling degree days
• time for sunrise
• time for sunset
• significant weather types
• snowfall in inches
• water equivalent of rainfall and melted snow
• average station pressure
• average sea pressure
• resultant wind speed
• resultant wind direction
• average wind speed
3.3 Feature Engineering
In order to better describe the underlying structure in the data, new features are created based on observation and analysis of the provided data. It is reasonable to assume that sales on a given day may be related to the position of that day within a month, within a year, or within the whole timeframe of the dataset, so the new features generated from the date include day of the month, month, day of the year, year, a numeric value for each date, weekday, and whether that day is a holiday or not.
In addition, it is observed that sales vary significantly from month to month. Thus, the monthly average sales for each product are calculated and serve as another new feature. Based on the monthly average sales, a binary variable identifying whether the monthly average sales equal zero is created. Indicating whether the same month has zero sales in each year for a product provides further detail for the predicted demand during that month, thus improving the accuracy of the model.
Temperature can be another relevant feature, because very high or very low temperatures may influence a customer’s decision to go out or stay home. The “feels like” temperature may be an even better indicator. Since the feels-like temperature is related to the moisture in the air, two new features describing the moisture in the air in two different ways are created. The first is the difference between the dew point temperature and the average temperature, since this difference represents how far the amount of moisture in the air is from saturation. The second is the difference between the wet bulb temperature and the average temperature, which reflects the relative humidity: the larger the difference, the lower the relative humidity.
Precipitation and average wind speed are included directly without further processing. The snowfall feature is eliminated because it contains too many undefined values (NaN or empty cells). Resultant wind speed is not included either, as it is closely correlated with average wind speed. The remaining weather features are ignored either because they consist of too many distinct text entries that are hard to encode numerically or because they have no intuitive relationship to product sales. These features include heating degree days, cooling degree days, time of sunrise, time of sunset, significant weather types, average station pressure, average sea pressure, and resultant wind direction.
Because some products have many days with zero sales, I assume that the number of zero-sales days before or after a given day may also influence the sales on that day. Three new features are created based on this assumption: the number of consecutive days with zero sales before today, the number of consecutive days with zero sales after today, and the minimum of the previous two. Besides the number of zero-sales days, the average sales before or after each day may also affect that day's sales. Thus, I created one variable for the average sales over the seven days before today and another for the average sales over the seven days after today. If the seven days before a date are not all included in the training data (i.e., some dates fall in the test data), the average is taken over only the available sales in the training data.
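A minimal MATLAB sketch of these engineered features is given below. The variable names (a single store/item daily sales vector called sales) are assumptions for illustration; the sketch mirrors the definitions above rather than the exact project code.

```matlab
% Sketch with assumed names: 'sales' is the daily sales vector for one
% store/item combination, ordered by date.
sales = [0 0 3 0 0 0 5 2 0 1];          % toy series for illustration
n = numel(sales);

zerosBefore = zeros(1, n);              % consecutive zero-sales days before each day
zerosAfter  = zeros(1, n);              % consecutive zero-sales days after each day
for t = 2:n
    if sales(t-1) == 0
        zerosBefore(t) = zerosBefore(t-1) + 1;
    end
end
for t = n-1:-1:1
    if sales(t+1) == 0
        zerosAfter(t) = zerosAfter(t+1) + 1;
    end
end
minZeroRun = min(zerosBefore, zerosAfter);

% Average sales over the seven days before/after each day; near the series
% edges only the available days are averaged, as described above.
avgBefore7 = zeros(1, n);
avgAfter7  = zeros(1, n);
for t = 1:n
    pre  = sales(max(1, t-7):t-1);
    post = sales(t+1:min(n, t+7));
    if ~isempty(pre),  avgBefore7(t) = mean(pre);  end
    if ~isempty(post), avgAfter7(t)  = mean(post); end
end
```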
To conclude, features that are used to build models are:
1. numeric number for the date
2. month
3. day in month
4. year
5. weekday
6. is holiday or not
7. day in year
8. monthly average sales
9. is a month having zero sales or not
10. precipitation
11. average wind speed
12. difference between average temperature and dew point temperature
13. difference between average temperature and wet bulb temperature
14. number of continuous days with zero sales after today
15. number of continuous days with zero sales before today
16. minimum of the number of continuous days with zero sales before or after today
17. average sales seven days before today
18. average sales seven days after today
3.4 Feature Correlation
Because multiple variables are used to build the model, a multicollinearity problem may arise if these variables are not independent. As a first step towards model specification, it is useful to identify any possible dependencies among the predictors. The correlation matrix is a standard measure of the strength of pairwise linear relationships. In the following table, the correlation coefficient (R) between each pair of numeric variables is reported:
Variables 1 2 3 4 5 6 7 8 9 10
1 1 0.0066 -0.038 0.035 -0.11 -0.12 -0.29 -0.19 -0.24 0.015
2 0.0066 1 0.027 0.056 -0.20 0.067 -0.35 -0.40 -0.42 0.82
3 -0.038 0.027 1 0.12 -0.37 -0.027 0.023 0.029 0.049 0.033
4 0.035 0.056 0.12 1 0.24 0.020 0.10 -0.071 0.0011 0.064
5 -0.11 -0.20 -0.37 0.24 1 0.040 0.23 0.13 0.18 -0.16
6 -0.12 0.067 -0.027 0.020 0.040 1 -0.047 -0.053 -0.047 -0.035
7 -0.29 -0.35 0.023 0.10 0.23 -0.047 1 0.17 0.58 -0.25
8 -0.19 -0.40 0.029 -0.071 0.13 -0.053 0.17 1 0.58 -0.36
9 -0.24 -0.42 0.049 0.0011 0.18 -0.047 0.58 0.58 1 -0.35
10 0.015 0.82 0.033 0.064 -0.16 -0.035 -0.25 -0.36 -0.35 1
Table 1. R value between each numeric variable
Variables 1 to 10 represent the following features: numeric date, monthly average sales, precipitation, average wind speed, the difference between average temperature and dew point temperature, the difference between average temperature and wet bulb temperature, the number of continuous days with zero sales after today, the number of continuous days with zero sales before today, and the minimum of the previous two features.
From the table, we observe that only the number of continuous days with zero sales after today and the number of continuous days with zero sales before today have a moderate correlation with the minimum of the two. These moderate correlations are handled by the ensemble methods, where only a subset of features is selected to grow each decision tree. The other R values show little correlation between the remaining pairs of features.
Besides pairwise correlation, relationships among arbitrary subsets of features may also indicate a multicollinearity problem. To diagnose multicollinearity, we can calculate the variance inflation factor (VIF). The VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis and is calculated as:
$$VIF_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of determination obtained by regressing feature $i$ on all the other features. When the variation of feature $i$ is largely explained by a linear combination of the other features, $R_i^2$ is close to 1 and the VIF for that feature is correspondingly large. A rule of thumb is that a VIF greater than 10 indicates high multicollinearity. The VIF for each of the previous variables is calculated below:
Variables 1 2 3 4 5 6 7 8 9 10
VIF 1.20 3.65 1.27 1.18 1.45 1.05 1.91 1.84 2.44 3.24
Table 2. VIF for each variable
The above values show that monthly average sales and the minimum value of continuous days with zero sales before or after today have the two highest VIFs, but their values are still far below the threshold of 10. Thus we conclude that no significant multicollinearity exists between the variables.
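As a brief illustration, both the correlation matrix in Table 1 and the VIFs in Table 2 can be reproduced from the feature matrix. The sketch below assumes a matrix X (an assumed name) whose ten columns are the numeric features listed above.

```matlab
% Sketch with an assumed matrix X (n-by-10, one column per numeric feature)
R = corrcoef(X);          % pairwise correlation matrix (Table 1)
vif = diag(inv(R))';      % VIF_i equals the i-th diagonal entry of inv(R),
                          % which is the same as 1 / (1 - R_i^2)
disp(vif);                % values above roughly 10 would flag multicollinearity
```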
4 Models and Techniques
4.1 Performance Metric
For a regression problem, a measure of the distance between the estimated outputs and the actual outputs is used to quantify the model's performance. The Mean Squared Error penalizes larger differences more heavily because of the squaring. Conversely, if we want to reduce the penalty on larger differences, we can log-transform the quantities first. The effect of introducing the logarithm is to balance the emphasis on small and large prediction errors. For the Walmart recruiting competition, submissions are evaluated on the Root Mean Squared Logarithmic Error (RMSLE):
$$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$
where:
• n is the number of observations in the test set
• p_i is the predicted number of units sold
• a_i is the actual number of units sold
• log(x) is the natural logarithm
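Computed directly from its definition, the metric is a single line of MATLAB; p and a below are assumed vectors of predicted and actual units sold.

```matlab
% Toy check of the RMSLE definition with three predictions
p = [12; 0; 5];                                  % predicted units sold
a = [10; 1; 5];                                  % actual units sold
rmsle = sqrt(mean((log1p(p) - log1p(a)).^2));    % log1p(x) = log(1 + x)
```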
4.2 Models
4.2.1 Stepwise Linear Regression
Stepwise linear regression creates a linear model and automatically adds or removes terms based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models using forward selection and backward elimination. Specifically, at each step the p value of an F statistic is computed to test the model with and without a candidate term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added; if the null hypothesis is rejected, the term with the smallest p value among all terms whose p values are below an entrance tolerance is added to the model. Conversely, if a term is already in the model, the null hypothesis is that it has a zero coefficient; if there is no significant evidence to reject this hypothesis, the term with the largest p value among all terms whose p values exceed an exit tolerance is removed from the model.2 In this sense, stepwise models are locally optimal but may not be globally optimal.
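A minimal sketch of this procedure in MATLAB is shown below. The table and variable names (tbl, tblTest, units) and the tolerances are assumptions for illustration, and the sketch includes the log(1 + x) response transform that is introduced later in this section.

```matlab
% Sketch with assumed names: tbl holds the chosen predictor columns plus the
% response 'units'; tblTest holds the same predictor columns for the test rows.
tbl.logUnits = log1p(tbl.units);                  % log(1 + x) transform of the response
tbl.units = [];                                   % drop the raw response column
mdl = stepwiselm(tbl, 'constant', ...             % start from an intercept-only model
    'ResponseVar', 'logUnits', ...
    'Upper', 'linear', ...                        % candidate terms: main effects only
    'PEnter', 0.05, 'PRemove', 0.10);             % illustrative entrance/exit tolerances
predLog = predict(mdl, tblTest);
pred = max(expm1(predLog), 0);                    % invert log1p and clip negative sales
```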
For this method, five stepwise models were built based on different combinations of variables (the numbers that represent each feature correspond to the list at the end of Section 3.3). The first four models are listed below:
2 “Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm,” accessed December 10, 2015, http://www.mathworks.com/help/stats/stepwiselm.html.
RMSLE of each model 1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.12995 √ √ √ √ √
0.11892 √ √ √ √ √ √ √
0.13218 √ √ √ √ √ √ √ √ √ √ √ √ √
0.19076 √ √ √ √ √ √ √ √ √ √ √ √ √ √ √
Table 3. Stepwise Linear Regression Models
The model with the best RMSLE in the table is the second one, with an RMSLE equal to 0.11892. From the results, we can see that having more features does not necessarily improve the model. Thus, instead of creating more features, the focus was shifted from the predictor variables to the response variable. Since the performance metric for the Walmart recruiting online competition applies a log transformation to the difference between the predicted and actual values in the test data, a log transformation is also applied to the response values (i.e., units sold for each item in each store) in the training data as an attempt to improve prediction performance. To avoid undefined or negative transformed values, log(1 + x) is applied to each response value. The best result is as follows:
RMSLE 1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.10477 √ √ √ √ √
Table 4. Stepwise Linear Regression Models with log-transformed response variable
The above result shows that log transformation of the response value in the training data does
improve the performance. However, it is also observed that even for log-transformed response
values, having more features doesn’t necessarily improve the model. The final ranking of the
best stepwise linear regression model from above is 94/485.
Figure 2. Ranking of Stepwise Linear Regression Model
4.2.2 K-Nearest Neighbors Search
K-nearest neighbors search finds the k closest points in X for each query point in Y; the predicted value is then calculated as the average of those k closest points, or as a weighted average of them using inverse distance weights. Two different search methods can be used. The exhaustive search method computes the distance from each query point to every point in X, ranks the distances in ascending order, and returns the k points with the smallest distances. The Kd-tree search method divides the data into nodes with a certain bucket size based on coordinates. The k closest points are first found within the node that the query point in Y belongs to; then points in all other nodes that lie within the distance between those k points and the query point are considered as well. Using a Kd-tree for large data sets can be much more efficient than using the exhaustive search method because it only calculates a subset of the distances. Distances can also be measured with various metrics. The most common distance metric is Euclidean distance. The other distance metrics tested later in this section include correlation distance, Spearman distance, cosine distance, and Hamming distance. Correlation distance is calculated as one minus the sample linear correlation between observations, which are treated as sequences of values. Spearman distance is calculated as one minus the sample Spearman’s rank correlation between observations, which are treated as sequences of values. Cosine distance is calculated as one minus the cosine of the included angle between observations, which are treated as vectors. Hamming distance is calculated as the percentage of coordinates that differ.3
Thus, the parameters varied are the nearest neighbors search method, the way the predicted value is calculated from the closest neighbors, the number of closest neighbors, and the distance metric. The MATLAB default is followed when choosing the search method: the exhaustive search method is used when X has more than 10 columns, and the Kd-tree search method is used otherwise.
For the exhaustive search method, all 18 predictors listed in Section 3.3 are included. Different distance metrics are tested first, with the number of closest neighbors fixed at 10. The results are as follows:
Distance metric RMSLE
Euclidean distance 0.11189
Correlation distance 0.14171
Spearman distance 0.18862
Cosine distance 0.14401
Hamming distance 0.12848
Table 5. Testing Distance Metrics.
From the table, we see that Euclidean distance works significantly better than the other distance
metrics. Thus, for the next step, Euclidean distance is set to be the distance metric. The number
of closest neighbors is still set to 10. Yet instead of using the mean value of the 10 closest
3 “Classification Using Nearest Neighbors - MATLAB & Simulink,” accessed December 10, 2015, http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
neighbors, the weighted average of the k closest points using the inverse distance weights is
used.
Inverse distance weights are defined as
$$u(x) = \frac{\sum_{i=1}^{N} w_i(x)\, u(x_i)}{\sum_{i=1}^{N} w_i(x)}$$
where $w_i(x)$ is defined as
$$w_i(x) = \frac{1}{d(x, x_i)^p}$$
The result is as follows:
Ways to calculate predicted values RMSLE
Arithmetic mean 0.11189
Weighted mean with inverse distance weights (𝑝 = 1) 0.10341
Weighted mean with inverse distance weights (𝑝 = 2) 0.10473
Weighted mean with inverse distance weights (𝑝 = 3) 0.10732
Weighted mean with inverse distance weights (𝑝 = 7) 0.11666
Table 6. Testing ways to calculate predicted values
The above table shows that the weighted mean with inverse distance weights and p = 1 gives the best RMSLE. In the next step, this way of calculating the final predicted values is kept, and different numbers of closest neighbors for each point in Y are tested. Let k denote the number of closest neighbors. The results are as follows:
K RMSLE
3 0.11008
10 0.10341
40 0.10215
60 0.10193
80 0.10198
100 0.10200
Table 7. Testing K values.
For the Kd-tree search method, only predictors related to time are included. These variables correspond to features 1, 2, 3, 4, 5, and 7 in Section 3.3.
K RMSLE
20 0.10182
60 0.10126
70 0.10136
Table 8. Kd-tree search method
Figure 3. Ranking of K-Nearest Neighbors Search
To conclude, the best k-nearest neighbors model uses Euclidean distance as the distance metric, uses the weighted mean with inverse distance weights and p = 1 to predict the response value, uses only the variables related to time (numeric date, month, day in month, year, day in year, and weekday) as predictors, and uses 60 as the number of closest neighbors. The best RMSLE is 0.10126, which ranks 66/485 in the competition.
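A compact sketch of this best configuration is given below; Xtrain, ytrain, and Xtest are assumed names for the six time-related predictor columns and the corresponding units sold.

```matlab
% Sketch with assumed names: Xtrain/ytrain are the time-related predictors and
% units sold for one store/item group; Xtest holds the query rows.
k = 60; p = 1;                                      % best settings found above
[idx, dist] = knnsearch(Xtrain, Xtest, 'K', k, ...
    'Distance', 'euclidean', 'NSMethod', 'kdtree'); % Kd-tree search over 6 columns
w = 1 ./ max(dist, eps).^p;                         % inverse distance weights (avoid 1/0)
neighborSales = ytrain(idx);                        % k neighbor responses per query row
pred = sum(w .* neighborSales, 2) ./ sum(w, 2);     % weighted-mean prediction
```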
4.2.3 Ensemble Learning
Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms.4 Decision trees, neural networks, and other machine learning algorithms are commonly used as constituent learners. A decision tree builds a regression or classification model in the form of a tree structure, where the dataset is divided into smaller subsets at each node. In a regression tree, a regression model is fit to the target variable using each of the independent variables. For each independent variable the data is split at several candidate split points, and the mean squared error between the predicted and actual values is calculated at each; the node splits the predictor variable at the split point that maximizes the reduction in mean squared error.
Regression tree ensembles are built with two methods: least squares boosting and bagging. Least squares boosting fits regression ensembles so as to minimize the mean squared error; at every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously.5 Bagging trains each model in the ensemble on a randomly drawn subset (with replacement) of the training set and obtains the predicted response of the trained ensemble by averaging the predictions of the individual trees. Furthermore, random sampling with replacement omits on average 37% of the observations for each decision tree, and every tree in the ensemble can randomly select predictors for its decision splits.
4 “Ensemble Learning - Wikipedia, the Free Encyclopedia,” accessed December 9, 2015, https://en.wikipedia.org/wiki/Ensemble_learning.
5 Jerome Friedman et al., “Discussion of Boosting Papers,” Ann. Statist 32 (2004): 102–7.
Since ensembles tend to overtrain, lasso regularization of the ensembles is implemented in order to choose fewer weak learners with no loss in predictive performance.
To start training, least squares boosting and bagging are each applied with all the predictor variables listed in Section 3.3 included. The results are as follows:
Ensemble Learning Methods RMSLE
Least Squares Boosting 0.10388
Bagging 0.10142
Table 9. Ensemble Learning Methods
The results indicate that bagging works much better than least squares boosting. Thus, bagging is
chosen as the ensemble learning method.
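Both ensemble variants can be fit with standard Statistics Toolbox calls, as sketched below with assumed variable names (X, y, Xtest) and an illustrative ensemble size rather than the exact settings used in the project.

```matlab
% Sketch with assumed names: X/y are predictors and units sold, Xtest the test rows.
nTrees = 100;                                            % illustrative ensemble size

% Least squares boosting: each new tree is fit to the residual of the ensemble
boosted = fitensemble(X, y, 'LSBoost', nTrees, 'Tree', 'Type', 'regression');

% Bagging: each tree sees a bootstrap sample; predictions are averaged over trees
bagged = TreeBagger(nTrees, X, y, 'Method', 'regression');

predBoost = predict(boosted, Xtest);
predBag   = predict(bagged, Xtest);
```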
To account for potential interactions between variables, two ways of including additional feature terms are tried. The first is to add the product of every pair of distinct predictors to the pool of features, which increases the number of features from 18 to 171 (a sketch of this construction is shown after the table below). The other is to include only interactions between numerical terms, which increases the number of features from 18 to 52. The ensemble method is then applied to both sets of features. The result is as follows:
Number of features RMSLE
52 0.11728
171 0.09907
Table 10. Number of features
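A sketch of how the larger feature set can be constructed is shown below; X is an assumed n-by-18 matrix of the original features, and appending the product of every distinct pair of columns yields 18 + 18·17/2 = 171 features.

```matlab
% Sketch with an assumed n-by-18 feature matrix X
[n, d] = size(X);                         % d = 18
inter = zeros(n, d*(d-1)/2);              % 153 pairwise interaction columns
col = 0;
for i = 1:d-1
    for j = i+1:d
        col = col + 1;
        inter(:, col) = X(:, i) .* X(:, j);
    end
end
Xfull = [X inter];                        % 171 features in total
```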
The result shows that including interaction terms between each pair of predictors significantly improves the model. Hence the best performance from the regression tree ensembles is an RMSLE of 0.09907, which ranks 47/482 in the competition.
Figure 4. Ranking of Ensemble Learning Method
4.2.4 Combinations of Models
In this section, three different combinations of the previously generated models are tested to see whether they improve prediction performance. The first combination takes, for each entry in the test data, the median of the predicted values from all previous models. The second combination is a linear combination of the most effective models from k-nearest neighbors search and ensemble learning. The third combination is a linear combination of the three most effective ensemble learning models together with the most effective stepwise linear regression model. The coefficients of the linear combinations are obtained by regressing the actual training values on each model's predictions for the training data. The results are as follows:
Combinations of Models RMSLE
Median 0.09972
Linear combination of 1 k nearest neighbors and 1 ensemble learning (appendix 1) 0.10384
Linear combination of 1 stepwise linear regression and 3 ensemble learning (appendix 2) 0.09818
Table 11. Combinations of Models
The above table shows that the third combination returns the best result, with a ranking of 40/485. From the graph below, we see that the difference between the current best result and the top result is around 0.09875 - 0.09340 = 0.00535 in RMSLE. Instead of generating more models to close this 0.00535 gap on the test data, the focus of the project shifts to analyzing the predicted values already obtained and their implications for inventory policy. In the next section, the second objective of the project is introduced and explained in detail.
Figure 5. Ranking of Combinations of Models
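A minimal sketch of how such combination weights can be obtained is shown below. P and Ptest are assumed matrices whose columns hold each constituent model's predictions on the training and test data, and y is the actual training response; the fitted coefficients correspond in spirit to those reported in the appendices.

```matlab
% Sketch with assumed names: columns of P are the constituent models' training
% predictions; y is the actual units sold; Ptest holds their test predictions.
stack = fitlm(P, y);                     % fits y ~ 1 + x1 + ... + xk (cf. Appendix 2)
combined = predict(stack, Ptest);        % linearly combined prediction
combined = max(combined, 0);             % predicted sales cannot be negative
```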
5 Implications
5.1 Cross Validation
Although a lower RMSLE means a higher ranking among the participating teams, the generality of the model needs further verification. For this reason, cross validation is applied to the training data, while the test data is ignored since its actual sales values are not provided. Specifically, 5-fold cross validation is applied: the observations for each product in each store are partitioned into 5 disjoint subsamples (folds) of roughly equal size, chosen randomly. Each time, 4 folds are used for training and the remaining fold is used for evaluation, and predicted values for that held-out fold are produced. This process is repeated 5 times, leaving out a different fold each time. The models used to train the data are the most effective ones generated in Sections 4.2.1, 4.2.2, and 4.2.3. The RMSLE of each model is ranked in order to compare the prediction performance under cross validation with the performance of the submissions to the online competition. The results are as follows:
Model test RMSLE ranking cross-validation RMSLE ranking
Stepwise Linear Regression 0.10477 5 0.129844 5
Ensemble Learning – LS Boosting -18 features 0.10388 4 0.122193 3
Ensemble Learning –Bagging -18 features 0.10142 3 0.105286 2
Ensemble Learning –Bagging -171 features 0.09907 2 0.1029 1
Linear combination of the previous 4 models 0.09818 1 0.123611 4
Table 12. Cross Validation
From the table above, we notice that the linear combination of models does not work well under cross validation (ranked fourth out of five). If we ignore that last row, the remaining four models share the same ranking for both the test-data RMSLE in the online competition and the cross-validation RMSLE on the training data. With these results, we are more confident in applying the best prediction model (Ensemble Learning - Bagging - 171 features) to the analysis of inventory policy.
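The out-of-fold predictions used here can be produced as sketched below, with assumed variable names for one store/item group and the bagged ensemble as the learner.

```matlab
% Sketch with assumed names: X/y are the predictors and units sold for one
% store/item group; cvPred collects out-of-fold predictions.
cv = cvpartition(size(X, 1), 'KFold', 5);
cvPred = zeros(size(y));
for f = 1:cv.NumTestSets
    tr = training(cv, f);                              % four folds for training
    te = test(cv, f);                                  % held-out fold
    mdl = TreeBagger(100, X(tr, :), y(tr), 'Method', 'regression');
    cvPred(te) = predict(mdl, X(te, :));               % out-of-fold predictions
end
cvRMSLE = sqrt(mean((log1p(cvPred) - log1p(y)).^2));   % training-data RMSLE
```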
5.2 Evaluating Forecasts
In this section, two common measures of forecast accuracy are applied to the predictions for the training data generated with cross validation in the previous section. Specifically, these two measures are the mean absolute deviation (MAD) and the mean absolute percentage error (MAPE).
To calculate these measures, denote $e_i$ as the difference between the forecast value and the actual value for each observation in the training data, and suppose there are $n$ observations. MAD and MAPE are calculated as:
$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |e_i|$$
$$\mathrm{MAPE} = \left[\frac{1}{n}\sum_{i=1}^{n} \left|\frac{e_i}{D_i}\right|\right] \times 100\%$$
Because some products have many days with zero sales, the $D_i$ used in MAPE is replaced with the average demand to avoid undefined values. Each of the above measures is applied to each product in each store. Since there are 255 combinations of stores and products, 255 MADs and MAPEs are generated.
It should be noted that the original model that generates the best result includes feature 18, the average sales 7 days after today. However, when developing an inventory policy based on the predictions, the data for this feature is obviously not available in real life. For this reason, feature 18 and its interaction terms with other predictors are eliminated, and a new cross-validated ensemble learning model is built with this update; MADs and MAPEs are then calculated. It turns out that feature 18 contributes little to the original model, and its elimination does not significantly influence the original predicted values. To illustrate this point, the ranking of variable importance for predicting sales of product 23 in store 8 is shown as an example:
rank variables importance rank variables importance rank variables importance
1 7 4.47E-04 31 102 4.43E-06 61 12 1.95E-06
2 78 5.98E-05 32 62 4.41E-06 62 37 1.75E-06
3 21 5.11E-05 33 17 4.28E-06 63 58 1.60E-06
4 3 4.66E-05 34 147 4.27E-06 64 76 1.57E-06
5 87 3.88E-05 35 98 4.15E-06 65 38 1.39E-06
6 63 2.41E-05 36 77 4.14E-06 66 35 1.08E-06
7 24 2.29E-05 37 138 4.14E-06 67 4 9.18E-07
8 5 1.98E-05 38 103 4.07E-06 68 39 9.00E-07
9 8 1.45E-05 39 42 3.92E-06 69 80 6.61E-07
10 66 1.28E-05 40 112 3.75E-06 70 55 6.43E-07
11 20 1.08E-05 41 28 3.68E-06 71 22 5.86E-07
12 29 9.34E-06 42 111 3.59E-06 72 101 5.59E-07
13 113 9.33E-06 43 43 3.58E-06 73 71 5.53E-07
14 83 9.29E-06 44 27 3.50E-06 74 132 4.33E-07
15 1 9.14E-06 45 53 3.31E-06 75 127 4.25E-07
16 2 8.70E-06 46 36 3.16E-06 76 44 4.12E-07
17 117 7.68E-06 47 70 3.14E-06 77 126 3.83E-07
18 133 5.70E-06 48 134 3.01E-06 78 89 3.67E-07
19 81 5.40E-06 49 88 2.95E-06 79 68 3.49E-07
20 108 5.38E-06 50 19 2.82E-06 80 41 3.46E-07
21 82 5.17E-06 51 11 2.75E-06 81 128 3.39E-07
22 143 5.04E-06 52 69 2.74E-06 82 10 2.48E-07
23 50 4.84E-06 53 18 2.57E-06 83 110 2.24E-07
24 33 4.76E-06 54 49 2.34E-06 84 6 2.23E-07
25 57 4.68E-06 55 99 2.25E-06 85 26 2.04E-07
26 56 4.67E-06 56 65 2.18E-06 86 64 1.73E-07
27 52 4.60E-06 57 139 2.11E-06 87 92 1.03E-07
28 48 4.52E-06 58 51 2.09E-06 88 93 8.48E-08
29 75 4.48E-06 59 104 2.07E-06 89 94 3.20E-08
30 34 4.47E-06 60 23 2.00E-06
Table 13. Variable Importance
We see that feature 18 (the average sales 7 days after today) ranks 53rd among all the features and is about half as important as feature 17 (the average sales 7 days before today).
Since MAD and MAPE each have 255 values, it is not convenient to show them all in the report. Instead, the detailed values for the top 10 and bottom 10 store and item combinations, sorted in descending order of average daily sales, are shown in tables, while the remaining values are shown in graphs to indicate the trends in MAD and MAPE. The tables and graphs are as follows:
Top 10 in average daily sales:
store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand
33 44 189903 914 36.219 0.115 207.771
16 25 135046 857 28.097 0.118 157.580
30 44 136473 868 26.824 0.317 157.227
17 9 135367 939 45.548 0.204 144.161
2 44 117125 875 21.016 0.120 133.857
4 9 117123 960 36.619 0.190 122.003
33 9 101586 914 36.785 0.227 111.144
25 9 98560 1011 28.217 0.157 97.488
34 45 87419 947 15.747 0.125 92.312
38 45 80068 875 15.488 0.130 91.506
Table 14. Top 10 in average daily sales
Bottom 10 in average daily sales:
store_nbr item_nbr sum of sales # of days recorded MAD MAPE mean daily demand
16 85 67 857 0.099 0.810 0.078
40 106 78 1011 0.093 1.049 0.077
9 105 73 947 0.099 0.884 0.077
22 104 68 898 0.094 0.883 0.076
38 86 62 875 0.088 0.929 0.071
25 84 69 1011 0.087 0.906 0.068
20 106 61 896 0.085 0.968 0.068
31 104 58 947 0.070 1.025 0.061
34 84 46 947 0.065 0.883 0.049
3 102 31 896 0.045 0.936 0.035
Table 15. Bottom 10 in average daily sales
MADs for each store and item combination, sorted by average daily sales in descending order:
Figure 6. MAD for each store and item combination (x-axis: average daily sales, sorted in descending order)
MAPEs for each store and item combination, sorted by average daily sales in descending order:
Figure 7. MAPE for each store and item combination (x-axis: average daily sales, sorted in descending order)
The above plots show that, in general, MAD decreases as average daily sales decrease, while MAPE increases as average daily sales decrease. For MAD, some store and item combinations do not perform as well as others; this is particularly obvious for items with a large volume of sales. For those combinations, extra effort to fit a better model may be worthwhile as a further step. For MAPE, we can see a big jump from an average of around 0.1 to an average of around 0.4 when the average daily sales drop to around five. It should be noted, however, that MAPE is scale sensitive and should not be used with low-volume data, because when the average demand is very low, the denominator in the MAPE formula will often make MAPE take on extreme values.
5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock
In general, the variance of the forecast error is higher than the variance of demand, since the forecast error also incorporates sampling error. If a forecast is used to estimate the mean demand, safety stock is kept in order to protect against the error in that forecast.6 Thus, the standard deviation (STD) of the forecast errors, rather than the standard deviation of demand, should be used to calculate safety stock.
Because the model was built with 5-fold cross validation, each prediction group (generated by a model trained on the other four folds) accounts for only one fifth of the overall predictions. Thus, instead of calculating the standard deviation over all predictions, the average of the standard deviations of the five prediction groups is used, in order to stay consistent with the cross-validation setup. A graph of the averaged STDs against the mean daily demand for each of the 255 store and item combinations is shown below:
6 Steven Nahmias, Production and Operations Analysis (New York: McGraw-Hill/Irwin, 2009).
Figure 8. Averaged STD for each store and item combination (x-axis: average daily sales, sorted in descending order)
Assuming overnight replenishment and a 98% service level (which corresponds to a z-score of about 2.05), the daily safety stock is calculated as approximately 2 × averaged STD. The percentage of daily safety stock over average daily demand for each store and item combination is shown in Figure 9 below.
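A minimal sketch of this calculation is given below; stdErr and meanDailyDemand are assumed names for the averaged standard deviation of forecast errors and the mean daily demand of one store/item combination, and the report rounds the 98% z-score of about 2.05 down to 2.

```matlab
% Sketch with assumed names for one store/item combination
z = norminv(0.98);                                   % about 2.05 for a 98% service level
safetyStock = z * stdErr;                            % daily safety stock, overnight replenishment
pctOfDemand = safetyStock / meanDailyDemand * 100;   % quantity plotted in Figures 9 and 10
```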
Figure 9. Percentage of safety stock over average daily demand for each store and item combination (x-axis: average daily sales, sorted in descending order)
The part of the previous graph covering only combinations with average daily sales above 5 units is shown below:
Figure 10. Percentage of safety stock over average daily demand for combinations with average daily sales above 5 units
From the plot, we notice that for products with average daily sales below five, the percentage of safety stock over average daily demand increases dramatically and fluctuates widely. This raises the question of whether it is profitable to keep these low-demand products in stock, since their safety stock is much larger than their daily demand. However, as with MAPE, when the average daily demand is very close to zero, its position in the denominator often makes the percentage take on very high values, which may partially account for the spikes in the graph.
6 Conclusion
For the first objective, fitting an effective model to lower the RMSLE on the test data, three different methods with different model parameters are tested sequentially. Stepwise linear regression gives the highest (worst) RMSLE of the three methods, k-nearest neighbors search generates a better result, and ensemble learning provides the best prediction performance. A linear combination of models improves the prediction performance on the test data even further, although this combination does not generalize well, as indicated by its poor performance when tested on the training data alone using cross validation. The variable importance implies that weather information is not significant for predicting daily sales; instead, features related to time contribute far more and rank among the top features in the importance ranking. Thus, although these products are assumed to be weather-sensitive, weather does not influence their sales as much as originally supposed. Future research on other machine learning techniques may further improve prediction performance. However, the robustness of the model should always be kept in mind when the predictions are used in business activities such as setting inventory policy.
The second objective allows us to dive into the implications of the predictions. Under cross validation, the ensemble tree model proves its robustness. It is natural that MAD decreases along with average daily demand, yet products with a rather large MAD compared to others with similar average daily demand may require more attention for further model improvement. In addition, the two spikes in MAPE before the aforementioned jump at around 5 average daily sales raise concern; the models behind these two spikes should be further tested with other machine learning techniques. Finally, the calculated safety stock, expressed as a percentage of average daily demand, raises the question of whether some products are profitable enough to be maintained on the store shelves. Although no further data is provided, inventory costs such as the holding cost, obsolescence cost, ordering cost, storage space cost, and transportation cost for those products should all be taken into account when more detailed information about them becomes available.
7 References
“Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle.” Accessed December 9, 2015.
https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
“Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm.” Accessed
December 10, 2015. http://www.mathworks.com/help/stats/stepwiselm.html.
“Classification Using Nearest Neighbors - MATLAB & Simulink.” Accessed December 10, 2015.
http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of
Boosting Papers.” Ann. Statist 32 (2004): 102–7.
“Ensemble Learning - Wikipedia, the Free Encyclopedia.” Accessed December 9, 2015.
https://en.wikipedia.org/wiki/Ensemble_learning.
Nahmias, Steven. Production and Operations Analysis. New York: McGraw-Hill/Irwin, 2009.
8 Appendices
1. Linear regression model of 1 k nearest neighbors and 1 ensemble learning:
y ~ 1 + x1 + x2
Estimated Coefficients:
Estimate SE tStat pValue
________ _________ ______ ___________
(Intercept) 0.24598 0.044419 5.5377 3.0688e-08
Ensemble learning 0.85715 0.0058669 146.1 0
K-nearest neighbors 0.21063 0.0055461 37.977 1.2375e-314
Root Mean Squared Error: 18.8
R-squared: 0.773, Adjusted R-Squared 0.773
2. Linear regression model of 1 stepwise linear regression and 3 ensemble learning:
y ~ 1 + x1 + x2 + x3 + x4
Estimated Coefficients:
Estimate SE tStat pValue
________ _________ _______ __________
(Intercept) -0.18353 0.033454 -5.486 4.1145e-08
x1 0.2025 0.0043553 46.495 0
x2 0.33965 0.0049085 69.195 0
x3 -0.36155 0.017711 -20.414 1.5038e-92
x4 0.8802 0.017254 51.014 0
Root Mean Squared Error: 14.3
R-squared: 0.868, Adjusted R-Squared 0.868
More Related Content

Similar to Master's Project Report - Minchao Lin

Methodology How To Implement Screen Communication In A Profittable Way
Methodology How To Implement Screen Communication In A Profittable WayMethodology How To Implement Screen Communication In A Profittable Way
Methodology How To Implement Screen Communication In A Profittable WayJon Marius Bastoe
 
Rollout solution template SAP SD
Rollout solution template   SAP SDRollout solution template   SAP SD
Rollout solution template SAP SDMohammed Azhad
 
Inventory management, loading strategy and warehouse categorization
Inventory management, loading strategy and warehouse categorizationInventory management, loading strategy and warehouse categorization
Inventory management, loading strategy and warehouse categorizationMihir Sangodkar
 
SAP ERP1. Describe the Production Planning and Execution process.docx
SAP ERP1. Describe the Production Planning and Execution process.docxSAP ERP1. Describe the Production Planning and Execution process.docx
SAP ERP1. Describe the Production Planning and Execution process.docxkenjordan97598
 
Ecommerce Market Mix Modeling using Linear Regression
Ecommerce Market Mix Modeling using Linear RegressionEcommerce Market Mix Modeling using Linear Regression
Ecommerce Market Mix Modeling using Linear RegressionAchal Kagwad
 
Cant track 28 kp is focus on these top 10
Cant track 28 kp is focus on these top 10Cant track 28 kp is focus on these top 10
Cant track 28 kp is focus on these top 10MRPeasy
 
In today’s industry, a supply chain manager should be ready to c.docx
In today’s industry, a supply chain manager should be ready to c.docxIn today’s industry, a supply chain manager should be ready to c.docx
In today’s industry, a supply chain manager should be ready to c.docxbradburgess22840
 
Retail analytics - Improvising pricing strategy using markup/markdown
Retail analytics - Improvising pricing strategy using markup/markdownRetail analytics - Improvising pricing strategy using markup/markdown
Retail analytics - Improvising pricing strategy using markup/markdownSmitha Mysore Lokesh
 
Chapter 4 5 Inventory.pptx
Chapter 4  5 Inventory.pptxChapter 4  5 Inventory.pptx
Chapter 4 5 Inventory.pptxSheldon Byron
 
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy1WorldSync
 
Walmart Sales Prediction Using Rapidminer Prepared by Naga.docx
Walmart Sales Prediction Using Rapidminer Prepared by  Naga.docxWalmart Sales Prediction Using Rapidminer Prepared by  Naga.docx
Walmart Sales Prediction Using Rapidminer Prepared by Naga.docxcelenarouzie
 
SALES_FORECASTING of sparkflows.pdf
SALES_FORECASTING of sparkflows.pdfSALES_FORECASTING of sparkflows.pdf
SALES_FORECASTING of sparkflows.pdfSparkflows
 
Ahmed Elmalla - Business Case KACST
Ahmed Elmalla  - Business Case KACSTAhmed Elmalla  - Business Case KACST
Ahmed Elmalla - Business Case KACSTAhmed Elmalla
 
8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docxblondellchancy
 
8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docxsodhi3
 
5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)Abhishek Datta
 
Leverage IoT to Setup Smart Manufacturing Solutions
Leverage IoT to Setup Smart Manufacturing SolutionsLeverage IoT to Setup Smart Manufacturing Solutions
Leverage IoT to Setup Smart Manufacturing SolutionsSoftweb Solutions
 
How to achieve traceability in manufacturing?
How to achieve traceability in manufacturing?How to achieve traceability in manufacturing?
How to achieve traceability in manufacturing?MRPeasy
 
21 hand out on waste quantification -samantha
21  hand out on waste quantification -samantha21  hand out on waste quantification -samantha
21 hand out on waste quantification -samanthazubeditufail
 

Similar to Master's Project Report - Minchao Lin (20)

Methodology How To Implement Screen Communication In A Profittable Way
Methodology How To Implement Screen Communication In A Profittable WayMethodology How To Implement Screen Communication In A Profittable Way
Methodology How To Implement Screen Communication In A Profittable Way
 
Rollout solution template SAP SD
Rollout solution template   SAP SDRollout solution template   SAP SD
Rollout solution template SAP SD
 
Inventory management, loading strategy and warehouse categorization
Inventory management, loading strategy and warehouse categorizationInventory management, loading strategy and warehouse categorization
Inventory management, loading strategy and warehouse categorization
 
SAP ERP1. Describe the Production Planning and Execution process.docx
SAP ERP1. Describe the Production Planning and Execution process.docxSAP ERP1. Describe the Production Planning and Execution process.docx
SAP ERP1. Describe the Production Planning and Execution process.docx
 
Ecommerce Market Mix Modeling using Linear Regression
Ecommerce Market Mix Modeling using Linear RegressionEcommerce Market Mix Modeling using Linear Regression
Ecommerce Market Mix Modeling using Linear Regression
 
Cant track 28 kp is focus on these top 10
Cant track 28 kp is focus on these top 10Cant track 28 kp is focus on these top 10
Cant track 28 kp is focus on these top 10
 
In today’s industry, a supply chain manager should be ready to c.docx
In today’s industry, a supply chain manager should be ready to c.docxIn today’s industry, a supply chain manager should be ready to c.docx
In today’s industry, a supply chain manager should be ready to c.docx
 
Retail analytics - Improvising pricing strategy using markup/markdown
Retail analytics - Improvising pricing strategy using markup/markdownRetail analytics - Improvising pricing strategy using markup/markdown
Retail analytics - Improvising pricing strategy using markup/markdown
 
Chapter 4 5 Inventory.pptx
Chapter 4  5 Inventory.pptxChapter 4  5 Inventory.pptx
Chapter 4 5 Inventory.pptx
 
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy
1 WorldSync 5 Point Best Practice Process to Improve Product Data Accuracy
 
Walmart Sales Prediction Using Rapidminer Prepared by Naga.docx
Walmart Sales Prediction Using Rapidminer Prepared by  Naga.docxWalmart Sales Prediction Using Rapidminer Prepared by  Naga.docx
Walmart Sales Prediction Using Rapidminer Prepared by Naga.docx
 
SALES_FORECASTING of sparkflows.pdf
SALES_FORECASTING of sparkflows.pdfSALES_FORECASTING of sparkflows.pdf
SALES_FORECASTING of sparkflows.pdf
 
Brs3e online only-ch12
Brs3e online only-ch12Brs3e online only-ch12
Brs3e online only-ch12
 
Ahmed Elmalla - Business Case KACST
Ahmed Elmalla  - Business Case KACSTAhmed Elmalla  - Business Case KACST
Ahmed Elmalla - Business Case KACST
 
8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx
 
8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx8©FotosearchSuperStockCapacity DecisionsLearning .docx
8©FotosearchSuperStockCapacity DecisionsLearning .docx
 
5 How The Model Works (With Notes)
5 How The Model Works (With Notes)5 How The Model Works (With Notes)
5 How The Model Works (With Notes)
 
Leverage IoT to Setup Smart Manufacturing Solutions
Leverage IoT to Setup Smart Manufacturing SolutionsLeverage IoT to Setup Smart Manufacturing Solutions
Leverage IoT to Setup Smart Manufacturing Solutions
 
How to achieve traceability in manufacturing?
How to achieve traceability in manufacturing?How to achieve traceability in manufacturing?
How to achieve traceability in manufacturing?
 
21 hand out on waste quantification -samantha
21  hand out on waste quantification -samantha21  hand out on waste quantification -samantha
21 hand out on waste quantification -samantha
 

Master's Project Report - Minchao Lin

  • 1. Master’s Project Report Sales Prediction of 111 Weather Sensitive Products in 45 Walmart Stores using Machine Learning Techniques and Discussion on its Implications for Inventory Policy by Minchao Lin December 10, 2015
  • 2. Contents 1 Motivation................................................................................................................................ 3 2 Objectives ................................................................................................................................ 3 3 Data Description ...................................................................................................................... 4 3.1 Training Data and Test Data ................................................................................................. 4 3.2 Data Features......................................................................................................................... 5 3.3 Feature Engineering .............................................................................................................. 6 3.4 Feature Correlation................................................................................................................ 8 4 Models and Techniques ......................................................................................................... 10 4.1 Performance Metric............................................................................................................. 10 4.2 Models................................................................................................................................. 11 4.2.1 Stepwise Linear Regression.......................................................................................... 11 4.2.2 K-Nearest Neighbors Search ........................................................................................ 13 4.2.3 Ensemble Learning....................................................................................................... 17 4.2.4 Combinations of Models .............................................................................................. 19 5 Implications............................................................................................................................ 20 5.1 Cross Validation.................................................................................................................. 20 5.2 Evaluating Forecasts ........................................................................................................... 21 5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock .................... 26 6 Conclusion ............................................................................................................................. 29 7 References.............................................................................................................................. 30 8 Appendices............................................................................................................................. 31
  • 3. 1 Motivation Demand forecasting and inventory control are two of the most important aspects in supply chain management. An accurate prediction of demand can not only help replenishment managers correctly predict the level of inventory needed but also avoid being out of stock or overstock. To better forecast demand, we need to take into consideration the various factors that may have significant contribution to the demand variability. For a retail store, extreme weather events such as hurricanes and blizzards can have a huge impact on sales at the store and product level. Thus, accurately predicting the sales of potentially weather-sensitive products around the time of major weather events becomes essential to the timely adjustment in inventory. In addition, the difference between the predicted and realized demand can also provide further information for setting the inventory policy such as the level of safety stock. 2 Objectives The objectives of this project are two-fold. The first objective is to fit an effective model to predict the sales of 111 potentially weather-sensitive products that are affected by snow and rain in 45 Walmart retail stores. For each product specifically, the task is to predict the units sold for a window of ±3 days surrounding each storm. The model performance is evaluated with the Root Mean Squared Logarithmic Error (RMSLE) and compared with other 485 teams’ results in the online Walmart recruiting competition. The training data used to generate the model is provided with actual product demand and actual weather data while the actual demand in the test data used to evaluate the effectiveness of predicted demand is not provided. The only way to know the efficiency of the model is by submitting the predicted demand online and obtaining its RMSLE. Considering that the actual demand in the test data is unknown which will limit further
• 4. analysis of the inventory policy for these products, a second objective is introduced. The second objective of the project is to fully utilize the training data by applying the most effective model from the previous steps via cross validation, comparing the predicted and actual demand for each product, and then analyzing the safety stock of each product.
3 Data Description
3.1 Training Data and Test Data
Sales data are provided for 111 products whose sales may be affected by the weather, such as milk, bread, and umbrellas. These 111 products are sold in stores at 45 different Walmart locations. Each product's id is provided, but not its name or description. The competition teams are reminded that some of the products are similar but have a different id in different stores. The 45 store locations are covered by 20 weather stations, and some stores share a weather station. The full observed weather covering both the training data and the test data is provided. The training data contains 4,617,600 observations and the test data contains 526,917 observations. In the following graph, the green dots show the training set days, the red dots show the test set days, and the days marked event=True are the days with storms. The graph covers all 20 weather stations.
• 5. Figure 1. Training set days and test set days for 20 weather stations.¹
3.2 Data Features
The features in the training data provided include:
- date
- store id
- item id
- number of units sold
The features in the weather data provided include:
¹ "Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle," accessed December 9, 2015, https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
• 6. - date
- weather station id
- dew point temperature
- wet bulb temperature
- heating degree days
- cooling degree days
- time for sunrise
- time for sunset
- significant weather types
- snowfall in inches
- water equivalent of rainfall and melted snow
- average station pressure
- average sea pressure
- resultant wind speed
- resultant wind direction
- average wind speed
3.3 Feature Engineering
In order to better describe the underlying structure in the data, new features are created based on observation and analysis of the provided original data. It is reasonable to assume that sales on each day may be related to the position of that day in a month, in a year, or in the whole timeframe of the provided dataset, so the new features generated from the date include day in month, month, day in year, year, a numeric value for each date, weekday, and whether that day is a holiday. In addition, from observation of the data, it is noticed that sales vary significantly from month to month. Thus, the monthly average of sales for each product is calculated and serves as another new feature. Based on the monthly average sales, a binary variable identifying whether the monthly average sales equals zero is created. Indicating whether the same month has zero sales in each year for a product can provide further detail for the predicted demand during that month, thus improving the accuracy of the model.
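To make the date-based feature construction concrete, here is a minimal pandas sketch. The report built its features in MATLAB, so this is only an illustrative analogue; the column names (date, store_nbr, item_nbr, units), the exact grouping used for the monthly average, and the holiday list are assumptions.

```python
# Illustrative sketch only, not the author's MATLAB feature pipeline.
import pandas as pd

def add_date_features(df: pd.DataFrame, holidays: set) -> pd.DataFrame:
    """df has assumed columns date, store_nbr, item_nbr, units; holidays is a set of pandas Timestamps."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["date_num"] = (df["date"] - df["date"].min()).dt.days   # numeric value for each date
    df["month"] = df["date"].dt.month
    df["day_in_month"] = df["date"].dt.day
    df["year"] = df["date"].dt.year
    df["weekday"] = df["date"].dt.weekday
    df["day_in_year"] = df["date"].dt.dayofyear
    df["is_holiday"] = df["date"].isin(holidays).astype(int)
    # Monthly average sales per store/item (grouping choice is an assumption),
    # plus the binary flag for months whose average sales equal zero.
    grp = df.groupby(["store_nbr", "item_nbr", "year", "month"])["units"]
    df["monthly_avg_sales"] = grp.transform("mean")
    df["zero_sales_month"] = (df["monthly_avg_sales"] == 0).astype(int)
    return df
```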
• 7. Temperature can be another related feature, because a temperature that is too high or too low may influence a customer's decision to go out or stay home. In addition, the "feels like" temperature may be a better indicator. Since the "feels like" temperature is related to the moisture in the air, two new features identifying the moisture in the air in two different ways are created. The first feature calculates the difference between the dew point temperature and the average temperature, since this difference represents how far the amount of moisture in the air is from saturation. The second feature calculates the difference between the wet bulb temperature and the average temperature. This difference reflects the relative humidity in the air: the larger the difference, the lower the relative humidity. The precipitation and average wind speed features are included directly without further processing. The snowfall feature is eliminated because it includes too many undefined values (NaN or empty cells). Resultant wind speed is not included either, as it is closely correlated with average wind speed. The rest of the features in the weather data are ignored either because they consist of too many different text entries that are hard to describe numerically or because they have no intuitive relationship with product sales. These features include heating degree days, cooling degree days, time for sunrise, time for sunset, significant weather types, average station pressure, average sea pressure, and resultant wind direction.
Because some products have a lot of zero sales, I assume that the number of days with zero sales before or after each day may also have an influence on the sales of that day. Three new features are created based on this assumption: the number of continuous days with zero sales before today, the number of continuous days with zero sales after today, and the minimum of the previous two features.
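A short sketch of the zero-sales run-length features just described, for a single store and item series ordered by date. This is illustrative only (not the author's MATLAB code), and the reading of "days with zero sales before/after today" as the run of consecutive zero-sales days adjacent to today is one plausible interpretation.

```python
# Illustrative sketch of the three run-length features, under the assumption
# that `units` is the daily sales of one store/item, ordered by date.
import numpy as np

def zero_run_features(units: np.ndarray):
    n = len(units)
    zeros_before = np.zeros(n, dtype=int)   # consecutive zero-sales days immediately before day i
    zeros_after = np.zeros(n, dtype=int)    # consecutive zero-sales days immediately after day i
    for i in range(1, n):
        zeros_before[i] = zeros_before[i - 1] + 1 if units[i - 1] == 0 else 0
    for i in range(n - 2, -1, -1):
        zeros_after[i] = zeros_after[i + 1] + 1 if units[i + 1] == 0 else 0
    return zeros_before, zeros_after, np.minimum(zeros_before, zeros_after)
```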
• 8. Besides the number of days with zero sales, the average sales before or after each day may also impact the sales of each day. Thus, I created one more variable calculating the average sales over the seven days before today, and another variable calculating the average sales over the seven days after today. If the seven days before a date are not all included in the training data (that is, some of those dates fall in the test data), the average is calculated over only the available sales in the training data.
To conclude, the features used to build the models are:
1. numeric date
2. month
3. day in month
4. year
5. weekday
6. is holiday or not
7. day in year
8. monthly average sales
9. is a month having zero sales or not
10. precipitation
11. average wind speed
12. difference between average temperature and dew point temperature
13. difference between average temperature and wet bulb temperature
14. number of continuous days with zero sales after today
15. number of continuous days with zero sales before today
16. minimum of the number of continuous days with zero sales before or after today
17. average sales seven days before today
18. average sales seven days after today
3.4 Feature Correlation
Because multiple variables are used to generate the model, a multicollinearity problem may arise if these variables are not independent. As a first step towards model specification, it is useful to identify any possible dependencies among the predictors. The correlation matrix is a standard measure of the strength of pairwise linear relationships. In the following table, the correlation coefficient (R) between each pair of numeric variables is calculated:
Variables   1         2         3         4         5         6         7         8         9         10
1           1         0.0066    -0.038    0.035     -0.11     -0.12     -0.29     -0.19     -0.24     0.015
2           0.0066    1         0.027     0.056     -0.20     0.067     -0.35     -0.40     -0.42     0.82
• 9. 3           -0.038    0.027     1         0.12      -0.37     -0.027    0.023     0.029     0.049     0.033
4           0.035     0.056     0.12      1         0.24      0.020     0.10      -0.071    0.0011    0.064
5           -0.11     -0.20     -0.37     0.24      1         0.040     0.23      0.13      0.18      -0.16
6           -0.12     0.067     -0.027    0.020     0.040     1         -0.047    -0.053    -0.047    -0.035
7           -0.29     -0.35     0.023     0.10      0.23      -0.047    1         0.17      0.58      -0.25
8           -0.19     -0.40     0.029     -0.071    0.13      -0.053    0.17      1         0.58      -0.36
9           -0.24     -0.42     0.049     0.0011    0.18      -0.047    0.58      0.58      1         -0.35
10          0.015     0.82      0.033     0.064     -0.16     -0.035    -0.25     -0.36     -0.35     1
Table 1. R value between each pair of numeric variables
Variables 1 to 10 represent the features: numeric date, monthly average sales, precipitation, average wind speed, average temperature minus dew point temperature, average temperature minus wet bulb temperature, number of continuous days with zero sales after today, number of continuous days with zero sales before today, and the minimum value of the previous two features. From the table, we observe that only the number of continuous days with zero sales after today and the number of continuous days with zero sales before today have a moderate correlation with the minimum value of the previous two features. This moderate correlation is dealt with in the ensemble methods, where only a subset of features is selected to generate each decision tree. The other R values show little correlation between the remaining pairs of features.
Besides pairwise correlation, relationships among arbitrary feature subsets may imply a multicollinearity problem. To diagnose multicollinearity, we can calculate the variance inflation factor (VIF). VIF quantifies the severity of multicollinearity in an ordinary least squares regression analysis, and it is calculated as:
$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$
When the variation of feature $i$ is largely explained by a linear combination of the other features, $R_i^2$ is close to 1 and the VIF for that feature is correspondingly large.
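A minimal illustration of the VIF calculation implied by the formula above: regress each feature on the others and apply 1 / (1 − R²). This is a Python sketch rather than the author's MATLAB workflow, and X is assumed to be a numeric feature matrix.

```python
# Illustrative VIF computation: one auxiliary regression per feature.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X: np.ndarray) -> np.ndarray:
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)               # all features except feature j
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out[j] = 1.0 / (1.0 - r2)                      # VIF_j = 1 / (1 - R_j^2)
    return out
```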
• 10. A rule of thumb is that if VIF is greater than 10, then multicollinearity is high. Again, the VIF for the previous data is calculated:
Variables   1      2      3      4      5      6      7      8      9      10
VIF         1.20   3.65   1.27   1.18   1.45   1.05   1.91   1.84   2.44   3.24
Table 2. VIF for each variable
The above values show that monthly average sales and the minimum value of continuous days of zero sales before or after today have the two highest VIFs, but their values are still far below the threshold of 10. Thus, we conclude that no significant multicollinearity exists between the variables.
4 Models and Techniques
4.1 Performance Metric
For a regression problem, a measure of the distance between the estimated outputs and the actual outputs is used to quantify the model's performance. The Mean Squared Error penalizes larger differences more heavily because of the squaring. If we want to reduce the penalty on larger differences, we can log-transform the quantities first; the effect of introducing the logarithm is to balance the emphasis on small and large prediction errors. For the Walmart recruiting competition, submissions are evaluated with the Root Mean Squared Logarithmic Error (RMSLE):
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$
where:
- n is the number of observations in the test set
- $p_i$ is the predicted count
- $a_i$ is the actual count
- log(x) is the natural logarithm
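The metric can be expressed as a small helper that mirrors the definition above (an illustrative sketch, not the competition's scoring code):

```python
# RMSLE as defined above; log1p(x) = log(1 + x) with the natural logarithm.
import numpy as np

def rmsle(predicted: np.ndarray, actual: np.ndarray) -> float:
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))
```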
• 11. 4.2 Models
4.2.1 Stepwise Linear Regression
Stepwise linear regression creates a linear model and automatically adds or removes terms based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models using forward selection and backward elimination. Specifically, at each step, the p value of an F statistic is computed to test the model with and without a potential term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model. If the null hypothesis is rejected, the term with the smallest p value among all terms whose p values are below an entrance tolerance is added to the model. Conversely, if a term is already in the model, the null hypothesis is that the term has a zero coefficient; if there is no significant evidence to reject this hypothesis, the term with the greatest p value among all terms whose p values exceed an exit tolerance is removed from the model.² In this sense, stepwise models are locally optimal but may not be globally optimal.
² "Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm," accessed December 10, 2015, http://www.mathworks.com/help/stats/stepwiselm.html.
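Before turning to the fitted models, here is a simplified sketch of the forward/backward p-value loop described above. It is written with statsmodels in the spirit of MATLAB's stepwiselm, but it is illustrative only: the entrance and exit tolerances are assumed values, and MATLAB's actual F-test bookkeeping is more involved.

```python
# A simplified p-value-based stepwise selection loop (illustrative sketch).
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X: pd.DataFrame, y, p_enter=0.05, p_remove=0.10):
    selected = []
    while True:
        changed = False
        # Forward step: add the candidate with the smallest p value below the entrance tolerance.
        candidates = [c for c in X.columns if c not in selected]
        if candidates:
            pvals = {}
            for c in candidates:
                model = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
                pvals[c] = model.pvalues[c]
            best = min(pvals, key=pvals.get)
            if pvals[best] < p_enter:
                selected.append(best)
                changed = True
        # Backward step: remove the selected term with the largest p value above the exit tolerance.
        if selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            pvals = model.pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected
```

Note that stepwiselm also guards against cycling between adding and removing the same term; this sketch omits that safeguard.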
• 12. For this method, five stepwise models were built based on different combinations of variables (the feature numbers correspond to those listed in Section 3.3). The first four models are listed below:
RMSLE of each model    1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.12995                √ √ √ √ √
0.11892                √ √ √ √ √ √ √
0.13218                √ √ √ √ √ √ √ √ √ √ √ √ √
0.19076                √ √ √ √ √ √ √ √ √ √ √ √ √ √ √
Table 3. Stepwise Linear Regression Models
The model with the best RMSLE in the table is the second one, with an RMSLE equal to 0.11892. From the results, we can see that having more features does not necessarily improve the model. Thus, instead of creating more features, the focus was shifted from the predictor variables to the response variable. Since the performance metric for the online competition compares the log-transformed predicted and actual values, a log transformation is then applied to the response values (i.e., units sold for each item in each store) in the training data in an attempt to improve prediction performance. In order to avoid negative transformed values, log(1 + x) is applied to each response value. The best result is as follows:
RMSLE      1 2 3 4 5 6 8 9 10 11 14 15 16 17 18
0.10477    √ √ √ √ √
Table 4. Stepwise Linear Regression Models with log-transformed response variable
The above result shows that log transformation of the response values in the training data does improve performance. However, it is also observed that, even for log-transformed response values, having more features does not necessarily improve the model. The final ranking of the best stepwise linear regression model from above is 94/485.
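The log(1 + x) treatment of the response can be expressed with a target-transforming wrapper. The sketch below uses scikit-learn's TransformedTargetRegressor as a stand-in for the MATLAB workflow; it is illustrative only, and the plain LinearRegression is a placeholder for the stepwise model.

```python
# Illustrative: fit on log(1 + units) and map predictions back to units.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

model = TransformedTargetRegressor(
    regressor=LinearRegression(),   # placeholder for the stepwise linear model
    func=np.log1p,                  # response is fitted on log(1 + units)
    inverse_func=np.expm1,          # predictions are transformed back to units
)
# Usage (hypothetical arrays): model.fit(X_train, y_train); preds = model.predict(X_test)
```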
• 13. Figure 2. Ranking of Stepwise Linear Regression Model
4.2.2 K-Nearest Neighbors Search
K-nearest neighbors search finds the k closest points in X for each query point in Y; the predicted value is then calculated as either the average of those k closest points or their weighted average using inverse distance weights. Two different search methods can be used. The exhaustive search method computes the distance from each query point to every point in X, ranks them in ascending order, and returns the k points with the smallest distances. The Kd-tree search method divides the data into nodes with a certain bucket size based on the coordinates. The closest k points are first found within the node to which the query point in Y belongs; then points in any other node that lies within the largest of those k distances from the query point are examined as well. Using a Kd-tree for large data sets can be much more efficient than the exhaustive search method because it only calculates a subset of the distances. Distances can be measured with various metrics. The most common distance metric is the Euclidean distance. The other distance metrics tested later in this section are correlation distance, Spearman distance, cosine distance, and Hamming distance. Correlation distance is calculated as one minus the sample linear correlation between observations, which are treated as
• 14. sequences of values. Spearman distance is calculated as one minus the sample Spearman's rank correlation between observations, which are treated as sequences of values. Cosine distance is calculated as one minus the cosine of the included angle between observations, which are treated as vectors. Hamming distance is calculated as the percentage of coordinates that differ.³ Thus, the tuning parameters include the nearest neighbors search method, the way the predicted value is calculated from the closest neighbors, the number of closest neighbors, and the distance metric. The default MATLAB setting is followed to choose the search method: the exhaustive search method is used when the number of columns of X is more than 10, and the Kd-tree search method is used otherwise. For the exhaustive search method, all 18 predictors listed in Section 3.3 are included. Different distance metrics are tested first, with the number of closest neighbors fixed at 10. The results are as follows:
Distance metric         RMSLE
Euclidean distance      0.11189
Correlation distance    0.14171
Spearman distance       0.18862
Cosine distance         0.14401
Hamming distance        0.12848
Table 5. Testing Distance Metrics
From the table, we see that Euclidean distance works significantly better than the other distance metrics. Thus, for the next step, Euclidean distance is set as the distance metric. The number of closest neighbors is still set to 10. Yet instead of using the mean value of the 10 closest
³ "Classification Using Nearest Neighbors - MATLAB & Simulink," accessed December 10, 2015, http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
• 15. neighbors, the weighted average of the k closest points using inverse distance weights is used. The inverse distance weighted estimate is defined as
$u(x) = \frac{\sum_{i=1}^{N} w_i(x)\, u(x_i)}{\sum_{i=1}^{N} w_i(x)}$
where $w_i(x)$ is defined as
$w_i(x) = \frac{1}{d(x, x_i)^p}$
The results are as follows:
Ways to calculate predicted values                      RMSLE
Arithmetic mean                                         0.11189
Weighted mean with inverse distance weights (p = 1)     0.10341
Weighted mean with inverse distance weights (p = 2)     0.10473
Weighted mean with inverse distance weights (p = 3)     0.10732
Weighted mean with inverse distance weights (p = 7)     0.11666
Table 6. Testing ways to calculate predicted values
The above table shows that the weighted mean with inverse distance weights and p = 1 gives the best RMSLE. In the next step, this way of calculating the predicted values is kept, and different numbers of closest neighbors for each point in Y are tested. Let K denote the number of closest neighbors. The results are as follows:
K      RMSLE
3      0.11008
10     0.10341
40     0.10215
60     0.10193
80     0.10198
• 16. 100    0.10200
Table 7. Testing K values
For the Kd-tree search method, only predictors related to time are included. These variables correspond to features 1, 2, 3, 4, 5, and 7 in Section 3.3.
K      RMSLE
20     0.10182
60     0.10126
70     0.10136
Table 8. Kd-tree search method
Figure 3. Ranking of K-Nearest Neighbors Search
To conclude, the best k-nearest neighbors model uses Euclidean distance as the distance metric, predicts the response value with a weighted mean using inverse distance weights with p = 1, uses only the variables related to time (numeric date, month, day in month, year, day in year, and weekday) as predictors, and uses 60 as the number of closest neighbors. The best RMSLE is 0.10126, which ranks 66/485 in the competition.
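An approximate scikit-learn analogue of this best configuration is sketched below. It is illustrative only: the report used MATLAB's knnsearch, and the feature column names here are assumptions.

```python
# Illustrative KNN regressor mirroring the configuration described above.
from sklearn.neighbors import KNeighborsRegressor

time_features = ["date_num", "month", "day_in_month", "year", "weekday", "day_in_year"]

knn = KNeighborsRegressor(
    n_neighbors=60,        # best K reported above
    weights="distance",    # inverse distance weighting (p = 1 in the report's notation)
    metric="euclidean",
    algorithm="kd_tree",
)
# Usage (hypothetical frames): knn.fit(train[time_features], train["units"])
#                              preds = knn.predict(test[time_features])
```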
• 17. 4.2.3 Ensemble Learning
Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.⁴ Decision trees, neural networks, and other machine learning algorithms are commonly used as the constituent learners. A decision tree builds regression or classification models in the form of a tree structure, where the dataset is divided into smaller subsets at each node. In a regression tree, a regression model is fit to the target variable using each of the independent variables. For each independent variable, the data is split at several candidate split points, and the mean squared error between the predicted and actual values is calculated at each; the node splits the predictor variable at the split point that maximizes the reduction in mean squared error. Regression tree ensembles are built with two methods here: one is least squares boosting, and the other is bagging. Least squares boosting fits regression ensembles in order to minimize mean squared error: at every step, the ensemble fits a new learner to the difference between the observed response and the aggregated prediction of all learners grown previously.⁵ Bagging trains each model in the ensemble on a randomly drawn subset (with replacement) of the training set and obtains the predicted response of a trained ensemble by averaging the predictions from the individual trees. Random sampling with replacement omits on average 37% of the observations for each decision tree, and every tree in the ensemble can randomly select predictors for its decision splits.
⁴ "Ensemble Learning - Wikipedia, the Free Encyclopedia," accessed December 9, 2015, https://en.wikipedia.org/wiki/Ensemble_learning.
⁵ Jerome Friedman et al., "Discussion of Boosting Papers," Ann. Statist 32 (2004): 102–7.
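Rough open-source analogues of the two ensemble methods just described are sketched below. This is illustrative only: the report used MATLAB's regression tree ensembles, a random forest is used here as a stand-in for bagging with random predictor selection at each split, and the hyperparameter values are assumptions.

```python
# Illustrative stand-ins for the two ensemble methods.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Least squares boosting: each new tree is fit to the residual of the current ensemble.
ls_boost = GradientBoostingRegressor(loss="squared_error", n_estimators=200)

# Bagging with random predictor selection at each split: each tree is trained on a
# bootstrap sample (leaving out roughly 37% of rows on average) and predictions are averaged.
bagged = RandomForestRegressor(n_estimators=200, max_features="sqrt", n_jobs=-1)
```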
• 18. Since ensembles tend to overtrain, lasso regularization of the ensembles is implemented in order to choose fewer weak learners with no loss in predictive performance. To start, both least squares boosting and bagging are applied with all the predictor variables listed in Section 3.3 included. The results are as follows:
Ensemble Learning Method    RMSLE
Least Squares Boosting      0.10388
Bagging                     0.10142
Table 9. Ensemble Learning Methods
The results indicate that bagging works better than least squares boosting, so bagging is chosen as the ensemble learning method. To account for potential interactions between variables, two ways of adding interaction features are tried. The first method includes the products of all pairs of distinct predictors in the pool of features, which increases the number of features from 18 to 171. The other method includes only the interactions between numerical terms, which increases the number of features from 18 to 52. The ensemble method is then applied to both sets of data. The results are as follows:
Number of features    RMSLE
52                    0.11728
171                   0.09907
Table 10. Number of features
The result shows that including interaction terms between each pair of predictors significantly improves the model. Hence the best performance given by the regression tree ensembles has an RMSLE equal to 0.09907. The result ranks 47/482 in the competition.
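The 18-to-171 expansion corresponds to adding every pairwise product of distinct predictors (18 base features plus C(18, 2) = 153 products). A minimal sketch, illustrative rather than the author's MATLAB code:

```python
# Illustrative pairwise-interaction expansion of an 18-column feature matrix.
from sklearn.preprocessing import PolynomialFeatures

interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
# Usage (hypothetical array): X171 = interactions.fit_transform(X18)
# When X18 has 18 columns, X171 has 18 + 153 = 171 columns.
```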
• 19. Figure 4. Ranking of Ensemble Learning Method
4.2.4 Combinations of Models
In this section, three different combinations of the previously generated models are tested to see whether they improve the prediction performance. The first combination takes the median of the predicted values from all previous models for each entry in the test data. The second combination takes a linear combination of the most effective models from the k-nearest neighbors search and ensemble learning. The third combination is a linear combination of the three most effective ensemble learning models together with the most effective stepwise linear regression model. The coefficients of the linear combination are obtained by regressing the actual values on each model's predicted values for the training data. The results are as follows:
Combination of Models                                                                             RMSLE
Median                                                                                            0.09972
Linear combination of 1 k-nearest neighbors and 1 ensemble learning model (Appendix 1)           0.10384
Linear combination of 1 stepwise linear regression and 3 ensemble learning models (Appendix 2)   0.09818
Table 11. Combinations of Models
The above table shows that the third combination returns the best result, with a ranking of 40/485.
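A sketch of how such combination weights can be fitted, regressing the actual training response on each base model's training predictions. This is illustrative (the report did this in MATLAB), and the clipping step at the end is an added safeguard not mentioned in the report.

```python
# Illustrative linear "stacking" of base-model predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_combination(train_preds: list[np.ndarray], y_train: np.ndarray) -> LinearRegression:
    """train_preds holds one array of training-set predictions per base model."""
    return LinearRegression().fit(np.column_stack(train_preds), y_train)

def combine(model: LinearRegression, test_preds: list[np.ndarray]) -> np.ndarray:
    # Clip at zero so combined unit forecasts cannot go negative (an added safeguard).
    return np.clip(model.predict(np.column_stack(test_preds)), 0, None)
```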
• 20. From the graph below, we see that the difference between the current best result and the top result is around 0.09875 − 0.09340 = 0.00535 in RMSLE. Instead of generating more models to close the remaining 0.00535 gap on the test data, the focus of the project shifts to analyzing the predictions already obtained and their implications for inventory policy. In the next section, the second objective of the project is introduced and explained in detail.
Figure 5. Ranking of Combinations of Models
5 Implications
5.1 Cross Validation
Although for the competition the lower the RMSLE the higher the ranking among the participating teams, the generality of the model needs further proof. For this reason, cross validation is applied to the training data, while the test data is ignored since its actual sales values are not provided. Specifically, 5-fold cross validation is applied, which means each group of observations for each product in each store is partitioned into 5 disjoint subsamples (or folds),
• 21. chosen randomly but with roughly equal size. Each time, 4 folds are used for training and the remaining fold is used for evaluation, and predicted values for that held-out fold are generated. This process is repeated 5 times, leaving out a different fold each time. The models used for training are the most effective ones generated in Sections 4.2.1, 4.2.2, and 4.2.3. The RMSLE of each model is ranked in order to compare the prediction performance from cross validation with that of the submissions to the online competition. The results are as follows:
Model                                            testRMSLE   ranking   trainRMSLE   ranking
Stepwise Linear Regression                       0.10477     5         0.129844     5
Ensemble Learning – LS Boosting – 18 features    0.10388     4         0.122193     3
Ensemble Learning – Bagging – 18 features        0.10142     3         0.105286     2
Ensemble Learning – Bagging – 171 features       0.09907     2         0.1029       1
Linear combination of the previous 4 models      0.09818     1         0.123611     4
Table 12. Cross Validation
From the table above, we notice that the linear combination of models does not work well under cross validation (ranked fourth out of five). If we ignore that last row, the remaining four models share the same ranking in both the RMSLE for the test data in the online competition and the RMSLE from cross validation on the training data. With these results, we are more confident in applying the best prediction model (Ensemble Learning – Bagging – 171 features) to the analysis of inventory policy.
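A compact sketch of producing out-of-fold predictions with 5-fold cross validation is shown below. It is illustrative only: the random forest is a stand-in for the bagged 171-feature ensemble, and the per-store, per-item fold handling described above is simplified here into a single split over all rows.

```python
# Illustrative out-of-fold predictions via 5-fold cross validation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

def out_of_fold_predictions(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1)  # stand-in for the bagged ensemble
    # Each prediction comes from the model trained on the other four folds.
    return cross_val_predict(model, X, y, cv=cv)
# The resulting predictions can be scored with the rmsle() helper from Section 4.1.
```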
• 22. 5.2 Evaluating Forecasts
In this section, two common measures of forecast accuracy are applied to the predictions for the training data generated with cross validation in the previous section: the mean absolute deviation (MAD) and the mean absolute percentage error (MAPE). To calculate these two measures, denote $e_i$ as the difference between the forecast value and the actual value for each observation in the training data, and suppose there are n observations. MAD and MAPE are calculated as:
$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |e_i|$
$\mathrm{MAPE} = \left[\frac{1}{n}\sum_{i=1}^{n} \left|e_i / D_i\right|\right] \times 100\%$
Because some products have many days with zero sales, the $D_i$ used in MAPE is replaced with the average demand to avoid undefined values. Each of the above measures is applied to each product in each store. Since there are 255 combinations of stores and products, 255 MADs and MAPEs are generated.
It should be noted that the original model that generates the best result includes feature 18, the average sales 7 days after today. However, when developing an inventory policy based on the predictions, the data for this feature is obviously not available in real life. For this reason, feature 18 and its interaction terms with the other predictors are eliminated, and a new cross-validated ensemble learning model is built with this update. MADs and MAPEs are then calculated. It turns out that feature 18 contributes little to the original model, and its elimination does not have a significant influence on the original predicted values. To illustrate this point, the ranking of variable importance for predicting sales of product 23 in store 8 is shown as an example:
rank  variable  importance     rank  variable  importance     rank  variable  importance
1     7     4.47E-04           31    102   4.43E-06           61    12    1.95E-06
2     78    5.98E-05           32    62    4.41E-06           62    37    1.75E-06
3     21    5.11E-05           33    17    4.28E-06           63    58    1.60E-06
4     3     4.66E-05           34    147   4.27E-06           64    76    1.57E-06
5     87    3.88E-05           35    98    4.15E-06           65    38    1.39E-06
6     63    2.41E-05           36    77    4.14E-06           66    35    1.08E-06
• 23. 7     24    2.29E-05           37    138   4.14E-06           67    4     9.18E-07
8     5     1.98E-05           38    103   4.07E-06           68    39    9.00E-07
9     8     1.45E-05           39    42    3.92E-06           69    80    6.61E-07
10    66    1.28E-05           40    112   3.75E-06           70    55    6.43E-07
11    20    1.08E-05           41    28    3.68E-06           71    22    5.86E-07
12    29    9.34E-06           42    111   3.59E-06           72    101   5.59E-07
13    113   9.33E-06           43    43    3.58E-06           73    71    5.53E-07
14    83    9.29E-06           44    27    3.50E-06           74    132   4.33E-07
15    1     9.14E-06           45    53    3.31E-06           75    127   4.25E-07
16    2     8.70E-06           46    36    3.16E-06           76    44    4.12E-07
17    117   7.68E-06           47    70    3.14E-06           77    126   3.83E-07
18    133   5.70E-06           48    134   3.01E-06           78    89    3.67E-07
19    81    5.40E-06           49    88    2.95E-06           79    68    3.49E-07
20    108   5.38E-06           50    19    2.82E-06           80    41    3.46E-07
21    82    5.17E-06           51    11    2.75E-06           81    128   3.39E-07
22    143   5.04E-06           52    69    2.74E-06           82    10    2.48E-07
23    50    4.84E-06           53    18    2.57E-06           83    110   2.24E-07
24    33    4.76E-06           54    49    2.34E-06           84    6     2.23E-07
25    57    4.68E-06           55    99    2.25E-06           85    26    2.04E-07
26    56    4.67E-06           56    65    2.18E-06           86    64    1.73E-07
27    52    4.60E-06           57    139   2.11E-06           87    92    1.03E-07
28    48    4.52E-06           58    51    2.09E-06           88    93    8.48E-08
29    75    4.48E-06           59    104   2.07E-06           89    94    3.20E-08
30    34    4.47E-06           60    23    2.00E-06
Table 13. Variable Importance
We see that feature 18 (the average sales 7 days after today) ranked 53 among all the features, and it is about half as important as feature 17 (the average sales 7 days before today). Since there are 255 MAD and 255 MAPE values, it is not convenient to show them all in the report. Instead, detailed values are shown for the top 10 and bottom 10 store and item combinations, sorted in descending order of average daily sales, while the rest of the values are shown in the graphs to indicate the trend in MAD and MAPE.
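A sketch of computing MAD and MAPE per store and item combination, as defined in Section 5.2. It is illustrative only; the DataFrame columns store_nbr, item_nbr, units, and prediction are assumptions, and MAPE is returned as a fraction rather than a percentage.

```python
# Illustrative per-combination MAD and MAPE, with D_i replaced by the average demand.
import pandas as pd

def mad_mape(df: pd.DataFrame) -> pd.DataFrame:
    def per_group(g: pd.DataFrame) -> pd.Series:
        err = (g["prediction"] - g["units"]).abs()
        mean_demand = g["units"].mean()          # assumed > 0 for the combinations shown
        return pd.Series({
            "mean_daily_demand": mean_demand,
            "MAD": err.mean(),
            "MAPE": (err / mean_demand).mean(),  # fraction; multiply by 100 for percent
        })
    return df.groupby(["store_nbr", "item_nbr"]).apply(per_group)
```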
• 24. The tables and graphs are as follows.
Top 10 in average daily sales:
store_nbr   item_nbr   sum of sales   # of days recorded   MAD      MAPE    mean daily demand
33          44         189903         914                  36.219   0.115   207.771
16          25         135046         857                  28.097   0.118   157.580
30          44         136473         868                  26.824   0.317   157.227
17          9          135367         939                  45.548   0.204   144.161
2           44         117125         875                  21.016   0.120   133.857
4           9          117123         960                  36.619   0.190   122.003
33          9          101586         914                  36.785   0.227   111.144
25          9          98560          1011                 28.217   0.157   97.488
34          45         87419          947                  15.747   0.125   92.312
38          45         80068          875                  15.488   0.130   91.506
Table 14. Top 10 in average daily sales
Bottom 10 in average daily sales:
store_nbr   item_nbr   sum of sales   # of days recorded   MAD     MAPE    mean daily demand
16          85         67             857                  0.099   0.810   0.078
40          106        78             1011                 0.093   1.049   0.077
9           105        73             947                  0.099   0.884   0.077
22          104        68             898                  0.094   0.883   0.076
38          86         62             875                  0.088   0.929   0.071
25          84         69             1011                 0.087   0.906   0.068
20          106        61             896                  0.085   0.968   0.068
31          104        58             947                  0.070   1.025   0.061
34          84         46             947                  0.065   0.883   0.049
3           102        31             896                  0.045   0.936   0.035
Table 15. Bottom 10 in average daily sales
MADs for each store and item combination, sorted in descending order of average daily sales:
• 25. [Chart: MAD (y-axis) vs. average daily sales for each store and item combination, sorted in descending order (x-axis)]
Figure 6. MAD
MAPEs for each store and item combination, sorted in descending order of average daily sales:
[Chart: MAPE (y-axis) vs. average daily sales for each store and item combination, sorted in descending order (x-axis)]
Figure 7. MAPE
• 26. The above plots show that, in general, the MADs decrease and the MAPEs increase as the average daily sales decrease. For MAD, the models for some store and item combinations do not perform as well as others; this is particularly obvious for items with a large volume of sales. For those models, extra effort to fit a better model may be worthwhile as a further step. For MAPE, we can see a big jump from an average of around 0.1 to an average of around 0.4 when the average daily sales drop to around five. Yet it should be noted that the MAPE is scale sensitive and should not be used with low-volume data, because when the average demand is very low, the small denominator in the MAPE formula will often make MAPE take on extreme values.
5.3 Standard Deviation of Forecast Errors and its Implications for Safety Stock
In general, the forecast error variance is higher than the demand variance, since the forecast error also incorporates sampling error. If a forecast is used to estimate the mean demand, we keep safety stock in order to protect against the error in the forecast.⁶ Thus, the standard deviation (STD) of the forecast errors, rather than the standard deviation of demand, should be used to calculate safety stock. Because the model was built with 5-fold cross validation, each prediction group (generated by the model trained on the other four folds) accounts for only one fifth of the overall predictions. Thus, instead of calculating the standard deviation over all predictions at once, the standard deviation is calculated within each of the five prediction groups and then averaged, in order to stay consistent with the cross validation scheme.
⁶ Steven Nahmias, Production and Operations Analysis (New York: McGraw-Hill/Irwin, 2009).
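A sketch of the averaged per-fold standard deviation of forecast errors and the resulting daily safety stock. It is illustrative only: the column names, including a fold label for the cross-validation group, are assumptions, and z = 2.05 corresponds to the 98% service level used below.

```python
# Illustrative averaged-STD and daily safety stock per store/item combination.
import pandas as pd

def safety_stock(df: pd.DataFrame, z: float = 2.05) -> pd.DataFrame:
    """df has assumed columns store_nbr, item_nbr, fold, units, prediction."""
    df = df.assign(error=df["prediction"] - df["units"])
    # Standard deviation of forecast errors within each fold, then averaged over the 5 folds.
    fold_std = df.groupby(["store_nbr", "item_nbr", "fold"])["error"].std()
    avg_std = fold_std.groupby(level=["store_nbr", "item_nbr"]).mean().rename("averaged_STD")
    out = avg_std.to_frame()
    out["daily_safety_stock"] = z * out["averaged_STD"]   # roughly 2 x averaged STD at a 98% service level
    return out
```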
• 27. Again, a graph of the averaged STDs against the mean daily demand for each of the 255 store and item combinations is shown below:
[Chart: averaged STD (y-axis) vs. average daily sales for each store and item combination, sorted in descending order (x-axis)]
Figure 8. Averaged STD
Assuming overnight replenishment and a 98% service level (which corresponds to a z-score of about 2.05), the daily safety stock is calculated as approximately 2 × averaged STD. The percentage of daily safety stock over average daily demand for each store and item combination is shown below:
• 28. [Chart: percentage of safety stock over average daily demand (y-axis) vs. average daily sales for each store and item combination, sorted in descending order (x-axis)]
Figure 9. Percentage of safety stock over average daily demand
The part of the previous graph covering only combinations with average daily sales above 5 units is shown below:
[Chart: percentage of safety stock over average daily demand (y-axis) vs. average daily sales, restricted to combinations above 5 units and sorted in descending order (x-axis)]
Figure 10. Percentage of safety stock over average daily demand for combinations with average daily sales above 5 units
• 29. From the plot, we notice that for the products with average daily sales below five, the percentage of safety stock over average daily demand increases dramatically and fluctuates very unstably. This raises the question of whether it is profitable to keep those low-demand products in stock, since the safety stock for these products is much larger than their daily demand. However, similar to the problem with MAPE, when the average daily demand is very close to zero, its position in the denominator will often make the percentage take on very high values. This may partially account for the high spikes in the graph.
6 Conclusion
For the first objective, fitting an effective model to lower the RMSLE on the test data, three different methods with different model parameters are tested sequentially. Stepwise linear regression gives the highest (worst) RMSLE among the three methods, k-nearest neighbors search generates a better result, and ensemble learning provides the best prediction performance. A linear combination of models improves the prediction performance on the test data even further, although this combination does not generalize well, as indicated by its poor performance when tested on the training data alone using cross validation. The variable importance implies that weather information is not significant in predicting daily sales. Instead, features related to time contribute far more and rank among the top features in the importance ranking. Thus, although these products are assumed to be weather-sensitive, weather does not influence their sales as much as originally supposed. Future research on other machine learning techniques may further improve the prediction performance. However, the robustness of the model should always be kept in mind when the predictions are used in business activities such as setting the inventory policy.
• 30. The second objective allows us to dive into the implications of the predictions. Under cross validation, the ensemble tree model proves its robustness. It is natural that MAD decreases with average daily demand, yet products with a rather large MAD compared to others with similar average daily demand may require more attention for further model improvement. In addition, the two spikes in MAPE before the aforementioned jump at around 5 average daily sales raise concern; the models behind these two spikes should be further tested with other machine learning techniques. Finally, the calculated safety stock, expressed as a percentage of average daily demand, raises the question of whether some products are profitable to keep on the store shelves. Although no further data is provided, inventory costs such as the holding cost, the obsolescence cost, the ordering cost, the storage space cost, and the transportation cost for those products should all be taken into account when more detailed information about them becomes available.
7 References
"Data - Walmart Recruiting II: Sales in Stormy Weather | Kaggle." Accessed December 9, 2015. https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/data.
"Create Linear Regression Model Using Stepwise Regression - MATLAB Stepwiselm." Accessed December 10, 2015. http://www.mathworks.com/help/stats/stepwiselm.html.
"Classification Using Nearest Neighbors - MATLAB & Simulink." Accessed December 10, 2015. http://www.mathworks.com/help/stats/classification-using-nearest-neighbors.html.
Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. "Discussion of Boosting Papers." Ann. Statist 32 (2004): 102–7.
"Ensemble Learning - Wikipedia, the Free Encyclopedia." Accessed December 9, 2015. https://en.wikipedia.org/wiki/Ensemble_learning.
• 31. Nahmias, Steven. Production and Operations Analysis. New York: McGraw-Hill/Irwin, 2009.
8 Appendices
1. Linear regression model of 1 k-nearest neighbors and 1 ensemble learning model:
y ~ 1 + x1 + x2
Estimated Coefficients:
                        Estimate   SE          tStat    pValue
(Intercept)             0.24598    0.044419    5.5377   3.0688e-08
Ensemble learning       0.85715    0.0058669   146.1    0
K-nearest neighbors     0.21063    0.0055461   37.977   1.2375e-314
Root Mean Squared Error: 18.8
R-squared: 0.773, Adjusted R-Squared: 0.773
2. Linear regression model of 1 stepwise linear regression and 3 ensemble learning models:
y ~ 1 + x1 + x2 + x3 + x4
Estimated Coefficients:
                Estimate   SE          tStat     pValue
(Intercept)     -0.18353   0.033454    -5.486    4.1145e-08
x1              0.2025     0.0043553   46.495    0
x2              0.33965    0.0049085   69.195    0
x3              -0.36155   0.017711    -20.414   1.5038e-92
x4              0.8802     0.017254    51.014    0
Root Mean Squared Error: 14.3
R-squared: 0.868, Adjusted R-Squared: 0.868