This document is a thesis submitted by Jinxing Lin to Cranfield University in partial fulfillment of a Master of Science degree. The thesis investigates applying machine learning techniques for sales forecasting. It includes a literature review covering machine learning algorithms that have been applied for sales forecasting, such as regression trees, support vector machines, neural networks, and extreme learning machine. The methodology section describes the data source and preparation, as well as techniques to be applied including random forest regression, time series forecasting, and evaluating results. The thesis aims to study machine learning algorithms and apply them to a dataset to perform sales forecasting.
CRANFIELD UNIVERSITY
Jinxing Lin
Application of machine learning
techniques for sales forecasting
School of Aerospace, Transport and Manufacturing
Software Engineering for Technical Computing
MSc. Thesis
Academic Year: 2014-2015
Supervisor: Irene Moulitsas
June 23, 2015
Abstract
Sales forecasting is considered increasingly important by companies and
industries. An accurate prediction helps them understand the future trend of
their products and therefore make a better sales plan and prepare the
production, stock and transport of products. These improvements minimise cost
while still satisfying clients' demands. In other words, precise sales
forecasting can substantially improve supply chains.
TODO: techniques
TODO: conclusions
Chapter 1
Introduction
Information Technology (IT) plays an increasingly important role in supply
chain management (SCM), from basic IT infrastructure in a company to the
Virtual Enterprise, and from storage of data to analysis of data and even sales
forecasting [5]. An accurate sales forecast helps companies make decisions on
production planning as well as on sales prices. It can also help them
distribute their resources effectively, reduce unnecessary cost and provide
satisfactory customer service. Sales forecasting is, however, affected by many
factors such as product lifespan, economic climate, competition and
globalisation. Over the last twenty years, a lot of research has aimed to
improve sales forecasting, and most of it is based on machine learning
techniques.
First of all, let us look at what machine learning (ML) is. In recent years, ML
has been one of the hottest topics in computer science. Even non-technical
people can hardly have escaped the articles, headlines, videos and TV
programmes on the rise of big data [13, 19] and machine learning. Briefly, big
data refers to datasets that fit the following "3V":
• Volume: The volume of data is so important that it determines the value of
datasets. Traditional data processing on a single machine can no longer cope
with the growing volume of data, so new scalable processing methods have
been developed.
• Variety: The variety of data refers to its different forms. Big data can be
data stored in structured databases, text, images, voice messages or videos,
and it can be collected from various sources such as the internet, mobile
phones, personal computers and wearable devices.
• Velocity: Big data is generated very rapidly. For example, during each
trading session, the New York Stock Exchange obtains 1 TB of trade
information.
The volume of global data is increasing dramatically, which makes it more and
more challenging to find suitable technologies to support this extreme growth.
As shown in Figure 1.1, in five years the volume of data will be 10 times what
it was two years ago. Research into appropriate supporting technologies is
therefore becoming urgent. A suitable technology should be able to store large
volumes of data, on the order of several TB; it should be able to compute in
parallel, which significantly accelerates computation and provides high
performance; in addition, support for data streaming would be an attractive
feature. The two most widely used platforms designed for big data are Hadoop
and Spark. They use different high performance computing techniques to
distribute data and jobs. Both are briefly presented in the following chapter.
Figure 1.1: Growth of global data (source:
http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)
The use of big data is expanding rapidly and reaching every aspect of our
lives. For example, the police predict when and where crimes may happen based
on big data; health centres use big data to predict outbreaks of disease;
travel agencies use big data to understand customers' preferences and design
attractive travel plans; accurate stock forecasting is becoming possible with
big data. Our lives are now surrounded by big data and, in order to make sense
of it and uncover its hidden value, we use machine learning algorithms to
explore it.
Machine learning brings computer science and statistics together in order to
learn from data and find suitable patterns or models for it. These patterns or
models can then be projected into the future to make predictions. Broadly,
machine learning techniques can be categorised by their purpose:
• classification: training a model to assign observations to their correct
classes, such as deciding whether a customer is loyal or not;
• regression: training a model to predict a continuous output, such as stock
forecasting;
• clustering: training a model to split input observations into groups that
are not known beforehand, such as grouping images by similarity.
Nowadays, ML is widely used by all kinds of companies. Google uses ML in its
search engine; Facebook uses ML to recommend friends or advertisements to us;
Tesco uses ML to distribute coupons to customers; weather services use ML to
forecast the coming days' weather.
Having introduced big data and machine learning, we turn to the subject of this
thesis: discovering and applying machine learning techniques. Among the many
domains in which machine learning can be used, we find it interesting to
investigate the role it can play in sales forecasting, which receives a great
deal of attention in business analysis. The aims of this thesis are:
• studying machine learning algorithms and understanding their mechanisms;
• analysing the given dataset and elaborating a plan for how the data should
be used;
• looking for pertinent patterns in the dataset and making sales forecasts.
To achieve these objectives, we will first study different machine learning
algorithms in depth, choose those that might suit the dataset, apply them,
customise them with regard to the data and then compare their performance.
The following chapter is a literature review. It mainly covers machine learning
techniques that have been applied to sales forecasting and some commonly used
technologies for machine learning. In Chapter 3, we describe where the data
comes from, how we pre-process it, which ML algorithms we apply and why we
chose them. After the methodology chapter, the results of the different
algorithms are shown with diagrams and charts in Chapter 4. The penultimate
chapter analyses the results, that is, it compares the performance of the
different techniques. The last chapter concludes the research: it summarises
the overall process and sets out expectations for continuing this work.
Chapter 2
Literature review
2.1 Machine Learning Techniques for sales forecasting
The exponential smoothing model [4] and the AutoRegressive Integrated Moving
Average (ARIMA) model [21] are among the earliest models applied to sales
forecasting. Forecasting models such as Neural Networks (NN) [21] and fuzzy
models [12] have also often been applied. As the volume of data grows rapidly
and the demand for accuracy rises, newer machine learning techniques are being
used for sales forecasting on different types of data. For example, clustering
[16] and decision trees [16] are used to develop a sales forecasting system for
Textile-Apparel-Distribution, and neural clustering and classification [17] are
applied to forecast the sales of new apparel items. In recent years, the
Extreme Learning Machine (ELM) has very frequently been applied to sales
forecasting in combination with other techniques.
The following sub-sections present some techniques that are likely to be
applied or tested during this research.
2.1.1 Regression Trees
Before introducing regression trees, it is important to clarify that a decision
tree is a predictive model which estimates the value of a certain feature of a
system from the observation of its other features. A regression tree is a form
of decision tree that is specific to numeric data. Based on the given data, it
trains a tree model whose leaves represent groups of instances in the same
class and whose branches represent the separation of instances into different
groups (Figure 2.1). It is a supervised learning method that trains its model
with known inputs and known outputs. Once the regression tree is constructed,
we can input new data and the tree outputs a prediction for it. Regression
trees have proved to be an efficient tool for sales forecasting in textile
distribution [16].
Figure 2.1: Regression tree
2.1.2 Support vector machines
Support vector machines (SVM) are machine learning models used for regression
and classification. The principle of this model is to create one or a set of
hyperplanes between different classes, chosen to maximise the margins between
classes so as to obtain the lowest classifier error. An example is shown in
Figure 2.2. In geometry, a hyperplane is a subspace of one dimension less than
its ambient space. Often, it is not feasible to separate the data in its
original dimension. In this case, the data is projected into a
higher-dimensional space. In [1], a hybrid system of RNN and SVM was applied to
forecast sales, and the results showed that the hybrid system outperformed
simple traditional forecasting techniques such as the moving average.
Figure 2.2: SVM with hyperplane (source:
http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html)
2.1.3 Stochastic gradient descent
Optimising the training of a model can make its predictions considerably more
accurate. A frequently used optimisation algorithm is gradient descent (GD)
[9]. It aims to find the minimum of a function. To do this, it repeatedly takes
steps in the direction opposite to the gradient at the current point, the
gradient being the direction in which the function increases the fastest. A
model to be trained, for example a linear regression model, has a cost
function, and gradient descent is applied to train the model by minimising this
cost function.
Gradient descent can be roughly divided into two methods: batch gradient
descent (BGD) and stochastic gradient descent (SGD) [3,22]. BGD runs many
iterations and, in each iteration, uses all the training data to calculate the
gradient. SGD, in contrast, picks only one random example in each iteration to
estimate the gradient. As the amount of data used in each iteration is smaller
in SGD than in BGD, and SGD does not need to remember the examples it has
previously visited, SGD is often considered the more efficient method. In [?],
Dr. Bottou studied the application of SGD to several machine learning systems
such as K-Means, SVM and Lasso, and showed that they performed better with SGD.
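The difference between the two update rules can be illustrated with a minimal
sketch. It is not taken from the thesis: the linear model y = w*x + b, the
mean squared error cost, the learning rates and the toy data are all
assumptions chosen for illustration.

```python
import random

def batch_gradient_descent(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b; every step uses the gradient over ALL examples."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error over the whole dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def stochastic_gradient_descent(xs, ys, lr=0.02, steps=5000, seed=0):
    """Fit y ~ w*x + b; every step uses ONE randomly chosen example."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        i = rng.randrange(len(xs))
        error = w * xs[i] + b - ys[i]
        # Gradient of the squared error of the single picked example.
        w -= lr * 2 * error * xs[i]
        b -= lr * 2 * error
    return w, b

# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w_bgd, b_bgd = batch_gradient_descent(xs, ys)
w_sgd, b_sgd = stochastic_gradient_descent(xs, ys)
```

Both variants should recover a slope close to 2 and an intercept close to 1 on
this toy data; each SGD step touches a single example, which is what makes it
attractive when the dataset is large.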
2.1.4 Neural networks
There are various types of neural networks, and the most widely applied one is
the feed-forward neural network with error back-propagation [11,21]. A neural
network has at least two layers: the input layer and the output layer. One or
more hidden layers can be inserted between the input and output layers, and
each layer contains its own elements (neurons). As shown in Figure 2.3, the
data enters at the input layer, passes through the hidden layers where it is
transformed, and is finally sent to the output layer. Information only flows
forward in this type of neural network. The network adjusts its weights using
the information fed back from the comparison of its output values with the
correct answers.
Figure 2.3: Artificial neural network
2.1.5 Recurrent neural networks
Recurrent neural networks (RNN) are developed on the basis of neural networks.
The main difference is that an RNN allows some outputs of neurons to go back
and become inputs of other neurons. This means that information does not flow
in only one direction and can be processed several times before being output,
as indicated in Figure 2.4. Owing to this, an RNN can perform better than a
one-directional neural network, for example in stock price forecasting [1,8].
Figure 2.4: Recurrent neural network
2.1.6 Extreme learning machine
Extreme Learning Machine (ELM) has been applied to fashion supply chains for
sales forecasting [14,20]. ELM is an algorithm for single-hidden-layer
feedforward neural networks (SLFN). In this algorithm, the input weights and
the hidden biases are determined randomly, while the output weights are
determined analytically with the Moore-Penrose (MP) generalized inverse. In
comparison with traditional gradient-based algorithms, ELM is faster and more
effective. In addition, it avoids some problems faced by gradient learning
algorithms, such as choosing stopping criteria, the learning rate and the
number of learning epochs. The experimental results in [14] demonstrate that
the performance of the ELM model is superior to that of some sales forecasting
algorithms based on back-propagation neural networks. In some studies, ELM is
combined with other machine learning techniques to build forecasting models
[2,10,20].
ELM and harmony search algorithm
In [20], the Harmony Search (HS) algorithm was combined with ELM for sales
forecasting in fashion retail supply chains. Integrating harmony search with
ELM yields a novel meta-heuristic optimisation algorithm with which optimal NN
weights can be obtained and high forecasting performance achieved. According to
[20], this hybrid intelligent system significantly outperforms traditional
ARIMA models as well as two other neural network models developed for fashion
sales forecasting.
ELM and ensemble empirical mode decomposition
Empirical mode decomposition (EMD) is a signal processing technique based on
the local characteristic time scales of a signal. It is usually applied to
decompose a signal into intrinsic mode functions (IMF), which form a small,
finite number of components. Ensemble empirical mode decomposition (EEMD) was
developed to alleviate the major problem of EMD, namely mode mixing. EEMD
combines the EMD method with a noise-assisted data analysis method. With EEMD,
the original sales data can be converted into IMFs, which are then fed into ELM
to forecast sales of computer products [10]. In comparison with single ELM,
single support vector regression (SVR) and single back-propagation neural
network (BPN) models, the EEMD-ELM model performs better in forecasting the
sales of computer products.
ELM and Gray relation analysis
Gray relation analysis (GRA) can be integrated with ELM to set up a hybrid
sales forecasting system (GELM). GRA measures the relative distance between a
compared series of data and a reference series of data, which is called the
Gray relation grade (GRG). The ranking of the GRGs shows which factors affect
the sales amounts the most, and these influential factors are used as the input
variables of the ELM model. Experimental results show that the GELM hybrid
system outperforms BPN and multilayer functional-link network (MFLN) models
[2].
2.2 Supported technologies
In the age of big data, more and more technologies designed for big data have
come to light. In this section, we present some of the technologies that have
been most widely used recently and that we might use in this thesis project.
2.2.1 Apache Hadoop
Apache Hadoop [15] is a platform designed for distributed storage and
distributed processing of large datasets. Basically, Hadoop takes data files as
input, splits the data into large blocks and distributes them to the nodes in
the cluster. Each node then receives packaged code matching the data it holds
and executes the computation in parallel. This greatly improves efficiency in
comparison with computing all the data on a single node.
Hadoop has four main components:
• Hadoop Common: contains libraries and supports the other Hadoop modules;
• Hadoop distributed file system (HDFS): provides high-throughput access and
performs best with large files;
• Hadoop Yet Another Resource Negotiator (YARN): a distributed resource
scheduler;
• Hadoop MapReduce: a distributed processing framework which decomposes
work into small parallelised map and reduce tasks.
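The map and reduce decomposition can be illustrated with a minimal
single-process sketch. It does not use the Hadoop API: the function names, the
record layout (reference, quantity) and the sample records are hypothetical,
and on a real cluster the map and reduce calls would run on different nodes.

```python
from collections import defaultdict

def map_phase(record):
    """Map step: emit one (key, value) pair per input record."""
    reference, quantity = record
    return (reference, quantity)

def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all the values of one key into a single result."""
    return (key, sum(values))

# Hypothetical sales records: (product reference, quantity sold).
records = [("1312114", 34), ("1312115", 15), ("1312114", 26)]

pairs = [map_phase(record) for record in records]
totals = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Here `totals` holds the summed quantity per product reference. Because each map
call looks at a single record and each reduce call at a single key, both phases
can be spread over many workers.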
2.2.2 Apache Spark
Apache Spark is a cluster computing platform built for processing large
datasets. Compared with one of its competitors, Hadoop, Spark extends the
MapReduce model (used by Hadoop) to support more types of computation.
MapReduce writes intermediate data to disk, whereas Apache Spark keeps data in
memory, and because of this Apache Spark outperforms Apache Hadoop in most
cases in terms of speed.
Basically, Apache Spark relies on two principal parts: a management system and
a distributed storage system. The management system can be Spark standalone,
Spark pseudo-distributed mode, Hadoop YARN or Apache Mesos. Likewise, there are
many choices for the distributed storage system, such as HDFS, Cassandra and
Amazon S3.
The platform is written in Scala and it is compatible with several languages:
Scala, Java, Python and SQL. This high compatibility with the main data
processing languages helps Apache Spark become more and more popular.
The main components of Apache Spark are:
• Spark Core, which is used for task scheduling, memory management and fault
recovery, and which provides the API to create and manipulate Resilient
Distributed Datasets (RDD, a collection of objects distributed over the
computation nodes in order to be computed in parallel);
• Spark SQL, which supports a new data abstraction for structured data:
SchemaRDD;
• Spark Streaming, which allows data streams to be manipulated;
• MLlib, a Machine Learning (ML) library, which provides some common
Machine Learning algorithms;
• GraphX, which is used to perform distributed graph computations.
Since this thesis is mostly about running machine learning algorithms on large
volumes of data, Apache Spark is chosen as the cluster computing platform to
which we submit our computing tasks.
2.2.3 Scala
Scala is a programming language that runs on the Java Virtual Machine (JVM) and
supports several paradigms at the same time: object-oriented programming and
functional programming. As a fusion of the two programming concepts, Scala
allows building structures and elements that treat computation as the
evaluation of mathematical functions. Meanwhile, it allows building large
systems with component abstraction and a legible structure. Moreover, it is
open-source, which makes it easier for users to adopt.
As Apache Spark is compatible with Scala and Scala is less verbose in
comparison with Java and Python, we chose Scala as our main programming
language.
2.2.4 R
Spark is an enterprise-level development tool and, as mentioned before, it is
very efficient for applying the basic machine learning algorithms provided in
MLlib. For deep learning algorithms, however, it may not be the best choice.
Nowadays, "GPU + CUDA" is becoming the main architecture used by data
scientists to build and run their deep learning programs, thanks to the high
performance with which GPUs process tasks.
On the other hand, Spark is not a good choice for building customised machine
learning algorithms either, since it does not provide a matrix data structure
and most machine learning techniques make heavy use of matrices.
In this project, some deep learning techniques such as neural networks and
extreme learning machines are expected, and building customised algorithms is
indispensable, so we need another tool to run these deep learning techniques
and to construct our own algorithms. After researching different alternative
tools, we finally chose R. R is a programming language which provides a
numerical computing environment (it supports matrix arithmetic) and supports
performing computations on CUDA GPUs, which can possibly speed up the
computations. Moreover, thanks to a recently released package, SparkR, it is
possible to scale R programs in Spark (in a distributed fashion).
Chapter 3
Methodology
3.1 Data
3.1.1 Data source
To examine the effectiveness of different machine learning algorithms in sales
forecasting, Didactic, a French surgical equipment company, provides a dataset.
This dataset consists of monthly sales of some surgical equipment in the period
from October 2013 to March 2015. 758 different products are included in this
dataset. Each product has its own reference, its daily sales quantity for each
client and its price. In addition, a dataset of calls for tenders for surgical
equipment in the market is also available. The aim is to build a sales
forecasting model with the information provided in these two datasets.
3.1.2 Data preparation
First of all, the program reads the data from a data file which contains all
the sales records of Didactic. In the original data file, the data is separated
by product, by order and by day, whereas what we want to obtain is the total
quantity of daily sales of each product, so all the sales quantities must be
summed per product per day. The following table (Table 3.1) is an example of
the data after the daily sum:
According to marketing experts at Didactic, seasonality might be an important
factor affecting sales. This means that sales might vary within a certain range
depending on the month of the year. Following this marketing assumption, we
also analyse the sales changes month by month. Therefore, the sales quantities
must also be summed per product per month. An example of the processed data
follows (Table 3.2):
reference  year  month  day  quantity
1312114    2014  11     13   34
1312114    2014  11     14   26
...        ...   ...    ...  ...
1312115    2014  11     13   15
Table 3.1: Prepared data: daily sum
reference  year  month  quantity
1312114    2014  11     543
1312114    2014  12     614
...        ...   ...    ...
1312115    2014  11     317
Table 3.2: Prepared data: monthly sum
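The two aggregation steps can be sketched as follows. This is a minimal
single-machine illustration, not the program used in the thesis: the record
layout (reference, year, month, day, quantity), the helper name `sum_by` and
the sample order lines are assumptions.

```python
from collections import defaultdict

# Hypothetical raw records, one per order line:
# (reference, year, month, day, quantity)
orders = [
    ("1312114", 2014, 11, 13, 20),
    ("1312114", 2014, 11, 13, 14),
    ("1312114", 2014, 11, 14, 26),
    ("1312115", 2014, 11, 13, 15),
]

def sum_by(records, key_len):
    """Sum the quantities grouped by the first `key_len` fields of each record."""
    totals = defaultdict(int)
    for record in records:
        totals[record[:key_len]] += record[-1]
    return dict(totals)

daily = sum_by(orders, 4)    # key: (reference, year, month, day)
monthly = sum_by(orders, 3)  # key: (reference, year, month)
```

The same grouping key is simply shortened from four fields to three to move
from the daily sum to the monthly sum.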
Based on the assumption of seasonality, we suppose that there is a relation
between the average quantity of the four previous months' sales and the
quantity of the following month's sales. Hence, the average quantity of the
four previous months' sales must be computed. After this further data
preparation step, we get data as follows (Table 3.3):
reference  year  month  quantity  avg4premonths
1312114    2014  11     543       517
1312114    2014  12     614       627
...        ...   ...    ...       ...
1312115    2014  11     317       226
Table 3.3: Prepared data: monthly sum and average of four previous months' sales
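A minimal sketch of this feature computation is given below. It assumes that
the monthly totals of one product are available in chronological order with no
missing month; the function name and the sample quantities are hypothetical.

```python
def add_previous_average(monthly_quantities, window=4):
    """Pair each month's quantity with the mean of the `window` months before it.

    Months that do not have `window` earlier months are skipped.
    """
    rows = []
    for i in range(window, len(monthly_quantities)):
        previous = monthly_quantities[i - window:i]
        rows.append((monthly_quantities[i], sum(previous) / window))
    return rows

# Hypothetical monthly totals for one product, oldest first.
quantities = [500, 520, 510, 538, 543, 614]
rows = add_previous_average(quantities)
```

Each element of `rows` is a (quantity, avg4premonths) pair; the first four
months produce no row because their history is too short.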
3.2 Random forest with regression tree
3.2.1 Regression tree
As mentioned in the literature review, regression trees are methods commonly
used to train models with given predictor variables x and a continuous response
y. With these models, we can predict the value of y for a new given value of x.
In essence, the regression tree algorithm analyses the inputs and creates the
tree from the top (root node) to the bottom (leaves). The construction of a
tree is a recursive partition. Each node works like a filter: it asks a
question about a particular feature of the instances, for example "Is price >
£20" or "Is lifespan < 5 months", and routes the input instances into sub-nodes
or leaves accordingly. Leaves are the final groups of instances which lead to
similar outputs. A point x belongs to a leaf if x is assigned to the
corresponding cell of the partition. The number of nodes in a tree grows
exponentially with its depth. Once the partition of instances is done, the
average of the target values of the instances in each leaf is calculated, and
this average is the prediction for all the instances assigned to that leaf.
Then, when we input a new instance into this model, it goes down the tree until
it reaches one of the leaves and takes the prediction value assigned to that
leaf.
One of the most important advantages of regression trees is their
interpretability. As each node contains a question about a certain feature, the
whole model is very simple to understand, even for people without a machine
learning background. It can be seen as a tree of conditions: if a condition is
satisfied, the instance goes to one side; if not, it follows the other branch.
No matter how big the tree is, it remains understandable. Once the model is
built, predictions are fast because there is no complex calculation to execute.
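The recursive partition and the leaf averages can be illustrated with a small
sketch. It is not the MLlib implementation used in this thesis: the impurity
measure (sum of squared errors), the exhaustive threshold search, the tuple
representation of nodes and the sample data are all assumptions made to keep
the example short.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean, used here as the impurity."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature index, threshold) that most reduces impurity, or None."""
    best, best_score = None, sse(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def build_tree(rows, targets, max_depth=3):
    """Recursively partition the instances; a leaf stores the mean target."""
    split = best_split(rows, targets) if max_depth > 0 else None
    if split is None:
        return mean(targets)
    f, threshold = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [t for _, t in left], max_depth - 1),
            build_tree([r for r, _ in right], [t for _, t in right], max_depth - 1))

def predict(tree, row):
    """Walk down from the root until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

# Hypothetical instances: features (year, month), target = monthly quantity.
rows = [(2014, 1), (2014, 2), (2014, 11), (2014, 12)]
targets = [100.0, 120.0, 540.0, 620.0]
tree = build_tree(rows, targets, max_depth=1)
prediction = predict(tree, (2016, 2))
```

With a depth of one, the sketch splits on the month and each leaf returns the
average quantity of the instances it received, which mirrors the question and
leaf structure described above.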
Implementation
As mentioned in the literature review, Spark provides a library of machine
learning algorithms which includes regression tree algorithms. We apply the
regression tree technique to two main data models.
The first model is based on the hypothesis that future sales depend only on the
year and the month. In effect, we treat it as a time-series model. In a
time-series model, time (or date in this case) plays the key role and other
factors such as price changes are secondary. A secondary factor can also affect
the sales, but only to a small extent, and this small portion is considered as
noise. More details about the time-series model are presented in the next
section. In this first model, we therefore have year and month as features and
quantity as target. The regression tree algorithm analyses the inputs and
builds a tree like the following example (Figure 3.1):
Figure 3.1: Regression tree with year-month model
In this graph, we can observe that year and month are used as features. Each
node contains a question about one of these two features and splits the input
instances into sub-partitions. At the bottom, we obtain groups of instances
which we call leaves. A quantity value is assigned to each leaf; this value is
the average quantity of the instances that fell into that leaf. After the
construction of the regression tree, if a new instance comes in for sales
forecasting, for example (year = 2016, month = 2), it goes from the top to the
bottom of the tree and falls into a leaf. It then takes the quantity of that
leaf as its predicted target value.
Compared with the first model, the only difference in the second model is that
it contains one more feature, qty4 (the average of the 4 previous months'
sales). The regression tree is trained, and the target value of new instances
is predicted, in exactly the same way.
Results
TODO put tuning algorithms’ results
3.2.2 Random forest
Overfitting the training dataset is a problem that often affects regression
trees, and random forests were introduced to correct it. The random forest
method was proposed and developed by Leo Breiman in 2001 [?]. It represents a
family of ensemble learning methods for classification and regression. These
methods exclusively use decision trees as base learners and output either the
majority class or the mean prediction of the individual trees. While
constructing the decision trees, we introduce a random factor through the use
of Bagging and Random Feature Selection. In random forests, all the decision
trees operate independently. Each decision tree is built based on a random
vector of parameters. For example, the k-th tree in the forest depends only on
the vector θ_k and is independent of the other vectors. All the trees
participate in the final decision. Figure 3.2 gives a graphical presentation of
the random forest structure:
Figure 3.2: Random Forest
Bagging
Bagging is a method that selects a subset of the training data for the
construction of each decision tree in the random forest. These sub-datasets are
called bootstrap samples. Bagging repeatedly picks a random sample with
replacement from the training dataset. TODO put a figure to demonstrate Bagging
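Sampling with replacement can be sketched in a few lines. The function name,
the sample size (equal to the size of the training set) and the toy training
set are assumptions made for illustration.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items uniformly at random, with replacement."""
    return [rng.choice(dataset) for _ in dataset]

rng = random.Random(42)       # fixed seed so the sketch is reproducible
training = list(range(10))    # stand-in for ten training instances

# One bootstrap sample per tree of a hypothetical three-tree forest.
samples = [bootstrap_sample(training, rng) for _ in range(3)]
```

Because items are drawn with replacement, a sample usually contains some
instances several times and omits others, so each tree sees a slightly
different dataset.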
Random Feature Selection
Random Feature Selection focuses on the features of the data. It randomly
selects a fixed number k of features and then, among these k chosen features,
selects the one which optimises the partition of the data. The optimal feature
is determined by comparing the impurity of the data assigned to the sub-nodes.
TODO put a figure to demonstrate RFS
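The selection step at one node can be sketched as follows. The impurity values
are made up, the "price" feature is only an example, and the function name is
hypothetical; in a real tree the impurity would be computed from the instances
reaching the node.

```python
import random

def choose_split_feature(features, impurity_after_split, k, rng):
    """Sample k candidate features, then keep the one with the lowest impurity."""
    candidates = rng.sample(features, k)
    return min(candidates, key=impurity_after_split)

# Hypothetical impurity of the sub-nodes obtained by splitting on each feature.
impurity = {"year": 0.9, "month": 0.2, "qty4": 0.1, "price": 0.5}

rng = random.Random(0)
chosen = choose_split_feature(list(impurity), impurity.get, k=2, rng=rng)
```

Only the k sampled candidates compete, so the globally best feature is not
always chosen; this is what makes the trees of the forest differ from one
another.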
All in all, bagging is applied to construct a random sub-dataset for each
decision tree, while random feature selection is used to select the optimal
feature among a sub-group of features during the learning process of each tree.
Both methods lead to a better machine learning model which does not overfit the
input dataset.
Implementation
The random forest algorithm is provided by MLlib, and it is called in much the
same way as the regression tree. In order to find the best parameters for our
dataset, we tune the algorithm with different values for each parameter, such
as the number of trees in the forest, the maximum depth of a tree and the
maximum number of intervals (bins) for a feature.
Result
TODO put tuning results here
3.3 Time-series forecasting
Applying existing machine learning algorithms directly to the data gives
reasonable results. In order to get more control over the learning methods,
however, we decide to build an algorithm suited to this use case and this
dataset. Since the main data we have is related to dates, and the marketing
experts told us that seasonality might affect the sales, we decide to build a
time-series model. A time-series technique looks only at the patterns in the
history of actual sales and, based on these patterns, predicts future sales
[18].
3.3.1 Main components in time-series model
There are four main components to take into account when setting up a
time-series model (Figure 3.3):
• level: a horizontal sales history;
• trend: a pattern that represents a continuous increase or decrease in sales;
• seasonality: a pattern that represents how sales repeatedly increase and
decrease within a certain period (e.g. one year);
• noise: a random fluctuation which might be explained by features other than
time, such as price changes or the quality of customer service.
Figure 3.3: Time-series components (source: [18])
3.3.2 Implementation
Visualisation of data
In order to gain better insight into the data, we visualise it in the form of
line charts. The first line chart shows the monthly total sales of all the
products in Didactic. It gives a global view of how the sales vary over time,
from which we can roughly see whether there is a trend or seasonality in the
company's product sales. To create this line chart, we go through each month and
sum up the quantities sold. We then plot these sums against their corresponding
dates so that we can see how the sales evolve with time. Put graphs here
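The monthly aggregation behind this first chart can be sketched as follows. The record layout (year, month, quantity) is an assumption made for illustration, and the sample quantities are taken from the rows shown later in Table 3.4.

```python
from collections import defaultdict


def monthly_totals(sales):
    """Sum the quantities sold per (year, month), in chronological order.

    sales is an iterable of (year, month, quantity) records.
    """
    totals = defaultdict(int)
    for year, month, quantity in sales:
        totals[(year, month)] += quantity
    return sorted(totals.items())


records = [(2014, 11, 534), (2014, 11, 317), (2014, 12, 614)]
totals = monthly_totals(records)
```

The resulting list of ((year, month), total) pairs can be passed directly to any plotting library to draw the line chart.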
After this global view, we also need some line charts that provide more detail.
Time-series models are usually constructed for a single product or a family of
similar products. For this reason, we draw a quantity-date line plot for each of
the 20 products that have been sold for the longest time. We take these top 20
products because they have the longest sales history, which gives a clearer view
of how the sales evolve with time. These plots show the evolution of daily
sales, so they often contain sudden peaks or drops (short-term fluctuations). To
reduce the unwanted effect of these short-term fluctuations, we apply a moving
average. A moving average is a technique used to smooth out short-term
fluctuations and emphasise long-term trends or periodic cycles. Essentially, for
each data point, it creates a small subset containing the data around that point
and then calculates the average of the data in this subset. These averages
become the new values of the points. Put graphs here
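A centred moving average of this kind can be sketched in a few lines. The window size and the truncation of the window at both ends of the series are assumptions made for illustration; the sample values are made up.

```python
def moving_average(values, window=3):
    """Centred moving average; the window is truncated at both ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        subset = values[lo:hi]  # the data around point i
        smoothed.append(sum(subset) / len(subset))
    return smoothed


# Made-up daily sales with one sudden peak and one sudden drop.
daily_sales = [10, 12, 50, 11, 9, 10, 2, 11]
smoothed = moving_average(daily_sales, window=3)
```

A larger window smooths more strongly but also flattens genuine short seasonal patterns, so the window should stay well below the length of the cycle of interest.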
Proposed models
There are many possible approaches to constructing a model for a time-series
problem. We propose here two different models consisting of trends and seasonal
coefficients:
• Multiplicative model;
• Additive model.
Multiplicative model This model is based on the assumption that target values
can be predicted by multiplying values on a trend by the relevant seasonal
coefficients. Figure 3.4 and the following formula describe one such
multiplicative model:
$$\mathrm{Sales}_t = \mathrm{Trend}_t \times \mathrm{SA}_t + \mathrm{Noise}_t$$
where SA = seasonal adjustment and t = time (day, week, month, etc.). In this
model, we consider that a proportional relationship exists between the target
value at a given time and the value on the trend. Once we have calculated the
seasonal coefficients and the trend, we can project both into the future and
forecast a target value for a given input.
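As a small worked example of the multiplicative forecast, the sketch below multiplies a trend value by the seasonal coefficient of the target month. The trend function and the coefficients are hypothetical values chosen for illustration, and the noise term is ignored because it cannot be forecast.

```python
def forecast_multiplicative(trend, seasonal_adjustment, year, month):
    """Forecast Sales_t = Trend_t * SA_t for the given month."""
    date = year + month / 12  # the date expressed as a number
    return trend(date) * seasonal_adjustment[month]


def hypothetical_trend(date):
    # Assumed linear trend: 100 units at date 2015, growing by 10 per year.
    return 100.0 + 10.0 * (date - 2015)


# Assumed seasonal coefficients for two months of the year.
seasonal_adjustment = {11: 1.2, 12: 1.5}

december = forecast_multiplicative(
    hypothetical_trend, seasonal_adjustment, 2015, 12
)
```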
Figure 3.4: Multiplicative model
Additive model Apart from the multiplicative model, we employ an additive model
as well. The difference is that in this model we assume that a target value can
be calculated by adding a value on the trend and a corresponding residue. Each
observation has a corresponding residue, and these residues are related to each
other. The objective is therefore to compute the seasonal coefficients from the
residues of the observations, predict the future residues and thereby obtain
future target values. The model is shown in the additive-model figure and in the
formula:
$$\mathrm{Sales}_t = \mathrm{Trend}_t + \mathrm{Residue}_t + \mathrm{Noise}_t$$
where
$$\mathrm{Residue}_t = \mathrm{Residue}_{t-1} \times \mathrm{SA}_t$$
where SA = seasonal adjustment and t = time (day, week, month, etc.). The sales
quantity is thus the sum of the trend and the residue, and each residue is
calculated from the previous one. Compared with the multiplicative model, this
model has a shortcoming: it may propagate errors if we want to predict values
for more than one point in the future. Since each residue is computed from the
previous one, an error made at one point carries over to all later points.
Therefore, the further ahead the point we want to predict, the larger the error
may be.
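The recursion on the residues, and the way an error in the starting residue carries over to every later forecast, can be illustrated with the following sketch. All numbers are hypothetical.

```python
def forecast_additive(trend_values, last_residue, seasonal_adjustments):
    """Multi-step forecast with Residue_t = Residue_{t-1} * SA_t."""
    forecasts = []
    residue = last_residue
    for trend_value, sa in zip(trend_values, seasonal_adjustments):
        residue = residue * sa  # each residue depends on the previous one
        forecasts.append(trend_value + residue)
    return forecasts


trend_values = [100.0, 102.0, 104.0]   # hypothetical trend for 3 future steps
seasonal_adjustments = [1.5, 2.0, 0.5]  # hypothetical seasonal coefficients

exact = forecast_additive(trend_values, 10.0, seasonal_adjustments)
# Same forecast, but starting from a residue that is wrong by 2 units.
biased = forecast_additive(trend_values, 12.0, seasonal_adjustments)
errors = [b - e for b, e in zip(biased, exact)]
```

In this sketch the error at step h equals the initial error multiplied by the product of the first h seasonal coefficients, so it never disappears. In practice each estimated coefficient carries its own error as well, and these compound along the chain.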
Figure 3.5: Additive model
Calculation of trend
A trend of a product is the pattern that shows how its sales gradually increase
or decrease. It can be either a straight line that fits a linear function or a
curve. If we have a suitable pattern for a product, we know roughly in which
direction its sales are moving and how fast. We mainly apply three regression
methods to find the trend:
• linear regression;
• kernel regression;
• LOcal regrESSion (LOESS regression).
Among these three regression techniques, only linear regression is a parametric
method. The other two are non-parametric methods.
Linear regression $Y = \alpha X + \beta$ is a simple formula that everyone is
familiar with, and linear regression is an approach for defining the
relationship between a target variable Y and one or several explanatory
variables X based on this formula. The aim is to calculate the values of α and β
that lead to the smallest difference between the observed values and the
calculated values. In our case, the target variable is the sales quantity and
there is only one explanatory variable, the date. Therefore, the linear model
that we are looking for is as follows:
$$\mathrm{Quantity} = \alpha \times \mathrm{Date} + \beta$$
where Date = Year + Month/12 (transforming the date into a number). Once we have
the values of α and β, we can set up the linear function. Given a new value of
Date as input, this function outputs the expected Quantity. put graph here
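The fit can be sketched with the closed-form solution of ordinary least squares for one explanatory variable. Using least squares as the measure of "smallest difference" is an assumption here, and the sample quantities are synthetic values generated from a known line.

```python
def to_date_number(year, month):
    """Transform a date into a number: Date = Year + Month / 12."""
    return year + month / 12


def fit_linear_trend(dates, quantities):
    """Ordinary least squares fit of Quantity = alpha * Date + beta."""
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(quantities) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, quantities))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Synthetic data lying exactly on the line Quantity = 2 * Date + 5.
dates = [to_date_number(2014, m) for m in range(1, 7)]
quantities = [2 * d + 5 for d in dates]
alpha, beta = fit_linear_trend(dates, quantities)

# Expected quantity for a new date.
predicted = alpha * to_date_number(2015, 3) + beta
```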
Kernel regression As a non-parametric method, kernel regression aims at finding
a non-linear relation between an explanatory variable X and a target variable Y.
The regression function of Y on X is
$$m(x) = E(Y \mid X = x)$$
where m(x) is the regression function to be estimated. Several estimators of
this function are available; we pick a commonly used one, the Nadaraya-Watson
estimator:
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)}$$
where K is a kernel used as a weighting function, h is a bandwidth and
$K_h(u) = h^{-1} K(u/h)$. For example,
$K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is a commonly used kernel function.
Up to the factor $n^{-1}$, the denominator of the estimator is the kernel
density estimate of X:
$$\hat{f}(x) = n^{-1} h^{-1} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
Let us take an example where X has n points. At each point, the kernel
regression technique takes the n ∗ h points around that point and applies the
weighting function K(u) to their Y values. The average of these weighted Y
values then becomes the new target value of that point. In essence, the
technique smooths the values with a kernel function, and can thus build a model
that fits the given data well. put graph here
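The Nadaraya-Watson estimator with the Gaussian kernel given above can be sketched as follows. The sample points and the bandwidth are made-up values. The factor 1/h of K_h cancels between the numerator and the denominator, so the code applies K to (x - x_i) / h directly.

```python
import math


def gaussian_kernel(u):
    """K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)


def nadaraya_watson(x, xs, ys, h):
    """Kernel-weighted average of ys around x, with bandwidth h."""
    weights = [gaussian_kernel((x - xi) / h) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Made-up observations: a noisy upward trend.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.5, 2.0, 4.5, 4.0]
smoothed = [nadaraya_watson(x, xs, ys, h=1.0) for x in xs]
```

A small bandwidth follows the data closely, while a large one approaches the global mean of the Y values.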
LOESS regression The other non-parametric technique we employ is LOESS
regression. As in kernel regression, an operation is performed at each point:
LOESS regression fits a low-degree polynomial to a subset of the data around
every point in the data set. This subset consists of the neighbouring points of
the point in question. The polynomial is fitted using weighted least squares, in
which the points near the centre receive more weight and the points further away
on both sides receive less. It is also a form of data smoothing. The size of the
subset is determined by the bandwidth, as in kernel regression. Compared with
other regression methods, LOESS regression does not need a single global
function to be specified for the model; it only fits a polynomial to each subset
of the data. This flexibility makes it a good choice for modelling complex data.
On the other hand, it is a computationally expensive technique. put graph here
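A minimal sketch of the procedure at a single point is given below. It assumes a degree-1 polynomial, tricube weights and a bandwidth expressed as the fraction (span) of the data used in each local fit; none of these choices is stated in the text above, so they should be read as illustrative defaults.

```python
def tricube(u):
    """Tricube weight: 1 at the centre, falling to 0 at |u| = 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Estimate y at x0 with a weighted local linear fit."""
    n = len(xs)
    k = min(n, max(3, int(round(span * n))))  # size of the neighbourhood
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    d_max = max(abs(xs[i] - x0) for i in nearest)
    if d_max == 0:
        return sum(ys[i] for i in nearest) / k
    sw = swu = swy = swuu = swuy = 0.0
    for i in nearest:
        u = xs[i] - x0  # centre the neighbourhood on x0
        w = tricube(u / d_max)
        sw += w
        swu += w * u
        swy += w * ys[i]
        swuu += w * u * u
        swuy += w * u * ys[i]
    if sw == 0:
        return sum(ys[i] for i in nearest) / k
    denom = sw * swuu - swu ** 2
    if abs(denom) < 1e-12:
        return swy / sw  # degenerate case: fall back to the weighted mean
    slope = (sw * swuy - swu * swy) / denom
    # Because x is centred on x0, the intercept is the fitted value at x0.
    return (swy - slope * swu) / sw


xs = [float(i) for i in range(10)]
ys = [2 * x + 1 for x in xs]  # synthetic data on a straight line
fitted = [loess_point(x, xs, ys, span=0.5) for x in xs]
```

One local fit is solved per evaluation point, which is where the computational cost mentioned above comes from.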
Calculation of seasonality
As mentioned before, time-series models have another component that makes the
values vary: seasonality. The sales of some products are strongly related to
seasons, months, weeks or even days. To adjust the model to this kind of
product, we need to introduce seasonal adjustments. Seasonal adjustments are
coefficients calculated for each time period, with which the model can bring its
values closer to the real data. We use two approaches to compute these
coefficients for each month of the year.
In the first solution, we assume that there is a relation between a month's
sales quantity and the previous month's sales quantity. It can be expressed by
the following formula:
$$\mathrm{Sales}_m = \mathrm{SA}_m \times \mathrm{Sales}_{m-1}$$
where SA = seasonal adjustment. We need to add to the data table a new column
containing the previous month's sales quantity, as shown in Table 3.4, so that
we can compute the coefficients for each month.
reference  year  month  quantity  quantity1
1312114    2014  11     534       556
1312114    2014  12     614       534
...        ...   ...    ...       ...
1312115    2014  11     317       289
Table 3.4: Prepared data: seasonal coefficients computation, solution 1
(quantity1 = sales quantity of the previous month)
In this case, the first month of each reference cannot be taken into account in
the calculation of the seasonal coefficients, since it has no previous month's
sales.
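The data preparation for this first solution can be sketched as follows. The rows reuse the values of Table 3.4, with the October quantities read from the quantity1 column. The way the coefficient of a month is estimated from several rows (total quantity divided by total previous-month quantity) is an assumption made for illustration, since the text does not fix the estimator.

```python
from collections import defaultdict


def add_previous_quantity(rows):
    """Append the previous month's quantity to each row.

    rows holds (reference, year, month, quantity) tuples. The first month
    of each reference has no predecessor and is dropped.
    """
    by_key = {(ref, year * 12 + month): qty for ref, year, month, qty in rows}
    prepared = []
    for ref, year, month, qty in sorted(rows):
        previous = by_key.get((ref, year * 12 + month - 1))
        if previous is not None:
            prepared.append((ref, year, month, qty, previous))
    return prepared


def seasonal_coefficients(prepared):
    """Assumed estimator: SA_m = sum(quantity) / sum(quantity1) for month m."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for _ref, _year, month, qty, previous in prepared:
        numerator[month] += qty
        denominator[month] += previous
    return {m: numerator[m] / denominator[m]
            for m in numerator if denominator[m] > 0}


rows = [
    (1312114, 2014, 10, 556),
    (1312114, 2014, 11, 534),
    (1312114, 2014, 12, 614),
    (1312115, 2014, 10, 289),
    (1312115, 2014, 11, 317),
]
prepared = add_previous_quantity(rows)
coefficients = seasonal_coefficients(prepared)
```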
In the second solution, the hypothesis is that a relationship exists between the
current month's sales quantity and the average of the sales quantities of its 4
previous months. This leads to:
$$\mathrm{Sales}_m = \mathrm{SA}_m \times \frac{1}{4} \sum_{i=1}^{4} \mathrm{Sales}_{m-i}$$
As in the previous solution, we need to transform the data table. After the
transformation, we obtain Table 3.5:
reference  year  month  quantity  quantity1  quantity2  quantity3  quantity4
1312114    2014  11     534       556        515        478        533
1312114    2014  12     614       534        533        515        478
...        ...   ...    ...       ...        ...        ...        ...
1312115    2014  11     317       289        313        337        329
Table 3.5: Prepared data: seasonal coefficients computation, solution 2
(quantityN = sales quantity N months earlier)
Since 4 previous months are needed, we cannot use the first 4 months of sales of
each product to compute the coefficients.
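For the second solution, the sketch below builds the lagged columns from a monthly series of one product and computes the coefficient of a single row. The series reuses the values of the first row of Table 3.5; the function names are hypothetical.

```python
def lagged_rows(series, lags=4):
    """Pair each monthly quantity with the quantities of its previous months.

    series is the chronological list of monthly quantities of one product.
    The first `lags` months are skipped because their history is incomplete.
    """
    rows = []
    for i in range(lags, len(series)):
        previous = [series[i - j] for j in range(1, lags + 1)]
        rows.append((series[i], previous))
    return rows


def coefficient_from_row(quantity, previous_quantities):
    """SA_m = Sales_m / average of the previous months' sales."""
    average = sum(previous_quantities) / len(previous_quantities)
    return quantity / average


# July to November 2014 for reference 1312114 (first row of Table 3.5).
series = [533, 478, 515, 556, 534]
rows = lagged_rows(series, lags=4)
sa_november = coefficient_from_row(*rows[0])
```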
Clustering
We face a difficulty with the data: we do not have enough of it to train one
series of seasonal coefficients per product. For each product, the sales history
covers at most one and a half years, and for some recently launched products we
have less than one year of history. With so little history per product, seasonal
adjustments calculated from a single product would overfit the data. More
precisely, the model would assume that the sales variation next year will be
exactly the same as this year, which is quite rare in reality. To deal with this
lack of information, we bring in a machine learning technique: clustering. The
objective is to use clustering to gather products that have similar variations
over time and put them in the same subset. We then compute one series of
seasonal coefficients for each subset rather than one for each product.
Let us focus on the mechanism of clustering techniques. Clustering is an
unsupervised learning method that organises objects into groups sharing similar
characteristics. Imagine that we have a large set of data and want to split it
into several subsets, so that we can learn from each subset instead of from one
huge set, since it is sometimes easier to obtain the expected information from a
smaller set of data. Figure 3.6 shows the partition of data into clusters:
Figure 3.6: Clustering
Clustering, which aims to find structure within a given set of data, can be
applied in this case. There are several commonly used clustering models [6],
such as:
• Centroid models: k-means;
• Connectivity models: hierarchical clustering;
• Distribution models: expectation-maximisation;
• Graph theory models: Highly Connected Subgraph (HCS) algorithm.
Among these techniques, we are interested in applying the k-means algorithm and
hierarchical clustering.
K-means K-means is the most commonly used clustering algorithm and it is quite
easy to understand [7]. Assuming that we have a dataset containing n objects,
the objective of the k-means algorithm is to define k clusters. Each cluster
contains objects that have similar behaviours and characteristics. Each cluster
has a centroid (mean), a point to which the distance of each object is
calculated. The criterion for partitioning objects into clusters is to minimise
the within-cluster sum of squared distances:
$$\min \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$$
where $n_j$ is the number of objects in cluster j, $x_i^{(j)}$ is the i-th
object in the j-th cluster and $c_j$ is the centroid of the j-th cluster.
First of all, to initialise, the algorithm assigns a cluster to each object.
There are two commonly used initialisation methods. The first is the random
partition method, which randomly assigns a cluster to each object and then
computes the initial centroid (mean) of each cluster. It tends to place the
initial centroids close to the centre of the dataset. The other is the Forgy
method, which randomly picks k observations from the dataset and uses them as
the initial centroids; compared with the random partition method, it spreads the
initial centroids out over the dataset.
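The two initialisation methods can be sketched as follows. The retry loop that avoids empty clusters in the random partition method is an implementation choice made for this illustration, and the sample points are made up.

```python
import random


def forgy_init(points, k, rng):
    """Forgy method: pick k distinct observations as the initial centroids."""
    return [tuple(p) for p in rng.sample(points, k)]


def random_partition_init(points, k, rng):
    """Random partition: assign random clusters, then take each cluster mean."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of points")
    dims = len(points[0])
    while True:
        labels = [rng.randrange(k) for _ in points]
        if len(set(labels)) == k:  # retry until no cluster is empty
            break
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        centroids.append(
            tuple(sum(p[d] for p in members) / len(members) for d in range(dims))
        )
    return centroids


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
forgy_centroids = forgy_init(points, 2, rng)
partition_centroids = random_partition_init(points, 2, rng)
```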
After initialisation, the algorithm goes through all the objects; for each
object, it computes the distance between the object and every centroid and
assigns the object to the nearest cluster. The distance measure used in this
algorithm is the Euclidean distance:
$$\mathrm{distance} = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$
where n is the number of dimensions.
Once the algorithm has gone through all the objects and finished the
reassignment, it moves on to the next step: the update. The aim of this step is
to update the centroid of each cluster. Since some objects enter and others
leave each cluster during the reassignment, the centroid has to be recomputed
from the new members. The process repeats the reassignment step and the update
step in a loop until the members of each cluster no longer change. At that
point, each observation has found the cluster it belongs to, together with the
other objects it is similar to.
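The reassignment and update loop can be sketched as follows. The sketch uses the Forgy method when no initial centroids are given; the optional init argument, the iteration cap and the handling of empty clusters (the old centroid is kept) are assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (labels, centroids) after the assignments stop changing."""
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Reassignment step: each object joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        # Update step: recompute each centroid from its current members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Deliberately poor start: both initial centroids lie in the same group.
labels, centroids = kmeans(points, 2, init=[(0, 0), (0, 1)])
```

Even with both initial centroids in the same group, the loop separates the two groups after a few iterations in this example. With less clearly separated data the result can depend on the initialisation, which is why several runs with different seeds are common.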
Hierarchical clustering Unlike the k-means algorithm, hierarchical clustering
aims to build a hierarchy of clusters and does not require the number of
clusters to be defined in advance. There are two ways to build the hierarchy of
clusters (as shown in the dendrogram in Figure 3.7): agglomerative and divisive.
The agglomerative way is a "bottom-up" approach: it starts from the individual
observations at the bottom and gradually merges observations into clusters and
clusters into larger clusters until it reaches the top of the hierarchy. The
divisive way is the exact opposite, a "top-down" approach: it starts from the
top and gradually splits the data until it reaches the individual observations.
Figure 3.7: Hierarchical clustering
The biggest disadvantage of hierarchical clustering is its high complexity. In
general, the agglomerative method has a complexity of $O(n^3)$ and the divisive
method a complexity of $O(2^n)$. Both are expensive and require a lot of
computation. As the number of observations grows, the agglomerative method
becomes far less computationally expensive than the divisive one.
In the agglomerative method, how do we decide which clusters to combine? To do
this, we introduce a distance as a measure of similarity between two clusters.
The choice of distance affects the shape of the clusters, because which
observations are closer to or further from each other depends on the distance
used. The commonly used distances are as follows:
• Euclidean distance: $\sqrt{\sum_i (x_{1i} - x_{2i})^2}$;
• Maximum distance: $\max_i |x_{1i} - x_{2i}|$;
• Manhattan distance: $\sum_i |x_{1i} - x_{2i}|$.
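The three distances listed above can be written directly as small functions; the two sample points are made up to show that the distances generally give different values for the same pair.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maximum_distance(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def manhattan_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0, 0), (3, 4)
distances = (
    euclidean_distance(p, q),
    maximum_distance(p, q),
    manhattan_distance(p, q),
)
```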
Besides the distance, the agglomeration method is another factor that determines
how close two clusters (or observations) are. Some commonly used agglomeration
methods are:
• Complete method: $\max \{ \mathrm{dist}(c_1, c_2) : c_1 \in C_1, c_2 \in C_2 \}$;
• Single method: $\min \{ \mathrm{dist}(c_1, c_2) : c_1 \in C_1, c_2 \in C_2 \}$;
• Average method: $\frac{1}{|C_1||C_2|} \sum_{c_1 \in C_1} \sum_{c_2 \in C_2} \mathrm{dist}(c_1, c_2)$.
Once the hierarchy is constructed, we can choose the number of clusters that we
want, as shown in Figure 3.8.
Figure 3.8: Hierarchical clustering with determined clusters
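A naive agglomerative procedure with the three agglomeration methods can be sketched as follows. Instead of building the full dendrogram and cutting it afterwards, this sketch simply stops merging once the requested number of clusters remains, which yields the same partition as cutting the hierarchy at that level. The Euclidean distance and the sample points are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Each agglomeration method reduces the list of pairwise distances
# between two clusters to a single value.
LINKAGES = {
    "complete": max,
    "single": min,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerative(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of indices into points.
    """
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ds = [euclidean(points[i], points[j])
                      for i in clusters[a] for j in clusters[b]]
                d = combine(ds)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
three_clusters = agglomerative(points, 3, linkage="average")
two_clusters = agglomerative(points, 2, linkage="average")
```

Recomputing all pairwise distances at every merge keeps the sketch short but is costly, in line with the complexity discussed above.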
References
[1] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of
machine learning techniques for supply chain demand forecasting. European
Journal of Operational Research, 184(3):1140–1154, 2008.
[2] F. L. Chen and T. Y. Ou. Sales forecasting system based on Gray extreme
learning machine with Taguchi method in retail industry. Expert Systems
with Applications, 38(3):1336–1345, 2011.
[3] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-
scale matrix factorization with distributed stochastic gradient descent. In
Proceedings of the 17th ACM SIGKDD international conference on Knowledge
discovery and data mining - KDD ’11, pages 69–77, 2011.
[4] Michael D. Geurts and J. Patrick Kelly. Forecasting retail sales using alter-
native models. International Journal of Forecasting, 2(3):261–272, January
1986.
[5] A. Gunasekaran. Supply chain management: Theory and applications. European
Journal of Operational Research, 159(2):265–268, 2004.
[6] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM
Computing Surveys, 31(3):264–323, 1999.
[7] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31(8):651–666, 2010.
[8] Ken-ichi Kamijo and Tetsuji Tanigawa. Stock Price Pattern Recognition - A
Recurrent Neural Network Approach -. Architecture, pages 215–221.
[9] Jyrki Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient
descent for linear predictors. Information and Computation, 132:1–63, 1997.
[10] Chi-Jie Lu and Yuehjen E. Shao. Forecasting Computer Products Sales by
Integrating Ensemble Empirical Mode Decomposition and Extreme Learning
Machine. Mathematical Problems in Engineering, 2012:1–15, 2012.
[11] James T. Luxhøj, Jens O. Riis, and Brian Stensballe. A hybrid econometric-
neural network modeling approach for sales forecasting. International Journal
of Production Economics, 43(2-3):175–192, June 1996.
[12] Paris A. Mastorocostas, John B. Theocharis, and Vassilios S. Petridis. A
constrained orthogonal least-squares method for generating TSK fuzzy mod-
els: Application to short-term load forecasting. Fuzzy Sets and Systems,
118(2):215–233, March 2001.
[13] Sherri Rose. Big data and the future, 2012.
[14] Zhan-Li Sun, Tsan-Ming Choi, Kin-Fan Au, and Yong Yu. Sales forecasting
using extreme learning machine with applications in fashion retailing. Deci-
sion Support Systems, 46(1):411–419, 2008.
[15] The Apache Software Foundation. Apache Hadoop. Accessed 17/05/2015.
[16] Sébastien Thomassey and Antonio Fiordaliso. A hybrid sales forecasting
system based on clustering and decision trees. Decision Support Systems,
42(1):408–421, 2006.
[17] Sébastien Thomassey and Michel Happiette. A neural clustering and
classification system for sales forecasting of new apparel items. Applied Soft
Computing Journal, 7(4):1177–1187, 2007.
[18] John T. Mentzer and Mark A. Moon. Time Series Forecasting Techniques.
2004.
[19] Mircea Răducu Trifu and Mihaela Laura Ivan. Big Data: present and future.
Database Systems Journal (Academy of Economic Studies, Bucharest, Romania),
5(1):32–41, May 2014.
[20] W. K. Wong and Z. X. Guo. A hybrid intelligent model for medium-term sales
forecasting in fashion retail supply chains using extreme learning machine and
harmony search algorithm. International Journal of Production Economics,
128(2):614–624, 2010.
[21] G. Peter Zhang. Time series forecasting using a hybrid ARIMA and neural
network model. Neurocomputing, 50:159–175, January 2003.
[22] Tong Zhang. Solving large scale linear prediction problems using stochastic
gradient descent algorithms. In Proceedings of the twenty-first international
conference on Machine learning, volume 6, page 116, 2004.