CRANFIELD UNIVERSITY
Jinxing Lin
Application of machine learning
techniques for sales forecasting
School of Aerospace, Transport and Manufacturing
Software Engineering for Technical Computing
MSc. Thesis
Academic Year: 2014-2015
Supervisor: Irene Moulitsas
June 23, 2015
CRANFIELD UNIVERSITY
School of Aerospace, Transport and Manufacturing
Software Engineering for Technical Computing
MSc. Thesis
Academic Year: 2014-2015
Jinxing Lin
Application of machine learning
techniques for sales forecasting
Supervisor: Irene Moulitsas
June 23, 2015
This thesis is submitted in partial fulfilment of the requirements for the
degree of Master of Science
© Cranfield University, 2014. All rights reserved. No part of this
publication may be reproduced without the written permission of the
copyright owner.
Contents
1 Introduction 5
2 Literature review 9
2.1 Machine Learning Techniques for sales forecasting . . . . . . . . . . 9
2.1.1 Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Support vector machines . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . 11
2.1.4 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5 Recurrent neural networks . . . . . . . . . . . . . . . . . . . 12
2.1.6 Extreme learning machine . . . . . . . . . . . . . . . . . . . 12
2.2 Supporting technologies . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Scala . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.4 R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Methodology 17
3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.1 Data source . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.2 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Random forest with regression tree . . . . . . . . . . . . . . . . . . 18
3.2.1 Regression tree . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Time-series forecasting . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.3.1 Main components in time-series model . . . . . . . . . . . . 22
3.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Results 34
5 Analysis 35
6 Conclusion 36
Abstract
Sales forecasting is considered increasingly important by companies and industries. An accurate prediction can significantly help them understand the future trend of their products and therefore make better sales plans and prepare the production, stocking and transport of products. All these improvements minimise costs while satisfying clients' demands. In other words, supply chains can be considerably improved by precise sales forecasting.
TODO: techniques
TODO: conclusions
Acknowledgements
Chapter 1
Introduction
Information Technology (IT) plays an increasingly important role in supply chain management (SCM), from basic IT infrastructure in a company to the Virtual Enterprise, and from the storage of data to its analysis and even sales forecasting [5]. Accurate sales forecasting can be very useful when companies make decisions on production planning and sales prices. It can also help companies distribute their resources effectively, reduce unnecessary costs and provide satisfactory customer service. Meanwhile, sales forecasting is affected by many factors, such as the lifespan of a product, the economic climate, competition and globalisation. In the last twenty years, much research aiming to improve sales forecasting has been carried out, most of it based on machine learning techniques.
First of all, let us focus on what machine learning (ML) is. In recent years, ML has been one of the hottest topics in computer science; even non-technical people can hardly have escaped the articles, headlines, videos and TV programmes on the rise of big data [13, 19] and machine learning. Briefly, big data refers to datasets which fit the following "3V" criteria:
• Volume: The volume of data is so large that it determines the value of the dataset. Traditional data treatments based on a single machine can no longer cope with the growing volume of data, so new scalable data treatments are being developed to handle it.
• Variety: The variety of data refers, in this context, to the different forms data can take. Big data can be stored in structured databases, text, images, voice messages or videos, and it can be collected from various sources such as the internet, mobile phones, personal PCs and wearable devices.
• Velocity: Big data is generated very rapidly. For example, during each trading session the New York Stock Exchange obtains 1 TB of trade information.
The volume of global data is increasing dramatically, which makes finding suitable technologies to support this extreme growth more and more challenging. As shown in Figure 1.1 below, within 5 years the volume of data will be 10 times what it was 2 years ago, so the search for appropriate supporting technologies is becoming urgent. A suitable technology should be able to store large volumes of data, such as several TB; it should be able to compute in parallel, which significantly accelerates the computation and provides high performance; in addition, support for data streaming is an attractive feature. The two most used platforms designed for big data are Hadoop and Spark. These two platforms use different high performance computing techniques to distribute data and jobs; a brief presentation of both technologies is included in the following chapter.
Figure 1.1: Growth of global data 1
1 Source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
The use of big data is expanding rapidly and it is getting into every aspect of our lives. For example, police forces predict when and where crimes will happen based on big data; health centres use big data to predict the onset of diseases; travel agencies use big data to understand customers' preferences and elaborate attractive travel plans; and accurate stock forecasting is becoming possible with big data. Our lives are now surrounded by big data and, in order to make sense of it and uncover its hidden value, we use machine learning algorithms to explore it.
Machine learning brings computer science and statistics together in order to learn from data and find suitable patterns or models for it. With these patterns or models, it is possible to project into the future and therefore to predict. Broadly, machine learning techniques can be categorised by their purpose:
• classification: training a model to assign observations correctly to their classes, such as classifying whether a customer is loyal or not;
• regression: training a model to predict a continuous output, such as stock forecasting;
• clustering: without knowing the groups beforehand, training a model to split the input observations into groups, such as in image recognition.
Nowadays, ML is widely used by all kinds of companies. Google uses ML in its search engine; Facebook uses ML to recommend friends or advertisements to us; Tesco uses ML to distribute coupons to customers; and weather services use ML to forecast the coming days' weather.
Given this background of big data and machine learning algorithms, this thesis is about discovering and applying machine learning techniques. Among the many domains in which machine learning can be used, we find it interesting to investigate the role machine learning can play in sales forecasting, a subject of intense interest in business analysis. The aims of this thesis are:
• studying machine learning algorithms and understanding their mechanisms;
• analysing the given dataset and elaborating a plan for how the data should be used;
• looking for pertinent patterns in the dataset and making sales forecasts.
To achieve these objectives, we will first explore different machine learning algorithms in depth, choose some promising ones which might suit the dataset, customise these techniques with regard to the data, and then compare their performance.
The following chapter is the literature review, which mainly covers machine learning techniques that have been applied to sales forecasting and some commonly used technologies for machine learning. In chapter 3, we describe where the data comes from, how we pre-process it, and which ML algorithms we apply and why we chose them. After the methodology chapter, the results of the different algorithms are presented with diagrams and charts in chapter 4. The penultimate chapter contains the analysis of the results, that is, comparisons of the performance of the different techniques. The last chapter concludes the research, summarising the overall process and the future prospects for continuing this work.
Chapter 2
Literature review
2.1 Machine Learning Techniques for sales forecasting
The exponential smoothing model [4] is one of the earliest models applied to sales forecasting, as is the AutoRegressive Integrated Moving Average (ARIMA) model [21]. Forecasting models such as the Neural Network (NN) model [21] and fuzzy models [12] have also often been applied to sales forecasting. As the volume of data grows rapidly and the demand for accuracy gets higher, newer machine learning techniques are being applied to the sales forecasting of different types of data. For example, clustering [16] and decision trees [16] have been used to develop a sales forecasting system for textile-apparel distribution, and neural clustering and classification [17] have been applied to forecast sales of new apparel items. In recent years, the Extreme Learning Machine (ELM) has frequently been applied to sales forecasting in combination with other techniques.
The following sub-sections present some interesting techniques that might be applied or tested during this research.
2.1.1 Regression Trees
Before introducing regression trees, it is important to clarify that a decision tree is a predictive model which can be used to evaluate the value of a certain feature of a system from observations of its other features. A regression tree is a form of decision tree specific to numeric data. Based on the given data, it trains a tree model whose leaves represent groups of instances in the same class and whose branches represent the separation of instances into different groups (Figure 2.1). It is a supervised learning method that trains its model with known inputs and known outputs. Once the regression tree is constructed, new data can be fed in and the tree outputs a prediction for it. Regression trees have proved to be an efficient tool for sales forecasting in textile distribution [16].
Figure 2.1: Regression tree
2.1.2 Support vector machines
Support vector machines are machine learning models used for regression and classification. The principle of this model is to create one or a set of hyperplane(s) between different classes; the hyperplanes are required to maximise the margins between the classes so as to obtain the lowest classification error. Let us take the example shown in Figure 2.2. A hyperplane in geometry is a subspace of one dimension less than its ambient space. Often, it is not feasible to separate the data in the original dimension; in this case, the data is projected onto a higher dimension. In the paper [1], a hybrid system of RNN and SVM was applied to forecast sales, and the results showed that the hybrid system outperformed simple traditional forecasting techniques such as moving averages.
1 Source: http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html
Figure 2.2: SVM with hyperplane 1
2.1.3 Stochastic gradient descent
Sometimes, optimising the training of a model can make it considerably more efficient at prediction. An often used optimisation algorithm is gradient descent (GD) [9]. It aims to find the minimum of a function by taking steps in the opposite direction of the gradient at each point; here, the gradient represents the direction in which the function increases fastest. A training model, for example a linear regression model, has a cost function, and the gradient descent method is applied to train the model by minimising this cost function.
Gradient descent can be roughly divided into two methods: batch gradient descent (BGD) and stochastic gradient descent (SGD) [3, 22]. BGD runs many iterations, and in each iteration it trains on all the data to calculate the gradient. By contrast, in each iteration SGD picks only a single random example to estimate the gradient. As the training data size in each iteration is smaller in SGD than in BGD, and the SGD algorithm does not need to remember the examples which have been previously studied, SGD is often considered the more efficient method. In the research [?], Bottou studied the application of SGD to several machine learning systems, such as K-Means, SVM and Lasso, and showed that they performed better with SGD.
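To make the mechanism concrete, the following is a minimal sketch in Scala (the main programming language used later in this thesis) of SGD fitting a straight line y = a*x + b under squared-error loss; the data, learning rate and iteration count are all illustrative.

```scala
// Minimal SGD sketch: fit y = a*x + b by squared-error loss.
// The toy data roughly follows y = 2x + 1.
object SgdSketch {
  def main(args: Array[String]): Unit = {
    val data = Seq((1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8)) // (x, y)
    val rng = new scala.util.Random(42)
    var a = 0.0
    var b = 0.0
    val lr = 0.01                                 // learning rate
    for (_ <- 1 to 10000) {
      val (x, y) = data(rng.nextInt(data.length)) // one random example per step
      val err = (a * x + b) - y                   // prediction error
      a -= lr * 2 * err * x                       // gradient of the loss w.r.t. a
      b -= lr * 2 * err                           // gradient of the loss w.r.t. b
    }
    println(s"a = $a, b = $b")                    // should approach a ≈ 2, b ≈ 1
  }
}
```

Unlike BGD, each step touches a single example, so the cost per iteration stays constant regardless of the dataset size.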
2.1.4 Neural networks
In terms of neural networks there are various types, and the one that has been most widely applied is the feed-forward, error back-propagation neural network [11, 21]. A neural network has at least two layers: the input layer and the output layer. One or more hidden layers can be inserted between the input and output layers, and each layer contains its own elements. As shown in Figure 2.3, the data is fed into the input layer, passes through the hidden layers where the training algorithms are applied, and is finally sent to the output layer. In this type of neural network, the information only flows forward. The network adjusts its weights using the feedback obtained by comparing the predicted values with the correct answers.
Figure 2.3: Artificial neural network
2.1.5 Recurrent neural networks
Recurrent neural networks (RNN) are developed on the basis of neural networks. The main difference between them is that an RNN allows some outputs of neurons to loop back and become inputs of other neurons. This means that the information does not flow in only one direction and can be processed several times before being output, as indicated in Figure 2.4. Owing to this, an RNN can perform better than a one-directional neural network, for example in stock price forecasting [1, 8].
Figure 2.4: Recurrent neural network
2.1.6 Extreme learning machine
Extreme Learning Machine (ELM) has been applied to fashion supply chains for sales forecasting [14, 20]. It is an algorithm for single-hidden-layer feedforward neural networks (SLFN). In this algorithm, the input weights and the hidden biases are determined randomly, while the output weights are determined analytically by ELM using the Moore-Penrose (MP) generalized inverse. In comparison with traditional gradient-based algorithms, ELM is faster and more effective. In addition, it avoids several problems faced by gradient-based learning algorithms, such as stopping criteria, the learning rate and the number of learning epochs. The experimental results in [14] demonstrate that the performance of the ELM model is superior to some sales forecasting algorithms based on back-propagation neural networks. In some studies, ELM is combined with other machine learning techniques to build forecasting models [2, 10, 20].
ELM and harmony search algorithm
In the research [20], the Harmony Search (HS) algorithm was applied in combination with ELM for sales forecasting in fashion retail supply chains. The harmony search algorithm can be integrated with ELM to construct a novel meta-heuristic optimisation algorithm with which optimal NN weights, and hence high forecasting performance, can be obtained. According to [20], this hybrid intelligent system significantly outperforms traditional ARIMA models as well as two other developed neural network models for fashion sales forecasting.
ELM and ensemble empirical mode decomposition
Empirical mode decomposition (EMD) is a signal processing technique based on the local characteristic time scales of a signal; it is usually applied to decompose a signal into a finite and small number of components called intrinsic mode functions (IMF). To alleviate the major problem of EMD, the mode mixing problem, ensemble empirical mode decomposition (EEMD) was developed: EEMD augments the EMD method with a noise-assisted data analysis step designed to reduce mode mixing. With EEMD, original sales data can be converted into IMFs, which are then input into the ELM method to forecast sales of computer products [10]. Compared with a single ELM, a single support vector regression (SVR) and a single back-propagation neural network (BPN), the EEMD-ELM model performs better at forecasting computer product sales.
ELM and Gray relation analysis
Gray relation analysis (GRA) can be integrated with ELM to set up a hybrid sales forecasting system (GELM). GRA measures the relative distance between compared series of data and a reference series of data, called the Gray relation grade (GRG). The ranking of the GRGs shows which factors affect the sales amounts the most, and these influential factors are used as the input variables of the ELM models. Experimental results show that the GELM hybrid system outperforms BPN and MFLN models [2].
2.2 Supporting technologies
Nowadays, in the age of big data, more and more technologies designed for big data are coming to light. In this section, we present some of the most recently used technologies, which we may use in this thesis project.
2.2.1 Apache Hadoop
Apache Hadoop [15] is a platform designed for distributed storage and distributed processing, dedicated to large-volume datasets. Basically, Hadoop takes data files as input, decomposes the data into large blocks and distributes them across the nodes of the cluster. Each node also receives packaged code and executes the computation on its own data in parallel, which greatly improves efficiency compared with computing all the data on a single node.
In terms of components, there are four main components in Hadoop:
• Hadoop Common: contains libraries and supports the other Hadoop modules;
• Hadoop Distributed File System (HDFS): provides high-throughput access and performs best with large files;
• Hadoop Yet Another Resource Negotiator (YARN): a distributed resource scheduler;
• Hadoop MapReduce: a distributed processing framework which decomposes work into small parallelised map and reduce tasks.
2.2.2 Apache Spark
Apache Spark is a cluster computing platform built for processing large datasets. Compared with one of its competitors, Hadoop, Spark extends the MapReduce model (used by Hadoop) to support more types of computation. The MapReduce model writes intermediate data to disk, while Apache Spark keeps data in the buffer cache (memory); because of this, Apache Spark outperforms Apache Hadoop in terms of speed in most cases.
Basically, Apache Spark consists of two principal parts: a management system and a distributed storage system. The management system can be Spark standalone, Spark pseudo-distributed mode, Hadoop YARN or Apache Mesos. Likewise, there are many choices for the distributed storage system, such as HDFS, Cassandra and Amazon S3.
The platform is written in Scala and is compatible with several languages: Scala, Java, Python and SQL. This high compatibility with the main data processing languages has helped Apache Spark become more and more popular.
The main components of Apache Spark are:
• Spark Core, which is used for task scheduling, memory management and fault recovery, and which provides the API for creating and manipulating Resilient Distributed Datasets (RDDs, collections of objects distributed across computation nodes so that they can be processed in parallel);
• Spark SQL, which supports a new data abstraction for structured data: SchemaRDD;
• Spark Streaming, which allows data streams to be manipulated;
• MLlib, a Machine Learning (ML) library, which provides some common machine learning algorithms;
• GraphX, which is used to perform distributed graph computations.
Since this thesis is mostly about applying machine learning algorithms to large volumes of data, Apache Spark was chosen as the cluster computing platform to which we submit our computing tasks.
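As a minimal sketch of how a computing task can be submitted to Spark, the following standalone Scala program (written against the Spark 1.x API in use at the time of this thesis; the application name, master URL and data are placeholders) creates an RDD and reduces it in parallel:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: local mode using all available cores.
    val conf = new SparkConf().setAppName("SalesForecasting").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // A Resilient Distributed Dataset built from a local collection;
    // operations on it run in parallel across its partitions.
    val quantities = sc.parallelize(Seq(34, 26, 15, 40))
    val doubled = quantities.map(_ * 2) // transformation, evaluated lazily
    val total = doubled.reduce(_ + _)   // action: distributed sum
    println(s"Total: $total")
    sc.stop()
  }
}
```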
2.2.3 Scala
Scala is a programming language that runs on the JVM and supports two paradigms at once: object-oriented programming and functional programming. As a fusion of the two programming concepts, Scala allows structures and elements to be built that treat computation as the evaluation of mathematical functions, while also allowing large systems to be built with component abstraction and a legible structure. Moreover, it is open source, which makes it easy to adopt.
As Apache Spark is compatible with Scala, and Scala is less verbose in comparison with Java and Python, we chose Scala as our main programming language.
2.2.4 R
As mentioned before, Spark is an enterprise-level development tool and is very efficient for applying the basic machine learning algorithms provided in MLlib. For deep learning algorithms, however, it may not be a wise choice. Nowadays, "GPU + CUDA" is becoming the main architecture on which data scientists build and run their deep learning programs, thanks to the high performance with which GPUs process such tasks.
Spark is not a good choice for building customised machine learning algorithms either, since it does not provide a matrix data structure, and most machine learning techniques make heavy use of matrices.
In this project, some deep learning techniques such as neural networks and the extreme learning machine are expected, and building customised algorithms is indispensable, so we need another tool for performing these techniques and constructing our own algorithms. After researching different alternative tools, we decided on R. R is a programming language which provides a numerical computing environment (supporting matrix arithmetic) and supports performing computations on CUDA GPUs, which can speed up the computations. Moreover, thanks to a recently released package, SparkR, it is possible to scale R programs in Spark (in a distributed fashion).
Chapter 3
Methodology
3.1 Data
3.1.1 Data source
To examine the effectiveness of different machine learning algorithms in sales forecasting, a dataset was provided by Didactic, a French surgical equipment company. This dataset consists of the sales of surgical equipment in the period from October 2013 to March 2015 and covers 758 different products. Each product has its own reference, its daily sales quantity for each client, and its price. In addition, a dataset of calls to tender for surgical equipment in the market is also available. The aim is to build a sales forecasting model with the information provided in these two datasets.
3.1.2 Data preparation
First of all, the program reads the data from a file which contains all of Didactic's sales records. In the original data file the data is separated by product, by order and by day, whereas what we want is the total quantity of daily sales of each product, so summing all the sales quantities per product per day is essential. Table 3.1 shows an example of the data after this daily aggregation.
According to the marketing experts at Didactic, seasonality might be an important factor affecting the sales: the sales might vary within a certain range according to the month of the year. Following this marketing assumption, we will also analyse the sales changes month by month. Therefore, a sum of the sales quantities per product per month is needed; an example of the processed data is shown in Table 3.2.
reference year month day quantity
1312114 2014 11 13 34
1312114 2014 11 14 26
... ... ... ... ...
1312115 2014 11 13 15
Table 3.1: Prepared data: daily sum
reference year month quantity
1312114 2014 11 543
1312114 2014 12 614
... ... ... ...
1312115 2014 11 317
Table 3.2: Prepared data: monthly sum
Based on the assumption of seasonality, we suppose that there is a relation between the average quantity of the four previous months' sales and the quantity of the following month's sales. Hence, a computation of the average quantity of the four previous months' sales is required. After another data preparation step, we obtain data as shown in Table 3.3 (a hedged sketch of this preparation in Spark follows the table):
reference year month quantity avg4premonths
1312114 2014 11 543 517
1312114 2014 12 614 627
... ... ... ... ...
1312115 2014 11 317 226
Table 3.3: Prepared data: monthly sum and average of four previous months’ sales
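As a hedged sketch of this preparation, the aggregations above might be expressed with Spark's RDD API as follows; the MonthlySale case class and the toy records are illustrative stand-ins for Didactic's actual schema, not the real file format.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative record type; not Didactic's actual schema.
case class MonthlySale(reference: String, year: Int, month: Int, quantity: Int)

object PrepSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("prep").setMaster("local[*]"))
    val sales = sc.parallelize(Seq(      // toy rows standing in for the real file
      MonthlySale("1312114", 2014, 7, 120), MonthlySale("1312114", 2014, 8, 130),
      MonthlySale("1312114", 2014, 9, 125), MonthlySale("1312114", 2014, 10, 140),
      MonthlySale("1312114", 2014, 11, 543)))
    // Monthly sum per product (Table 3.2): key by (reference, year, month).
    val monthly = sales.map(s => ((s.reference, s.year, s.month), s.quantity))
                       .reduceByKey(_ + _)
    // Average of the 4 previous months (Table 3.3): collect each product's
    // small monthly history and slide a 4-month window over it.
    val withAvg4 = monthly
      .map { case ((ref, y, m), q) => (ref, (y * 12 + m, q)) }
      .groupByKey()
      .flatMap { case (ref, hist) =>
        val sorted = hist.toSeq.sortBy(_._1)
        sorted.indices.drop(4).map { i =>
          val avg4 = sorted.slice(i - 4, i).map(_._2).sum / 4.0
          (ref, sorted(i)._1, sorted(i)._2, avg4) // (ref, month index, qty, avg4)
        }
      }
    withAvg4.collect().foreach(println)
    sc.stop()
  }
}
```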
3.2 Random forest with regression tree
3.2.1 Regression tree
As mentioned in the literature review, regression trees are methods commonly used to train models with given predictor variables x and a continuous response y. With these models, we can predict the value of y for a new given value of x.
In essence, the regression tree algorithm analyses the inputs and creates the tree from the top (root node) to the bottom (leaves). The construction of a tree is a recursive partition. Each node works like a filter: the filter is a question about a particular feature, and it filters the input instances into sub-nodes or leaves. These conditions are based on the features of the instances, for example "Is price > £20?" or "Is lifespan < 5 months?". The leaves are the final groups of instances, which lead to similar outputs; a point x belongs to a leaf if x is assigned to the corresponding cell of the partition. Trees grow exponentially with their depth. Once the partition of the instances is done, the average of the target value over all instances in each leaf is calculated, and this average becomes the prediction value for all instances assigned to that group. Then, when a new instance is input into the model, it travels down the tree until it reaches one of the leaves and takes the prediction target value assigned to that leaf.
One of the most important advantages of the regression tree is its interpretability. As each node contains a question about a certain feature, the whole model is very simple to understand, even for people without a machine learning background. It can be considered as a tree containing many conditions: once a condition is satisfied, the instance goes to one side; if it is not, it follows the other branch. No matter how big the tree is, it remains understandable. Once the model is built, making predictions is fast because there is no complex calculation to execute.
Implementation
As mentioned in the literature review, Spark provides a library of machine learning algorithms which includes regression tree algorithms. We apply the regression tree technique to two main data models.
The first model is based on the hypothesis that future sales depend only on the year and the month. We effectively treat it as a time-series model. In a time-series model, time (or date, in this case) plays the key role, while other factors such as price changes are secondary; a secondary factor can also affect the sales, but only over a small portion, and this small portion is considered as noise. More details about the time-series model are given in the next section. In this first model, then, we have year and month as features and quantity as the target. The regression tree algorithm analyses the inputs and builds a tree like the example in Figure 3.1:
Figure 3.1: Regression tree with year-month model
In this graph, we can observe that year and month are used as features. Each node asks a question about one of these two features and splits the input instances into sub-partitions. At the bottom, we obtain groups of instances which we call leaves. There is a quantity value assigned to each leaf, namely the average of the quantity over the instances that fell into that leaf. After the construction of the regression tree, when a new feature instance comes in for sales forecasting, for example (year = 2016, month = 2), it travels from the top to the bottom of the tree and falls into a leaf; it then takes the quantity of that leaf as its predicted target value.
Compared with the first model, the only difference in the second model is that it contains one more feature, qty4 (the average of the 4 previous months' sales). It uses exactly the same procedure to train the regression tree and to predict the target value of new instances; a sketch of the training call is given below.
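A hedged sketch of training the year-month model with MLlib's RDD-based regression tree API follows; the LabeledPoint rows are illustrative, and the snippet assumes an existing SparkContext sc (as in spark-shell).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Toy training data: label = monthly quantity, features = (year, month).
val training = sc.parallelize(Seq(
  LabeledPoint(543, Vectors.dense(2014, 11)),
  LabeledPoint(614, Vectors.dense(2014, 12)),
  LabeledPoint(480, Vectors.dense(2015, 1))))

val model = DecisionTree.trainRegressor(
  training,
  categoricalFeaturesInfo = Map[Int, Int](), // both features treated as continuous
  impurity = "variance",                     // the impurity used for regression
  maxDepth = 5,
  maxBins = 32)

val prediction = model.predict(Vectors.dense(2016, 2)) // forecast for Feb 2016
```

For the second model, the feature vectors would simply gain a third component, qty4.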
Results
TODO put tuning algorithms’ results
3.2.2 Random forest
Overfitting the training dataset is a problem that often occurs with regression trees, and the random forest was introduced to correct it. The random forest method was proposed and developed by Leo Breiman in 2001 [?]. It represents a family of ensemble learning methods for classification and regression. These methods use decision trees exclusively as classifiers and output either the class or the mean prediction of the individual trees. During the construction of the decision trees, a random factor is introduced through the use of Bagging and Random Feature Selection. In a random forest, all the decision trees operate independently. Each decision tree is built from a random vector of parameters: the k-th tree in the forest depends only on the vector θ_k and is independent of the other vectors. All the trees participate in the final decision. A graphical presentation of the random forest structure is given in Figure 3.2:
Figure 3.2: Random Forest
Bagging
Bagging is a method that selects a subset of the training data for the construction of each decision tree in the random forest; these sub-datasets are called bootstrap samples. Bagging repeatedly and randomly picks a random sample with replacement from the training dataset. TODO put a figure to demonstrate Bagging
Random Feature Selection
Random Feature Selection focuses on the features of the data. It randomly selects a fixed number k of features and then, among these k chosen features, selects the one which optimises the partition of the data. The optimal feature is defined by comparing the impurity of the data assigned to the sub-nodes. TODO put a figure to demonstrate RFS
All in all, bagging is applied to construct a random sub-dataset for each decision tree, while random feature selection is used to select the optimal feature among a sub-group of features during the learning of each tree. Together, these two methods lead to a machine learning model which does not overfit the input dataset.
Implementation
The random forest algorithm is provided by MLlib, and it is called in much the same way as the regression tree. In order to find the best parameters for our dataset, we tune the algorithm with different values for each parameter, such as the number of trees in the forest, the maximum possible depth of a tree and the maximum number of bins per feature, as sketched below.
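A sketch of what such a parameter sweep might look like is shown below; training and test are assumed to be RDD[LabeledPoint] values prepared as in the regression tree section, and the candidate parameter values are illustrative.

```scala
import org.apache.spark.mllib.tree.RandomForest

val results = for {
  numTrees <- Seq(10, 50, 100) // number of trees in the forest
  maxDepth <- Seq(4, 8, 12)    // maximum depth of a tree
  maxBins  <- Seq(16, 32)      // maximum number of bins per feature
} yield {
  val model = RandomForest.trainRegressor(
    training, Map[Int, Int](), numTrees, "auto", "variance", maxDepth, maxBins, 42)
  // Mean squared error on held-out data for this parameter combination.
  val mse = test.map(p => math.pow(model.predict(p.features) - p.label, 2)).mean()
  ((numTrees, maxDepth, maxBins), mse)
}
println(results.minBy(_._2)) // best (numTrees, maxDepth, maxBins) by MSE
```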
Result
TODO put tuning results here
3.3 Time-series forecasting
Applying some existing machine learning algorithms directly to the data does give reasonable results. But in order to gain more control over the learning methods, we decided to build an algorithm suited to this use case and this dataset. Since the main data we have is related to dates, and the marketing experts told us that seasonality might affect the sales, we decided to build a time-series model. A time-series technique looks only at the patterns in the history of actual sales and, based on these patterns, predicts future sales [18].
3.3.1 Main components in time-series model
There are four main components to take into account when setting up a time-series model:
• level: a horizontal sales history;
• trend: a pattern that represents a continuous increase or decrease in sales;
• seasonality: a pattern that represents how sales repeatedly increase and decrease within a certain period (e.g. one year);
• noise: a random fluctuation which might be explained by factors other than time, such as price changes or the quality of customer service.
1 Source: [18]
Figure 3.3: Time-series components 1
3.3.2 Implementation
Visualisation of data
In order to get a better insight of the data, we have decided to visualise the data
in the form of line charts. The first line chart is about the monthly total sales of
all the products in Didactic. Actually, we want to have a global view of how the
sales varies throughout the time. With this view, we can roughly see if there is a
trend or seasonality in the products’ sales in this company. For creating this line
chart, we go through each month and sum up all the sales’ quantity. Then we
display these sums with their corresponding date so that we can see how the sales
evolve with time. Put graphs here
After the global view, we also need some line charts providing more detail. In fact, time-series models are usually constructed for one product or for a family of similar products. We therefore decided to draw a quantity-date line plot for each of the 20 products that have been sold for the longest time. We take these top 20 products because they have the most sales history, which allows a clearer view of how the sales evolve with time. These plots show the evolution of daily sales for the different products, so they often have sudden peaks or drops (short-term fluctuations). To reduce the unwanted effect of these short-term fluctuations, we apply the moving average technique. A moving average is used to smooth out short-term fluctuations and emphasise long-term trends or periodic cycles. Essentially, for each data point it creates a small subset containing the data around that point and then calculates the average of the data in that subset; these averages become the new values of the points, as in the sketch below. Put graphs here
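A minimal sketch of a centred moving average in plain Scala is given below; the window width and the toy series are illustrative.

```scala
// Replace each point by the average of the points in a window around it.
def movingAverage(series: Seq[Double], window: Int): Seq[Double] =
  series.indices.map { i =>
    val from  = math.max(0, i - window / 2)
    val to    = math.min(series.length, i + window / 2 + 1)
    val slice = series.slice(from, to) // the points around position i
    slice.sum / slice.length           // their average becomes the new value
  }

val daily = Seq(34.0, 26.0, 90.0, 15.0, 28.0, 31.0) // toy series with one peak
println(movingAverage(daily, 3))                    // the peak at 90 is damped
```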
Proposed models
There are many possible approaches to constructing a model for a time-series problem. We propose here two different models consisting of trends and seasonal coefficients:
• the multiplicative model;
• the additive model.
Multiplicative model This model is based on the assumption that target values can be predicted by multiplying the values on a trend by the relevant seasonal coefficients. Figure 3.4 and the following formula describe one such multiplicative model:
Sales_t = Trend_t × SA_t + Noise_t
where SA is the seasonal adjustment and t is the time (day/week/month/etc.). In this model, we consider that a proportional relationship exists between the target values and the values on the trend. Once we have calculated the seasonal coefficients and the trend, we can project both into the future and forecast a target value for a given input.
Figure 3.4: Multiplicative model
Additive model Apart from the multiplicative model, we employ an additive model as well. The difference is that in this model we assume that a target value can be calculated as the sum of a value on the trend and a relevant residue. For each observation there is a corresponding residue, and the residues are related to each other. Given this relationship, the objective is to compute the seasonal coefficients based on the residue of each observation, predict the future residues and therefore obtain future target values. As shown in Figure 3.5 and the formulas
Sales_t = Trend_t + Residue_t + Noise_t
Residue_t = Residue_{t−1} × SA_t
where SA is the seasonal adjustment and t is the time (day/week/month/etc.), the sales quantity is the sum of the trend and the residue, and each residue is calculated from its previous one. Compared with the multiplicative model, this model has a shortcoming: it may propagate errors if we want to predict values for more than one point in the future. Since each residue is computed from the previous one, an error anywhere propagates through the process; therefore, the further ahead the point we want to predict, the bigger the error may be.
Figure 3.5: Additive model
Calculation of trend
A trend of a product is the pattern that shows how the sales gradually increase or decrease. It can be either a line fitting a linear function or a curve. If we have a suitable pattern for a product, we know roughly in which direction the product's growth is heading and how fast it is going. We mainly apply three regression methods to look for the trend:
• linear regression;
• kernel regression;
• LOcal regrESSion (LOESS regression).
Among these three regression techniques, only linear regression is a parametric method; the other two are non-parametric methods.
Linear regression Y = α × X + β is a very simple formula that everyone is familiar with, and linear regression is an approach for defining the relationship between a target variable Y and one or several explanatory variables X based on this simple formula. The aim is to calculate the values of α and β which lead to the smallest difference between the expected values and the calculated values. In our case, the target variable is the sales quantity and there is only one explanatory variable, the date. Therefore, the linear model that we are looking for is as follows:
Quantity = α × Date + β
where Date = Year + Month/12 (transforming the date into a number). Once we have the values of α and β, we can set up the linear function; inputting a new value of Date into this function yields a predicted Quantity, as in the sketch below. put graph here
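A sketch of this fit in plain Scala, using the standard closed-form least-squares estimates α = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and β = ȳ − α x̄; the (Date, Quantity) points are illustrative.

```scala
// Least-squares fit of Quantity = alpha * Date + beta, Date = Year + Month / 12.
val points = Seq(
  (2014.0 + 10 / 12.0, 500.0), (2014.0 + 11 / 12.0, 543.0),
  (2014.0 + 12 / 12.0, 614.0), (2015.0 + 1 / 12.0, 580.0))

val (xs, ys) = points.unzip
val n = points.length
val xMean = xs.sum / n
val yMean = ys.sum / n
val alpha = xs.zip(ys).map { case (x, y) => (x - xMean) * (y - yMean) }.sum /
            xs.map(x => (x - xMean) * (x - xMean)).sum
val beta = yMean - alpha * xMean
println(alpha * (2015.0 + 6 / 12.0) + beta) // forecast for June 2015
```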
Kernel regression As a non-parametric method, kernel regression aims at finding a non-linear relation between an explanatory variable X and a target variable Y. The regression function of Y on X is
m(x) = E(Y | X = x)
where m(x) is the regression function to be estimated. There are different estimators available for the regression function; we pick a commonly used one, the Nadaraya-Watson estimator:
m̂_h(x) = (Σ_{i=1}^{n} K_h(x − x_i) y_i) / (Σ_{i=1}^{n} K_h(x − x_i))
where K is a kernel used as a weighting function and h is a bandwidth. For example, the Gaussian kernel K(x) = (1/√(2π)) e^{−x²/2} is commonly used. The corresponding kernel density estimator can be written as
f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − x_i)/h).
Consider an example where X has n points. At each point, the kernel regression technique takes the n × h points around that point and applies the weighting function K(u) to their Y values; the average of these weighted Y values becomes the new target value of that point. In essence, the technique smooths the values with a kernel function and can thus build a model which fits the given data well, as in the sketch below. put graph here
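A sketch of the Nadaraya-Watson estimator with a Gaussian kernel in plain Scala; the bandwidth h and the sample points are illustrative.

```scala
// Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi).
def gaussian(u: Double): Double = math.exp(-u * u / 2) / math.sqrt(2 * math.Pi)

// Weighted average of the y values, weighted by K_h(x - x_i).
def nadarayaWatson(data: Seq[(Double, Double)], h: Double)(x: Double): Double = {
  val weights = data.map { case (xi, _) => gaussian((x - xi) / h) }
  val num = weights.zip(data).map { case (w, (_, yi)) => w * yi }.sum
  num / weights.sum
}

val sales = Seq((1.0, 500.0), (2.0, 520.0), (3.0, 610.0), (4.0, 590.0))
println(nadarayaWatson(sales, h = 1.0)(2.5)) // smoothed estimate between points
```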
LOESS regression The other non-parametric technique we employ is LOESS regression. Basically, as in kernel regression, an operation is carried out at each point: LOESS regression fits a low-degree polynomial to a subset of the data at every point in the dataset, where a subset of data is the neighbourhood of a point. The polynomial is fitted by weighted least squares, with the central points given more weight and the points further away on both sides given less weight; it is also a kind of data smoothing. The size of the subset is determined by a bandwidth, as in kernel regression. In comparison with other regression methods, LOESS regression does not require a specific global function for the model; it only applies a polynomial to each subset of the data. Moreover, its flexibility makes it one of the best choices for sophisticated data models. On the other hand, it is a very computationally expensive technique. put graph here
Calculation of seasonality
As mentioned before, in time-series models there is another component which varies the values: the seasonality. The sales of some products correlate strongly with seasons, months, weeks or even days. To adapt the model to this kind of product, we introduce seasonal adjustments. Seasonal adjustments are coefficients, calculated for each time period, with which the model can make its values more faithful to the real data. We use two approaches to compute these coefficients for each month of the year.
First, let us assume that there might be a relation between the previous month's sales quantity and the current month's sales quantity, illustrated by the following formula:
Sales_m = SA_m × Sales_{m−1}
where SA is the seasonal adjustment. We need to add a new column containing the previous month's sales quantity to the data table, as shown in Table 3.4, so that we can compute the coefficients for each month.
reference year month quantity quantity1
1312114 2014 11 534 556
1312114 2014 12 614 534
... ... ... ... ...
1312115 2014 11 317 289
Table 3.4: Prepared data: seasonal coefficients computation solution 1
In this case, the first month of each reference cannot be taken into account in the calculation of the seasonal coefficients, since it has no previous month's sales.
In the second solution, the hypothesis is that a relationship exists between the current month's sales quantity and the average of its 4 previous months' sales quantities. This gives:
Sales_m = SA_m × avg(Sales_{m−1}, Sales_{m−2}, Sales_{m−3}, Sales_{m−4}).
As with the previous solution, we need to transform the data table; after the transformation, we have a table as shown in Table 3.5:
reference year month quantity quantity1 quantity2 quantity3 quantity4
1312114 2014 11 534 556 515 478 533
1312114 2014 12 614 534 533 515 478
... ... ... ... ... ... ... ...
1312115 2014 11 317 289 313 337 329
Table 3.5: Prepared data: seasonal coefficients computation solution 2
Because the 4 previous months are needed, we cannot use the first 4 months' sales to compute the coefficients. A sketch of the coefficient computation follows.
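A hedged sketch of the coefficient computation under the first solution: every observation with a known previous month contributes the ratio quantity / quantity1, and the ratios are averaged per calendar month to give that month's coefficient SA_m. The MonthRow class and its values are illustrative.

```scala
// One row of Table 3.4: current and previous month's quantities.
case class MonthRow(reference: String, month: Int, quantity: Double, quantity1: Double)

val rows = Seq(
  MonthRow("1312114", 11, 534, 556), MonthRow("1312114", 12, 614, 534),
  MonthRow("1312115", 11, 317, 289))

// SA_m = average over all observations of month m of quantity / quantity1.
val seasonal: Map[Int, Double] = rows
  .groupBy(_.month)
  .map { case (m, rs) => m -> rs.map(r => r.quantity / r.quantity1).sum / rs.length }

println(seasonal) // e.g. the coefficient for month 11, averaged over two products
```

The second solution follows the same shape, dividing by the average of the four previous quantities instead.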
Clustering
We face a difficulty in terms of data: what we have is not enough to train one series of seasonal coefficients per product. For each product, the maximum historical sales we have covers one year and a half, and for some recently launched products we have less than one year of sales history. Since there are at most a couple of years of sales records for each product, seasonal adjustments calculated from only one product would overfit the data: in more detail, the model would consider that next year's sales variation will be exactly the same as this year's, which is quite rare in reality. To deal with this lack of information, we bring in a machine learning technique: clustering. The objective is to use clustering to gather products that have similar variations over time into the same subset, and then to compute one series of seasonal coefficients per subset rather than one per product.
Let us focus on the mechanism of clustering techniques. Clustering is an unsupervised learning method that organises a group of objects sharing similar characteristics. Imagine that we have a large set of data and we want to split it into subsets so that we can learn from each subset instead of from one huge set of data, since learning from a smaller set of data sometimes makes it easier to obtain the expected information. Figure 3.6 shows the partition of the data into clusters:
Figure 3.6: Clustering
Clustering, which aims to find structure within a given set of data, can be applied in this case. Some commonly used clustering models [6] are:
• centroid models: k-means;
• connectivity models: hierarchical clustering;
• distribution models: expectation-maximisation;
• graph theory models: the Highly Connected Subgraph (HCS) algorithm.
Among these techniques, we are interested in applying the k-means algorithm and hierarchical clustering.
K-means K-means is the most commonly used clustering algorithm and is quite easy to understand [7]. Assume that we have a dataset containing n objects; the objective of the k-means algorithm is to define k clusters, each containing objects with similar behaviours and characteristics. In each cluster there is a centroid (mean), represented by a point, from which the distances of the objects are calculated. The criterion for partitioning the objects into clusters is to minimise the within-cluster sum of distances:
min Σ_{j=1}^{k} Σ_{i=1}^{n_j} ||x_i^{(j)} − c_j||²
where n_j is the number of objects in cluster j, x_i^{(j)} is the i-th object in the j-th cluster and c_j is the centroid of the j-th cluster.
First of all, to initialise, the algorithm assigns a cluster to each object. There are two commonly used initialisation methods. The first is the random partition method, which randomly assigns a cluster to each object and then calculates the distances and hence the initial centroid (mean) of each cluster; it tends to place the initial centroids close to the centre of the dataset. The other is the Forgy method, which picks k observations at random from the dataset; in comparison with the random partition method, it spreads the initial centroids out over the dataset.
After initialisation, the algorithm goes through all the objects; for each object, it computes the distance between the object and every centroid and assigns the object to the nearest cluster. The distance measure used in this algorithm is the Euclidean distance:
distance = √(Σ_{i=1}^{n} (x_i − c_i)²)
where n is the number of dimensions.
Once the algorithm has gone through all the objects and finished the reassignment, it proceeds to the next step: the update. The aim of this step is to update the centroid of each cluster: since, during the reassignment, some objects enter and some leave each cluster, the centroid has to be recomputed from the new membership. The process repeats the reassignment step and the update step in a loop until the membership of each cluster no longer changes. This means each observation has found the cluster it belongs to, in which the other objects are similar to it. The MLlib sketch below illustrates the call.
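A sketch of clustering products by their sales profiles with MLlib's k-means follows; each vector stands for a product's normalised monthly sales pattern (shortened to 4 values here), and sc is assumed to be an existing SparkContext.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical normalised sales profiles, one vector per product.
val profiles = sc.parallelize(Seq(
  Vectors.dense(0.2, 0.3, 0.3, 0.2),
  Vectors.dense(0.1, 0.4, 0.4, 0.1),
  Vectors.dense(0.4, 0.1, 0.1, 0.4)))

val model = KMeans.train(profiles, 2, maxIterations = 20) // k = 2 clusters
profiles.collect().foreach(v => println(s"$v -> cluster ${model.predict(v)}"))
```

The seasonal coefficients can then be computed once per cluster instead of once per product.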
Hierarchical clustering Unlike the k-means algorithm, hierarchical clustering aims to set up a hierarchy of clusters, and it does not require predefined parameters. There are two ways (as shown in the dendrogram in Figure 3.7) to build the hierarchy of clusters: an agglomerative way and a divisive way. The agglomerative way starts from individual observations at the bottom and gradually aggregates observations into clusters, and clusters into larger clusters, until the top of the hierarchy is reached. The divisive way is the complete opposite, a "top-down" model: it starts from the top and splits gradually down to the individual observations.
Figure 3.7: Hierarchical clustering
The biggest disadvantage of hierarchical clustering is its high complexity. Generally, the agglomerative method has a complexity of O(n³) and the divisive method a complexity of O(2^n). Both are very expensive and thus require a lot of computation; when the number of observations is large enough, the agglomerative method is less computationally expensive than the divisive one.
In the agglomerative method, how do we decide which clusters to combine? To do this, we introduce a distance as a measure of the similarity between each pair of clusters. Different distance measures affect the shape of the clusters differently, because under different distance standards, whether some observations are closer to or further from each other changes. Commonly used distances include:
• Euclidean distance: √(Σ_i (x_{1i} − x_{2i})²);
• maximum distance: max_i |x_{1i} − x_{2i}|;
• Manhattan distance: Σ_i |x_{1i} − x_{2i}|.
Besides the distance, the agglomeration method is another factor determining how close two clusters (or observations) are. Some commonly used agglomeration methods are:
• complete linkage: max { dist(c1, c2) : c1 ∈ C1, c2 ∈ C2 };
• single linkage: min { dist(c1, c2) : c1 ∈ C1, c2 ∈ C2 };
• average linkage: (1 / (|C1| |C2|)) Σ_{c1 ∈ C1} Σ_{c2 ∈ C2} dist(c1, c2).
Once the hierarchy is constructed, we can choose the number of clusters that we wish, as shown in Figure 3.8.
Figure 3.8: Hierarchical clustering with determined clusters
Chapter 4
Results
Chapter 5
Analysis
Chapter 6
Conclusion
References
[1] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of
machine learning techniques for supply chain demand forecasting. European
Journal of Operational Research, 184(3):1140–1154, 2008.
[2] F. L. Chen and T. Y. Ou. Sales forecasting system based on Gray extreme
learning machine with Taguchi method in retail industry. Expert Systems
with Applications, 38(3):1336–1345, 2011.
[3] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-
scale matrix factorization with distributed stochastic gradient descent. In
Proceedings of the 17th ACM SIGKDD international conference on Knowledge
discovery and data mining - KDD ’11, pages 69–77, 2011.
[4] Michael D. Geurts and J. Patrick Kelly. Forecasting retail sales using alter-
native models. International Journal of Forecasting, 2(3):261–272, January
1986.
[5] A. Gunasekaran. Supply chain management: Theory and applications. European Journal of Operational Research, 159(2):265–268, 2004.
[6] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[7] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition
Letters, 31(8):651–666, 2010.
[8] Ken-ichi Kamijo and Tetsuji Tanigawa. Stock Price Pattern Recognition - A Recurrent Neural Network Approach. Pages 215–221.
[9] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.
[10] Chi-Jie Lu and Yuehjen E. Shao. Forecasting Computer Products Sales by
Integrating Ensemble Empirical Mode Decomposition and Extreme Learning
Machine. Mathematical Problems in Engineering, 2012:1–15, 2012.
[11] James T. Luxhøj, Jens O. Riis, and Brian Stensballe. A hybrid econometric-neural network modeling approach for sales forecasting. International Journal of Production Economics, 43(2-3):175–192, June 1996.
[12] Paris A. Mastorocostas, John B. Theocharis, and Vassilios S. Petridis. A
constrained orthogonal least-squares method for generating TSK fuzzy mod-
els: Application to short-term load forecasting. Fuzzy Sets and Systems,
118(2):215–233, March 2001.
[13] Sherri Rose. Big data and the future, 2012.
[14] Zhan-Li Sun, Tsan-Ming Choi, Kin-Fan Au, and Yong Yu. Sales forecasting
using extreme learning machine with applications in fashion retailing. Deci-
sion Support Systems, 46(1):411–419, 2008.
[15] The Apache Software Foundation. Apache Hadoop. Accessed 17/05/2015.
[16] Sébastien Thomassey and Antonio Fiordaliso. A hybrid sales forecasting system based on clustering and decision trees. Decision Support Systems, 42(1):408–421, 2006.
[17] Sébastien Thomassey and Michel Happiette. A neural clustering and classification system for sales forecasting of new apparel items. Applied Soft Computing Journal, 7(4):1177–1187, 2007.
[18] John T. Mentzer and Mark A. Moon. Time Series Forecasting Techniques. 2004.
[19] Mircea Răducu Trifu and Mihaela Laura Ivan. Big Data: present and future. Database Systems Journal, 5(1):32–41, May 2014.
[20] W. K. Wong and Z. X. Guo. A hybrid intelligent model for medium-term sales
forecasting in fashion retail supply chains using extreme learning machine and
harmony search algorithm. International Journal of Production Economics,
128(2):614–624, 2010.
[21] G. Peter Zhang. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50:159–175, January 2003.
[22] Tong Zhang. Solving large scale linear prediction problems using stochastic
gradient descent algorithms. In Proceedings of the twenty-first international
conference on Machine learning, volume 6, page 116, 2004.
01 deloitte predictive analytics analytics summit-09-30-14_092514
 
Node JS Training in Bangalore Classroom, Online myTectra
Node JS Training in Bangalore Classroom, Online myTectraNode JS Training in Bangalore Classroom, Online myTectra
Node JS Training in Bangalore Classroom, Online myTectra
 
Machine Learning with Classification & Regression Trees - APAC
Machine Learning with Classification & Regression Trees - APAC Machine Learning with Classification & Regression Trees - APAC
Machine Learning with Classification & Regression Trees - APAC
 

Viewers also liked

Machine learning ~ Forecasting
Machine learning ~ ForecastingMachine learning ~ Forecasting
Machine learning ~ ForecastingShaswat Mandhanya
 
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYCRonald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYCMLconf
 
Forecasting Slides
Forecasting SlidesForecasting Slides
Forecasting Slidesknksmart
 
Probability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning PerspectiveProbability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning Perspectivebutest
 
Practical Machine Learning with Prediction APIs
Practical Machine Learning with Prediction APIsPractical Machine Learning with Prediction APIs
Practical Machine Learning with Prediction APIsSalesforce Developers
 
Demand forecasting
Demand forecastingDemand forecasting
Demand forecastingjyyothees mv
 
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011G3 Communications
 
Semiconductor industry demand forecasting using custom models
Semiconductor industry demand forecasting using custom modelsSemiconductor industry demand forecasting using custom models
Semiconductor industry demand forecasting using custom modelsrrhm90
 
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...10x Nation
 
Data Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsData Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsEdureka!
 
Presentation Machine Learning
Presentation Machine LearningPresentation Machine Learning
Presentation Machine LearningPeriklis Gogas
 
Machine Learning and the Cloud
Machine Learning and the CloudMachine Learning and the Cloud
Machine Learning and the CloudAndrew Bogard
 
Probability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning PerspectiveProbability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning Perspectivebutest
 
J&J Thesis Presentation July 2016
J&J Thesis Presentation July 2016J&J Thesis Presentation July 2016
J&J Thesis Presentation July 2016Michalis Avgoulis
 
Scope of managerial economics
Scope of managerial economics Scope of managerial economics
Scope of managerial economics jyyothees mv
 

Viewers also liked (20)

Machine learning ~ Forecasting
Machine learning ~ ForecastingMachine learning ~ Forecasting
Machine learning ~ Forecasting
 
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYCRonald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
 
Forecasting Slides
Forecasting SlidesForecasting Slides
Forecasting Slides
 
Probability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning PerspectiveProbability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning Perspective
 
Practical Machine Learning with Prediction APIs
Practical Machine Learning with Prediction APIsPractical Machine Learning with Prediction APIs
Practical Machine Learning with Prediction APIs
 
Pycon 2012 Scikit-Learn
Pycon 2012 Scikit-LearnPycon 2012 Scikit-Learn
Pycon 2012 Scikit-Learn
 
Demand forecasting
Demand forecastingDemand forecasting
Demand forecasting
 
Forecasting (1)
Forecasting (1)Forecasting (1)
Forecasting (1)
 
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011
RSR's Brian Kilcourse Presents The State of Retail Demand Forecasting 2011
 
Demand forecasting
Demand forecastingDemand forecasting
Demand forecasting
 
Semiconductor industry demand forecasting using custom models
Semiconductor industry demand forecasting using custom modelsSemiconductor industry demand forecasting using custom models
Semiconductor industry demand forecasting using custom models
 
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...Expertise on Demand - How machine learning puts the best-of-the-best at your ...
Expertise on Demand - How machine learning puts the best-of-the-best at your ...
 
solomonaddai
solomonaddaisolomonaddai
solomonaddai
 
Data Science : Make Smarter Business Decisions
Data Science : Make Smarter Business DecisionsData Science : Make Smarter Business Decisions
Data Science : Make Smarter Business Decisions
 
Presentation Machine Learning
Presentation Machine LearningPresentation Machine Learning
Presentation Machine Learning
 
Machine Learning and the Cloud
Machine Learning and the CloudMachine Learning and the Cloud
Machine Learning and the Cloud
 
Probability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning PerspectiveProbability Forecasting - a Machine Learning Perspective
Probability Forecasting - a Machine Learning Perspective
 
J&J Thesis Presentation July 2016
J&J Thesis Presentation July 2016J&J Thesis Presentation July 2016
J&J Thesis Presentation July 2016
 
Scope of managerial economics
Scope of managerial economics Scope of managerial economics
Scope of managerial economics
 
SALES FORECASTING METHOD
SALES FORECASTING METHODSALES FORECASTING METHOD
SALES FORECASTING METHOD
 

Similar to thesis_jinxing_lin

Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine LearningLynn Langit
 
Smart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportSmart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportALi Baker
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESEMERSON EDUARDO RODRIGUES
 
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptxGEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptxGeetha982072
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsAntonio Severien
 
Synergy Platform Whitepaper alpha
Synergy Platform Whitepaper alphaSynergy Platform Whitepaper alpha
Synergy Platform Whitepaper alphaYousef Fadila
 
Synergy on the Blockchain! whitepaper
Synergy on the Blockchain!  whitepaperSynergy on the Blockchain!  whitepaper
Synergy on the Blockchain! whitepaperYousef Fadila
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesShantanu Deshpande
 
M sc bi thesis rafael garcia navarro summary
M sc bi thesis rafael garcia navarro summaryM sc bi thesis rafael garcia navarro summary
M sc bi thesis rafael garcia navarro summaryRafael Garcia-Navarro
 
AIIM White Paper: Case Management and Smart Applications
AIIM White Paper: Case Management and Smart ApplicationsAIIM White Paper: Case Management and Smart Applications
AIIM White Paper: Case Management and Smart ApplicationsSwiss Post Solutions
 
Ubiwhere Research and Innovation Profile
Ubiwhere Research and Innovation ProfileUbiwhere Research and Innovation Profile
Ubiwhere Research and Innovation ProfileUbiwhere
 
Marketing Analytics using R/Python
Marketing Analytics using R/PythonMarketing Analytics using R/Python
Marketing Analytics using R/PythonSagar Singh
 
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT Control
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT ControlCloud Integration for Hybrid IT: Balancing Business Self-Service and IT Control
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT ControlAshwin V.
 
Digital Twin: A Complete Knowledge Guide
Digital Twin: A Complete Knowledge GuideDigital Twin: A Complete Knowledge Guide
Digital Twin: A Complete Knowledge Guideferiuyolasyolas
 
Best new technology introduced over the last 12 months - Trading & Risk
Best new technology introduced over the last 12 months - Trading & Risk Best new technology introduced over the last 12 months - Trading & Risk
Best new technology introduced over the last 12 months - Trading & Risk CompatibL Technologies ltd
 
BIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoBIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoNay Linn Ko
 
Documentation on bigmarket copy
Documentation on bigmarket   copyDocumentation on bigmarket   copy
Documentation on bigmarket copyswamypotharaveni
 

Similar to thesis_jinxing_lin (20)

Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine Learning
 
Smart Traffic Monitoring System Report
Smart Traffic Monitoring System ReportSmart Traffic Monitoring System Report
Smart Traffic Monitoring System Report
 
Thesis
ThesisThesis
Thesis
 
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUESpredictive maintenance digital twin EMERSON EDUARDO RODRIGUES
predictive maintenance digital twin EMERSON EDUARDO RODRIGUES
 
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptxGEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
GEETHAhshansbbsbsbhshnsnsn_INTERNSHIP.pptx
 
Mrd template
Mrd templateMrd template
Mrd template
 
Scalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
 
Synergy Platform Whitepaper alpha
Synergy Platform Whitepaper alphaSynergy Platform Whitepaper alpha
Synergy Platform Whitepaper alpha
 
Synergy on the Blockchain! whitepaper
Synergy on the Blockchain!  whitepaperSynergy on the Blockchain!  whitepaper
Synergy on the Blockchain! whitepaper
 
Corporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniquesCorporate bankruptcy prediction using Deep learning techniques
Corporate bankruptcy prediction using Deep learning techniques
 
Technovision
TechnovisionTechnovision
Technovision
 
M sc bi thesis rafael garcia navarro summary
M sc bi thesis rafael garcia navarro summaryM sc bi thesis rafael garcia navarro summary
M sc bi thesis rafael garcia navarro summary
 
AIIM White Paper: Case Management and Smart Applications
AIIM White Paper: Case Management and Smart ApplicationsAIIM White Paper: Case Management and Smart Applications
AIIM White Paper: Case Management and Smart Applications
 
Ubiwhere Research and Innovation Profile
Ubiwhere Research and Innovation ProfileUbiwhere Research and Innovation Profile
Ubiwhere Research and Innovation Profile
 
Marketing Analytics using R/Python
Marketing Analytics using R/PythonMarketing Analytics using R/Python
Marketing Analytics using R/Python
 
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT Control
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT ControlCloud Integration for Hybrid IT: Balancing Business Self-Service and IT Control
Cloud Integration for Hybrid IT: Balancing Business Self-Service and IT Control
 
Digital Twin: A Complete Knowledge Guide
Digital Twin: A Complete Knowledge GuideDigital Twin: A Complete Knowledge Guide
Digital Twin: A Complete Knowledge Guide
 
Best new technology introduced over the last 12 months - Trading & Risk
Best new technology introduced over the last 12 months - Trading & Risk Best new technology introduced over the last 12 months - Trading & Risk
Best new technology introduced over the last 12 months - Trading & Risk
 
BIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn KoBIT (Building Material Retail Online Store) Project Nay Linn Ko
BIT (Building Material Retail Online Store) Project Nay Linn Ko
 
Documentation on bigmarket copy
Documentation on bigmarket   copyDocumentation on bigmarket   copy
Documentation on bigmarket copy
 

thesis_jinxing_lin

unnecessary cost and provide satisfactory customer service. Meanwhile, sales forecasting is affected by many factors such as the lifespan of products, the economic climate, competition and globalisation. Over the last twenty years, a great deal of research aiming to improve sales forecasting has been carried out, and most of it is based on machine learning techniques.

First of all, let us focus on what machine learning (ML) is. In recent years, ML has been one of the hottest topics in computer science. Even non-technical people cannot have escaped the articles, headlines, videos and TV programmes on the rise of big data [13, 19] and machine learning. Briefly introduced, big data refers to datasets which fit the following "3V":

• Volume: The volume of data is so important that it determines the value of datasets. As a matter of fact, traditional data treatments based on only one machine are no longer suitable for the increasing volume of data. Thus, new scalable data treatments are developed to fit this growth.

• Variety: The variety of data means different forms of data in this context. Big data can be data stored in structured databases, text, images, vocal messages or videos, and it can be collected from various sources such as the internet, mobile phones, personal PCs and wearable devices.
• Velocity: Big data is data which is generated very rapidly. For example, during each trading session the New York Stock Exchange obtains 1 TB of trade information.

The volume of global data is increasing dramatically, which brings more and more challenges in finding suitable technologies to support this extreme growth. As shown in Figure 1.1 below, in five years the volume of data will be ten times what it was two years ago. Therefore, the search for appropriate supporting technologies is becoming urgent. A suitable technology should be able to store large volumes of data (several TB); it should be able to compute in parallel, which significantly accelerates the computation and provides high performance; in addition, support for data streaming would be an attractive point. The two most used platforms designed for big data are Hadoop and Spark. These two platforms use different high-performance computing techniques to distribute data and jobs. A brief presentation of these two technologies is included in the following chapter.

Figure 1.1: Growth of global data (Source: http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm)

The use of big data is expanding rapidly and it is getting into every aspect of our lives. For example, the police predict when and where crimes will happen based on big data; health centres use big data to predict the outbreak of diseases; travel agencies use big data to understand customers' preferences and elaborate attractive travel
plans; accurate stock forecasting is becoming possible with big data. Our lives are now surrounded by big data and, in order to make sense of it and understand its hidden values, we use machine learning algorithms to explore it.

Machine learning brings computer science and statistics together in order to learn from the data and find suitable patterns or models for it. With these patterns or models, it is possible to project into the future and therefore predict. Globally, machine learning techniques can be categorised by their purpose:

• classification: training a model for correctly assigning observations to their classes, such as classifying whether a customer is loyal or not;

• regression: training a model for predicting a continuous output, such as stock forecasting;

• clustering: without knowing the groups beforehand, training a model for splitting input observations into groups, such as in image recognition.

Nowadays, ML is widely used by all kinds of companies. Google uses ML in its search engine; Facebook uses ML to recommend friends or advertisements to us; Tesco uses ML to distribute coupons to customers; weather services use ML to forecast the coming days' weather.

Having discussed big data and machine learning algorithms, this thesis is of course about discovering and applying machine learning techniques. There are many domains in which machine learning can be used and, among all of these, we find it interesting to investigate how machine learning can perform in sales forecasting, which is a highly focused subject in business analysis. The aims of this thesis are:

• studying machine learning algorithms and understanding their mechanisms;

• analysing the given dataset and elaborating a plan of how the data should be used;

• looking for pertinent patterns in the dataset and making sales forecasts.

To achieve these objectives, we will first study different machine learning algorithms in depth, choose some interesting ones which might suit the dataset, employ them, customise these techniques with regard to the data and then compare their performance.
The following chapter is the literature review, which mainly covers machine learning techniques that have been applied to sales forecasting and some commonly used technologies for machine learning. In Chapter 3, we describe where the data comes from, how we pre-process it, which ML algorithms we apply and why we choose them. After the methodology chapter, the results of the different algorithms are shown with diagrams and charts in Chapter 4. The penultimate chapter contains the analysis of the results, that is, the comparison of the performance of the different techniques. The last chapter concludes the research, summarising the overall process and the future directions for continuing it.
Chapter 2

Literature review

2.1 Machine Learning Techniques for sales forecasting

The exponential smoothing model [4] is one of the earliest models applied to sales forecasting, as is the AutoRegressive Integrated Moving Average (ARIMA) model [21]. Forecasting models such as the Neural Network (NN) model [21] and the fuzzy model [12] have also often been applied to sales forecasting. As the volume of data grows rapidly and the demand for accuracy gets higher, newer machine learning techniques are being used for sales forecasting on different types of data. For example, clustering [16] and decision trees [16] have been used to develop a sales forecasting system for textile-apparel distribution; to forecast the sales of a new apparel item, neural clustering and classification [17] have been applied. In recent years, the Extreme Learning Machine (ELM) has frequently been applied to sales forecasting in combination with other techniques.

The following sub-sections present some interesting techniques that will probably be applied or tested during this research.

2.1.1 Regression Trees

Before introducing regression trees, it is important to clarify that a decision tree is a predictive model which can be used to evaluate the value of a certain feature of a system from observations of its other features. A regression tree is a form of decision tree specific to numeric data. Based on the given data, it trains a tree model whose leaves represent groups of instances in the same class and whose branches represent the separation of instances into different groups (Figure 2.1). It is a supervised learning method that trains its model with known
inputs and known outputs. Once the regression tree is constructed, we can input new data and the tree will output a prediction for it. Regression trees have proved to be an efficient tool for sales forecasting in textile distribution [16].

Figure 2.1: Regression tree

2.1.2 Support vector machines

Support vector machines are machine learning models used for regression and classification. The principle of this model is to create one or a set of hyperplanes between different classes; the requirement on the hyperplanes is to maximise the margins between the classes so as to obtain the lowest classifier error. An example is shown in Figure 2.2. In geometry, a hyperplane is a subspace of one dimension less than its ambient space. Often, it is not feasible to separate the data in the original dimension; in that case, a projection of the data onto a higher dimension is applied. In the paper [1], a hybrid system of RNN and SVM was applied to forecast sales, and the results showed that the hybrid system outperformed simple traditional forecasting techniques such as the moving average.
Figure 2.2: SVM with hyperplane (Source: http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html)

2.1.3 Stochastic gradient descent

Sometimes an optimisation of the training model is very useful to make it predict more efficiently. An often used optimisation algorithm is gradient descent (GD) [9]. It aims to find the minimum of a function. To do this, it takes steps following the opposite direction of the gradient at each point; here, the gradient represents the direction in which the function increases fastest. A training model, for example a linear regression model, has a cost function, and the gradient descent method is applied to train the model by minimising this cost function.

The gradient descent method can be roughly divided into two variants: batch gradient descent (BGD) and stochastic gradient descent (SGD) [3,22]. BGD runs many iterations and, in each iteration, uses all the data to calculate the gradient. By contrast, in each iteration SGD picks only one random example to estimate the gradient. As the training data size in each iteration is smaller in SGD than in BGD, and since the SGD algorithm does not need to remember the examples which have been previously studied, SGD is often considered the more efficient method. In the research [?], Dr. Bottou studied the application of SGD to several machine learning systems such as K-Means, SVM and Lasso, and showed that they performed better with SGD.
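To make the difference concrete, the following is a minimal sketch (in Scala, our main language later in this project) of SGD for a one-feature linear model with squared error; the data, learning rate and step count are purely illustrative:

    import scala.util.Random

    object SgdSketch {
      def main(args: Array[String]): Unit = {
        val data = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)) // (x, y) pairs
        var (w, b) = (0.0, 0.0)                     // model: y ≈ w * x + b
        val lr = 0.01                               // learning rate (step size)
        val rng = new Random(42)
        for (_ <- 1 to 10000) {
          val (x, y) = data(rng.nextInt(data.size)) // SGD: one random example per step
          val err = (w * x + b) - y                 // prediction error on that example
          w -= lr * err * x                         // step against the gradient of 0.5 * err^2
          b -= lr * err
        }
        println(s"w = $w, b = $b")                  // roughly w ≈ 1.94, b ≈ 0.15
      }
    }

Replacing the line that picks one random example with a loop over the whole dataset (averaging the gradients before each update) would turn this sketch into BGD.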
2.1.4 Neural networks

There are various types of neural networks, and the one that has been applied most is the feed-forward, error back-propagation neural network [11,21]. There are at least two layers in a neural network: the input layer and the output layer. One or more hidden layers can be inserted between the input and output layers, and each layer contains its elements. As shown in Figure 2.3, the data is fed into the input layer, passes through the hidden layers where the training algorithms are applied, and is finally sent to the output layer. The information only travels forward in this type of neural network. The network adjusts its weights using the feedback obtained by comparing the produced values with the correct answers.

Figure 2.3: Artificial neural network

2.1.5 Recurrent neural networks

Recurrent neural networks (RNN) are developed on the basis of neural networks. The main difference between them is that an RNN allows some outputs of neurons to go back and become inputs of other neurons. This means that the information does not flow in only one direction and can be processed several times before being output, as indicated in Figure 2.4. Owing to this fact, RNNs can perform better than one-directional neural networks, for example in stock price forecasting [1,8].

2.1.6 Extreme learning machine

The Extreme Learning Machine (ELM) has been applied to fashion supply chains for sales forecasting [14,20]. The extreme learning machine is an algorithm for single-hidden-layer feedforward neural networks (SLFN). In this algorithm, the input
Figure 2.4: Recurrent neural network

weights and the hidden biases are randomly determined; the output weights, on the other hand, are determined analytically by ELM using the Moore-Penrose (MP) generalized inverse. In comparison with traditional gradient-based algorithms, ELM is faster and more effective. In addition, it can avoid some problems faced by gradient learning algorithms, such as stopping criteria, learning rates and learning epochs. The experimental results in [14] demonstrate that the performance of the ELM model is superior to some sales forecasting algorithms based on back-propagation neural networks.
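As a sketch of the standard formulation (the notation is ours, not reproduced from the cited papers): for N training samples (x_i, t_i) and L hidden neurons with activation function g, the SLFN output is

    f(x_i) = \sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j).

Writing H_{ij} = g(w_j \cdot x_i + b_j) and stacking the targets into T, training reduces to the linear system H\beta = T. ELM draws the input weights w_j and biases b_j at random and solves \beta = H^{\dagger} T in a single step, where H^{\dagger} is the Moore-Penrose generalized inverse of H; this one-shot solution is what replaces the iterative gradient-based training.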
In some studies, ELM is combined with other machine learning techniques to build forecasting models [2,10,20].

ELM and harmony search algorithm

In the research [20], the Harmony Search (HS) algorithm was applied in combination with ELM for sales forecasting in fashion retail supply chains. The harmony search algorithm can be integrated with ELM to construct a novel meta-heuristic optimisation algorithm with which we can obtain optimal NN weights and achieve high forecasting performance. According to [20], this hybrid intelligent system significantly outperforms traditional ARIMA models as well as two other developed neural network models for fashion sales forecasting.

ELM and ensemble empirical mode decomposition

Empirical mode decomposition (EMD) is a signal processing technique based on the local characteristic time scales of a signal; it is usually applied to decompose a signal into a finite and small number of components called intrinsic mode functions (IMF). To avoid the major problem of EMD, the mode mixing problem, ensemble empirical mode decomposition (EEMD) was developed. EEMD combines the EMD method with a noise-assisted data analysis method and is designed to alleviate the mode mixing problem. With EEMD, the original sales data can be converted into IMFs, which are then input into the ELM method to forecast the sales of computer products [10]. Compared with a single ELM, a single support vector regression (SVR) and a single back-propagation neural network (BPN), the EEMD-ELM model performs better in forecasting the sales of computer products.

ELM and Gray relation analysis

Gray relation analysis (GRA) can be integrated with ELM to set up a hybrid sales forecasting system (GELM). GRA measures the relative distance between compared series of data and a reference series of data, which is called the Gray relation grade (GRG). The ranking of the GRGs can show which factors affect the sales amounts the most, and these influential factors are used as input variables of the ELM models. Experimental results show that the GELM hybrid system outperforms BPN and MFLN models [2].

2.2 Supported technologies

Nowadays, as we are in the age of big data, more and more technologies designed for big data have come to light. In this section, we present some of the technologies which have been used the most recently and which we might use in this thesis project.

2.2.1 Apache Hadoop

Apache Hadoop [15] is a platform designed for distributed storage and distributed processing, dedicated to large-volume datasets. Basically, Hadoop takes data files as input, decomposes the data into large blocks and distributes them to all the nodes in the cluster. According to the different data received by each node, the nodes receive packaged code and then execute the computation in parallel. This largely improves efficiency in comparison with computing all the data on only one node. There are four main components in Hadoop:

• Hadoop Common: contains libraries and supports the other Hadoop modules;
• Hadoop Distributed File System (HDFS): provides high-throughput access and performs best with large files;

• Hadoop Yet Another Resource Negotiator (YARN): a distributed resource scheduler;

• Hadoop MapReduce: a distributed processing framework which decomposes work into small parallelised map and reduce tasks.

2.2.2 Apache Spark

Apache Spark is a cluster computing platform built for processing large datasets. Compared to one of its competitors, Hadoop, Spark extends the MapReduce model (used by Hadoop) to support more types of computation. The MapReduce model spills data to disk, while data in Apache Spark is kept in buffer cache (memory); because of this, Apache Spark outperforms Apache Hadoop in most cases in terms of speed.

Basically, Apache Spark consists of two principal parts: a management system and a distributed storage system. The management system can be Spark standalone, Spark pseudo-distributed mode, Hadoop YARN or Apache Mesos. Likewise, for the distributed storage system there are many choices, such as HDFS, Cassandra and Amazon S3.

The platform is written in Scala and is compatible with several languages: Scala, Java, Python and SQL. This high compatibility with the main data processing languages has helped Apache Spark become more and more popular. The main components of Apache Spark are (a minimal usage sketch follows this list):

• Spark Core, which handles task scheduling, memory management and fault recovery, and provides the API to create and manipulate Resilient Distributed Datasets (RDDs, collections of objects distributed across computation nodes so that they can be computed on in parallel);

• Spark SQL, which supports a new data abstraction for structured data: SchemaRDD;

• Spark Streaming, which allows manipulating data streams;

• MLlib, a Machine Learning (ML) library which provides some common machine learning algorithms;

• GraphX, which is used to perform distributed graph computations.
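As a minimal, illustrative sketch of this programming model (the application name and data are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))
        val sales = sc.parallelize(Seq(34, 26, 15, 40)) // an RDD of daily quantities
        val doubled = sales.map(_ * 2)                  // a transformation, evaluated lazily
        println(doubled.reduce(_ + _))                  // an action, computed across the nodes
        sc.stop()
      }
    }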
Since this thesis is mostly about running machine learning algorithms on large volumes of data, Apache Spark is chosen as the cluster computing platform to which we submit our computing tasks.

2.2.3 Scala

Scala is a programming language that compiles to the JVM and supports multiple paradigms at the same time: object-oriented programming and functional programming. As a fusion of the two programming concepts, Scala allows building structures and elements that treat computation as the evaluation of mathematical functions, while also allowing large systems to be built with component abstraction and a legible structure. Moreover, it is open-source, which makes it easier to adopt. As Apache Spark is compatible with Scala, and Scala is less verbose than Java and Python, we chose Scala as our main programming language.

2.2.4 R

Spark is an enterprise-level development tool and, as mentioned before, it is very efficient for applying the basic machine learning algorithms provided in MLlib. But for deep learning algorithms it may not be a very smart choice. Nowadays, "GPU + CUDA" is becoming the main architecture for data scientists to build and run their deep learning programs, thanks to the high performance of the way GPUs process tasks. On the other hand, Spark is not a good choice for building customised machine learning algorithms either, since it does not contain a matrix data structure and most machine learning techniques make heavy use of matrices.

In this project, some deep learning techniques such as neural networks and the extreme learning machine are expected, and building customised algorithms is indispensable, so we need another tool to perform these techniques and construct our own algorithms. After some research on the different alternative tools, we finally decided to choose R. R is a programming language which provides a numerical computing environment (supporting matrix arithmetic) and supports performing computations on CUDA GPUs, which can possibly speed up the computations. Moreover, thanks to a recently released package, SparkR, it is possible to scale R programs in Spark (in a distributed fashion).
Chapter 3

Methodology

3.1 Data

3.1.1 Data source

To examine the effectiveness of different machine learning algorithms in sales forecasting, a French surgical equipment company, Didactic, provided a dataset. This dataset consists of the sales of surgical equipment in the period from October 2013 to March 2015 and includes 758 different products. Each product has its own reference, its daily sales quantity for each client and its price. In addition, there is also a dataset of calls to tender for surgical equipment on the market. The aim is to build a sales forecasting model with the information provided in these two datasets.

3.1.2 Data preparation

First of all, the program reads the data from a file which contains all the sales records of Didactic. Since the data in the original file is separated by product, by order and by day, and what we want as a result is the total quantity of daily sales of each product, summing all the sales quantities per product per day is essential. Table 3.1 shows an example of the data after the daily sum.

According to marketing experts at Didactic, seasonality might be an important factor affecting the sales; that is, the sales might vary within a certain range according to the month of the year. Following this marketing assumption, we also analyse the sales changes month by month. Therefore, a sum of the sales quantities per product per month is needed; Table 3.2 shows an example of the data after this treatment.
reference  year  month  day  quantity
1312114    2014  11     13   34
1312114    2014  11     14   26
...        ...   ...    ...  ...
1312115    2014  11     13   15

Table 3.1: Prepared data: daily sum

reference  year  month  quantity
1312114    2014  11     543
1312114    2014  12     614
...        ...   ...    ...
1312115    2014  11     317

Table 3.2: Prepared data: monthly sum

Based on the assumption of seasonality, we suppose that there is a relation between the average quantity of the four previous months' sales and the quantity of the following month's sales. Hence, a computation of the average quantity of the four previous months' sales is required. After another data preparation process, we get data as follows (Table 3.3):

reference  year  month  quantity  avg4premonths
1312114    2014  11     543       517
1312114    2014  12     614       627
...        ...   ...    ...       ...
1312115    2014  11     317       226

Table 3.3: Prepared data: monthly sum and average of the four previous months' sales
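A hedged sketch of these preparation steps with Spark's Scala API; the file path and the field layout of the raw file are assumptions made for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object PrepareData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("prepare-data"))
        // Assumed CSV layout: reference,year,month,day,quantity
        val records = sc.textFile("sales.csv").map(_.split(",")).map { f =>
          ((f(0), f(1).toInt, f(2).toInt, f(3).toInt), f(4).toInt)
        }
        val daily = records.reduceByKey(_ + _)  // Table 3.1: daily sum per product
        val monthly = daily                     // Table 3.2: drop the day from the key
          .map { case ((ref, y, m, _), q) => ((ref, y, m), q) }
          .reduceByKey(_ + _)
        monthly.saveAsTextFile("monthly-sums")
        sc.stop()
      }
    }

The avg4premonths column of Table 3.3 can be derived from monthly in the same style, by joining each (reference, year, month) key with its four predecessors.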
3.2 Random forest with regression tree

3.2.1 Regression tree

As mentioned in the literature review, regression trees are methods commonly used to train models given predictor variables x and a continuous response y. With these models, we can predict the value of y for a new given value of x.

In essence, the regression tree algorithm analyses the inputs and creates the tree from the top (root node) to the bottom (leaves). The construction of a tree is a recursive partition. Each node works like a filter: the filter is a question about a particular feature, and it routes the input instances into sub-nodes or leaves. These conditions are based on the features of the instances, for example "Is price > £20" or "Is lifespan < 5 months". The leaves are the final groups of instances which lead to similar outputs. A point x belongs to a leaf if x is assigned to the corresponding cell of the partition. Trees grow exponentially with their depth. Once the partition of the instances is done, the average of the target values of the instances in each leaf is calculated, and this average becomes the prediction value for all the instances falling into that group. Then, when we input a new instance of features into the model, it goes down the tree until it reaches one of the leaves, and it takes the prediction target value assigned to that leaf.

One of the most important advantages of the regression tree is its interpretability. As each node contains a question about a certain feature, the whole model is very simple to understand, even by people without a machine learning background. It can be seen as a tree containing many conditions: if a condition is satisfied, the instance goes to one side; if it is not, it follows the other branch. No matter how big the tree is, it remains very understandable. Once the model is built, predictions are fast because there is no complex calculation to execute.

Implementation

As mentioned in the literature review, Spark provides a library of machine learning algorithms which includes regression tree algorithms. We apply the regression tree technique to two main data models. The first model is based on the hypothesis that the future sales depend only on the year and the month. We in fact treat it like a time-series model. In a time-series model, time (or the date in this case) plays the key role, and other factors such as price changes are secondary. A secondary factor can also affect the sales, but only over a small portion, and this small portion is considered as noise. More details about the time-series model are presented in the next section. In summary, in this first model we have year and month as features and quantity as the target. The regression tree algorithm analyses the inputs and builds a tree like the example in Figure 3.1:
Figure 3.1: Regression tree with year-month model

In this graph, we can observe that year and month are used as features. In each node there is a question about one of these two features, and it splits the input instances into sub-partitions. At the bottom, we obtain groups of instances which we call leaves. A quantity value is assigned to each leaf, and this value is the average of the quantities of the instances that fall into that leaf. After the construction of the regression tree, if a new instance of features comes in for sales forecasting, for example (year = 2016, month = 2), it goes from the top to the bottom of the tree and falls into a leaf; it then takes the quantity of that leaf as its predicted target value.

Compared with the first model, the only difference in the second model is that it contains one more feature, qty4 (the average of the 4 previous months' sales). It trains the regression tree and predicts the target value of new instances in exactly the same way.
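A hedged sketch of how this first model maps onto MLlib's regression tree API, reusing the monthly pairs from the preparation sketch; the parameter values are illustrative placeholders, not the tuned settings reported below:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    // monthly: RDD of ((reference, year, month), quantity), as prepared earlier
    val training = monthly.map { case ((_, year, month), qty) =>
      LabeledPoint(qty.toDouble, Vectors.dense(year.toDouble, month.toDouble))
    }
    val model = DecisionTree.trainRegressor(
      training,
      categoricalFeaturesInfo = Map[Int, Int](), // both features treated as continuous
      impurity = "variance",                     // variance is MLlib's regression impurity
      maxDepth = 5,
      maxBins = 32)
    val forecast = model.predict(Vectors.dense(2016.0, 2.0)) // (year = 2016, month = 2)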
Results

TODO: put tuning algorithms' results

3.2.2 Random forest

Overfitting the training dataset is a problem that often happens to regression trees. Owing to this fact, the random forest is introduced to correct it. The random forest method was proposed and developed by Leo Breiman in 2001 [?]. It represents a family of ensemble learning methods for classification and regression. These methods use exclusively decision trees as classifiers and output either the class voted by the trees or the mean of the individual trees' predictions. During the construction of the decision trees, we introduce a random factor with the use of Bagging and Random Feature Selection. In random forests, all the decision trees operate independently. Each decision tree is built based on a random vector of parameters; for example, the kth tree in the forest depends only on the vector θk and is independent of the other vectors. All the trees participate in the final decision. A graphical presentation of the random forest structure is given in Figure 3.2.

Figure 3.2: Random Forest

Bagging

Bagging is a method that selects a subset of the training data for the construction of each decision tree in the random forest. These sub-datasets are called bootstraps. Bagging repeatedly and randomly picks a sample with replacement from the training dataset.

TODO: put a figure to demonstrate Bagging

Random Feature Selection

Random Feature Selection focuses on the features of the data. It randomly selects a fixed number k of features and then, among these k chosen features, selects the one which optimises the partition of the data. The way to define the optimal feature is to compare the impurity of the data assigned to the sub-nodes.

TODO: put a figure to demonstrate RFS
All in all, bagging is applied to construct a random sub-dataset for each decision tree, while random feature selection is used to select the optimal feature among a sub-group of features during the learning process of each tree. Both methods lead to a better machine learning model which does not overfit the input dataset.

Implementation

The random forest algorithm is provided by MLlib, and the way to call it is quite similar to calling the regression tree. To get the best parameters for our dataset, we tune the algorithm with different values for each parameter, such as the number of trees in the forest, the maximum possible depth of a tree and the maximum possible number of intervals of a feature.
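A hedged sketch of the corresponding MLlib call, on the same training RDD as in the regression tree sketch; the values below only stand in for the grid explored during tuning:

    import org.apache.spark.mllib.tree.RandomForest

    val forest = RandomForest.trainRegressor(
      training,                          // RDD[LabeledPoint], as built earlier
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 50,                     // number of trees in the forest
      featureSubsetStrategy = "auto",    // random feature selection at each node
      impurity = "variance",
      maxDepth = 8,                      // maximum possible depth of a tree
      maxBins = 32,                      // maximum number of intervals per feature
      seed = 42)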
Results

TODO: put tuning results here

3.3 Time-series forecasting

After directly applying some existing machine learning algorithms to the data, we do get results which are not bad. But in order to get more control over the learning methods, we decided to build algorithms suited to this use case and this dataset. Since the main data we have is related to dates, and the marketing experts told us that seasonality might affect the sales, we decided to build a time-series model. A time-series technique looks only at the patterns in the history of actual sales and, based on these patterns, predicts the future sales [18].

3.3.1 Main components in time-series model

There are four main components we have to take into account when trying to set up a time-series model:

• level: a horizontal sales history;

• trend: a pattern that represents a continuous increase or decrease in the sales;

• seasonality: a pattern that represents how the sales repeatedly increase and decrease within a certain period (e.g. one year);

• noise: a random fluctuation which might be explained by features other than time, such as price changes or the quality of customer service.

Figure 3.3: Time-series components (Source: [18])

3.3.2 Implementation

Visualisation of data

In order to get better insight into the data, we decided to visualise it in the form of line charts. The first line chart shows the monthly total sales of all the products of Didactic. We want a global view of how the sales vary over time; with this view, we can roughly see whether there is a trend or seasonality in the company's product sales. To create this line chart, we go through each month and sum up all the sales quantities. We then display these sums against their corresponding dates so that we can see how the sales evolve with time.

TODO: put graphs here

After this global view, we also need some line charts which provide more detail. In fact, time-series models are usually constructed for a product or a family of similar products. For this reason, we decided to draw a quantity-date line plot for each of the 20 products that have been sold for the longest time. We take these top 20 products because they have the longest sales history, which gives us a clearer view of how the sales evolve with time.

These plots show the evolution of daily sales for the different products, so they often have sudden peaks or drops (short-term fluctuations). To decrease the unwanted effect of these short-term fluctuations, we apply the moving average technique. The moving average is used to smooth out short-term fluctuations and emphasise long-term trends or periodic cycles. Essentially, for each data point it creates a small subset containing the data around that point and then calculates the average of the data in this subset; these averages become the new values of the points.
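A minimal sketch of such a centred moving average in plain Scala; the window radius is an illustrative parameter, and the window is simply truncated at the ends of the series:

    // Replace each point by the mean of the points within `radius` positions of it.
    def movingAverage(series: Vector[Double], radius: Int): Vector[Double] =
      series.indices.map { i =>
        val window = series.slice(math.max(0, i - radius),
                                  math.min(series.length, i + radius + 1))
        window.sum / window.size
      }.toVector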
TODO: put graphs here

Proposed models

There are many possible approaches to constructing a model for a time-series problem. We propose here two different models consisting of trends and seasonal coefficients:

• multiplicative model;

• additive model.

Multiplicative model

This model is based on the assumption that target values can be predicted by multiplying values on a trend by the relevant seasonal coefficients. Figure 3.4 and the following formula describe the multiplicative model:

Sales_t = Trend_t * SA_t + Noise_t

where SA is the seasonal adjustment and t is the time (day/week/month, etc.). In this model, we consider that a proportional relationship exists between the target values and the values on the trend over time. Once we have calculated the seasonal coefficients and the trend, we can project both into the future and forecast a target value for a given input.

Figure 3.4: Multiplicative model

Additive model

Apart from the multiplicative model, we employ an additive model as well. The difference is that, in this model, we assume that a target value can be calculated as the addition of a value on the trend and a relevant residue. For each observation there is a corresponding residue, and there is a relationship between these residues. Since the residues are related to each other, the objective is to compute the seasonal coefficients based on the residues of each observation, predict the future residues and therefore obtain the future target values. As shown in Figure 3.5 and the formula

Sales_t = Trend_t + Residue_t + Noise_t,  where Residue_t = Residue_{t-1} * SA_t

with SA the seasonal adjustment and t the time (day/week/month, etc.), the sales quantity is the addition of the trend and the residues, and each residue is calculated from its previous residue.

Figure 3.5: Additive model

In comparison with the multiplicative model, this model has a shortcoming: it may propagate errors if we want to predict values for more than one point in the future. Since each residue is computed from its previous one, an error anywhere will propagate through the process; therefore, the further ahead the point we want to predict, the bigger the error might be.
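As a small sketch of how the multiplicative model produces a forecast once its components are known (trendFn and saFn stand for the trend and the seasonal coefficients computed in the following subsections; both names are ours):

    // Multiplicative forecast: the value on the trend scaled by the month's coefficient.
    def forecastMultiplicative(trendFn: Double => Double, // fitted trend: t -> level
                               saFn: Int => Double,       // seasonal coefficient per month (1-12)
                               t: Double, month: Int): Double =
      trendFn(t) * saFn(month)
    // e.g. forecastMultiplicative(trendFn, saFn, 2016.0 + 2.0 / 12, 2) for February 2016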
Calculation of trend

The trend of a product is the pattern showing gradually how the sales increase or decrease. It can be either a line fitting a linear function or a curve. If we have a suitable pattern for a product, we know roughly in which direction the product's growth is going and how fast. We mainly apply three regression methods to look for the trend:
• linear regression;

• kernel regression;

• LOcal regrESSion (LOESS regression).

Among these three regression techniques, only linear regression is a parametric method; the other two are non-parametric.

Linear regression

Y = α * X + β is a very simple formula that everyone is familiar with, and linear regression is an approach for defining the relationship between a target variable Y and one or several explanatory variables X based
on this simple formula. The aim is to calculate the values of α and β which lead to the smallest difference between the expected values and the calculated values. In our case, the target variable is the sales quantity and there is only one explanatory variable, the date. Therefore, the linear model we are looking for is as follows:

Quantity = α * Date + β

where Date = Year + Month/12 (transforming the date into a number). Once we have the values of α and β, we can set up the linear function; inputting a new value of Date into this function, we can expect an output of Quantity.

TODO: put graph here

Kernel regression

As a non-parametric method, kernel regression aims at finding a non-linear relation between an explanatory variable X and a target variable Y. The regression function of Y on X is

m(x) = E(Y | X = x)

where m(x) is the regression function to be estimated. There are different estimators available for the regression function, and we pick a commonly used one, the Nadaraya-Watson estimator:

\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i) y_i}{\sum_{i=1}^{n} K_h(x - x_i)}

where K is a kernel used as a weighting function and h is a bandwidth. For example,

K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}

is a commonly used kernel function. The corresponding kernel density estimate can be written as

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right).

Let us take an example where X has n points. At each point, the kernel regression technique takes the points within the bandwidth around it and applies the weighting function K(u) to their Y values. We then take the average of these weighted Y values, and this average becomes the new target value of that point. In essence, it smooths the values with a kernel function, and can thus build a model which fits the given data well.

TODO: put graph here
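A plain-Scala sketch of the Nadaraya-Watson estimator with the Gaussian kernel above; xs and ys stand for the observed dates and quantities (in the project, this kind of custom regression is precisely what R is kept for):

    // Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)
    def gaussian(u: Double): Double = math.exp(-u * u / 2) / math.sqrt(2 * math.Pi)

    // Nadaraya-Watson: a weighted average of the y values with weights K((x - x_i) / h).
    def nadarayaWatson(xs: Array[Double], ys: Array[Double], h: Double)(x: Double): Double = {
      val weights = xs.map(xi => gaussian((x - xi) / h))
      val weightedSum = weights.zip(ys).map { case (w, y) => w * y }.sum
      weightedSum / weights.sum
    }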
LOESS regression

The other non-parametric technique we employ is LOESS regression. As in kernel regression, an operation is carried out at each point: LOESS regression fits a low-degree polynomial function to a subset of the data around every point in the data set. Such a subset consists of the neighbouring points of a point. The polynomial is fitted using weighted least squares, in which the central points gain more weight and the points further out on both sides gain less. It is also a kind of data smoothing. The size of the subset is determined by a bandwidth, as in kernel regression. In comparison with other regression methods, LOESS regression does not need a specific function for the model; it only fits a polynomial on each subset of the data. Moreover, its flexibility makes it one of the best choices for sophisticated data models. On the other hand, it is a computationally expensive technique. TODO: graph

Calculation of seasonality

As mentioned before, in time-series models there is another component that varies the values: the seasonality. The sales of some products correlate strongly with seasons, months, weeks or even days. To adapt the model to this kind of product, we introduce seasonal adjustments. Seasonal adjustments are coefficients, calculated for each time period, with which the model can make its values more adaptive to the real data. We use two approaches to compute these coefficients for each month of the year.

In the first approach, we assume that there is a relation between the current month's sales quantity and the previous month's sales quantity. It can be expressed by the following formula:

Sales_m = SA_m * Sales_{m-1}

where SA = SeasonalAdjustment. We need to add a new column containing the previous month's sales quantity to the data table, as shown in table 3.4, so that we can compute the coefficients for each month.

reference   year   month   quantity   quantity1
1312114     2014   11      534        556
1312114     2014   12      614        534
...         ...    ...     ...        ...
1312115     2014   11      317        289

Table 3.4: Prepared data: seasonal coefficients computation, solution 1

In this case, the first month of each reference cannot be taken into account in the calculation of the seasonal coefficients, since it has no previous month's sales.
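A sketch of this first computation, under our assumption that the monthly coefficient is obtained by averaging the ratio quantity / quantity1 over all usable rows of the prepared table:

object SeasonalSolution1 {
  // One row of the prepared table 3.4; prevQuantity is None for the
  // first recorded month of a reference.
  case class Row(reference: Long, year: Int, month: Int,
                 quantity: Double, prevQuantity: Option[Double])

  // SA_m = average over all usable rows of quantity / quantity1.
  def coefficients(rows: Seq[Row]): Map[Int, Double] =
    rows
      .collect { case Row(_, _, month, q, Some(prev)) if prev > 0 => (month, q / prev) }
      .groupBy { case (month, _) => month }
      .map { case (month, ratios) => month -> ratios.map(_._2).sum / ratios.size }
}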
In the second solution, the hypothesis is that a relationship exists between the current month's sales quantity and the average of the sales quantities of the 4 previous months. This leads to:

Sales_m = SA_m * avg(Sales_{m-1}, Sales_{m-2}, Sales_{m-3}, Sales_{m-4})

As in the previous solution, we need to transform the data table. After the transformation, we have a table as follows (table 3.5):

reference   year   month   quantity   quantity1   quantity2   quantity3   quantity4
1312114     2014   11      534        556         515         478         533
1312114     2014   12      614        534         533         515         478
...         ...    ...     ...        ...         ...         ...         ...
1312115     2014   11      317        289         313         337         329

Table 3.5: Prepared data: seasonal coefficients computation, solution 2

Because 4 previous months are required, we cannot use the first 4 months of sales of each reference to compute the coefficients.
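The second solution can be sketched the same way; the only change, under the same assumptions as before, is that the denominator becomes the average of the four previous months:

object SeasonalSolution2 {
  // One row of the prepared table 3.5; prev4 holds quantity1..quantity4.
  case class Row(reference: Long, year: Int, month: Int,
                 quantity: Double, prev4: Seq[Double])

  // SA_m = average over all usable rows of quantity / avg(previous 4 months).
  def coefficients(rows: Seq[Row]): Map[Int, Double] =
    rows
      .collect { case r if r.prev4.size == 4 && r.prev4.sum > 0 =>
        (r.month, r.quantity / (r.prev4.sum / 4.0))
      }
      .groupBy { case (month, _) => month }
      .map { case (month, ratios) => month -> ratios.map(_._2).sum / ratios.size }
}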
Clustering

We face a difficulty in terms of data: what we have is not enough to train one series of seasonal coefficients per product. For each product, the longest sales history available covers about a year and a half, and for some recently launched products we have less than one year of historical sales. With fewer than several years of records per product, seasonal adjustments calculated from a single product would overfit its data. To be more precise, the model would assume that next year's sales variation will be exactly the same as this year's, which is quite rare in reality. To deal with this lack of information, we bring in another machine learning technique: clustering. The objective is to use clustering to gather products that show similar variations over time into the same subset, and then to compute one series of seasonal coefficients per subset rather than per product.

Let's focus on the mechanism of clustering techniques. Clustering is an unsupervised learning method that organises a group of objects sharing similar characteristics. Imagine we have a large set of data and we want to split it into subsets, so that we can learn from each subset instead of from one huge set of data; learning from a smaller set often makes it easier to obtain the expected information. The following figure 3.6 shows the partition of data into clusters:

Figure 3.6: Clustering

Clustering, which aims to find structure within a given set of data, can therefore be applied in this case. Some commonly used clustering models [6] are:

• centroid models: k-means;
• connectivity models: hierarchical clustering;
• distribution models: expectation-maximisation;
• graph theory models: the Highly Connected Subgraph (HCS) algorithm.

Among these techniques, we are interested in the application of the k-means algorithm and of hierarchical clustering.

K-means

K-means is the most commonly used clustering algorithm and it is quite easy to understand [7]. Assuming that we have a dataset containing n objects, the objective of the k-means algorithm is to define k clusters, each containing objects with similar behaviours and characteristics. Each cluster has a centroid (mean), the point from which the distances of the objects are calculated. The criterion for partitioning the objects into clusters is to minimise the within-cluster sum of squared distances:

\min \sum_{j=1}^{k} \sum_{i=1}^{n_j} \| x_i^{(j)} - c_j \|^2

where n_j is the number of objects in cluster j, x_i^{(j)} is the i-th object in the j-th cluster and c_j is the centroid of the j-th cluster.
First of all, the algorithm initialises the clusters. There are two commonly used initialisation methods. The first is the random partition method, which randomly assigns a cluster to each object and then computes the initial centroid (mean) of each cluster; it tends to place the initial centroids close to the centre of the dataset. The other is the Forgy method, which picks k observations from the dataset at random as the initial centroids; in comparison with the random partition method, it spreads the initial centroids out over the dataset.

After initialisation, the algorithm goes through all the objects; for each object, it computes the distance to every centroid and assigns the object to the nearest cluster. The distance measurement used in this algorithm is the Euclidean distance:

distance = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}

where n is the number of dimensions.

Once the algorithm has gone through all the objects and finished the reassignment, it proceeds to the next step: the update. The aim of this step is to update the centroid of each cluster. During the reassignment some objects enter each cluster and some leave it, so the centroid has to be recomputed from the new members. The process repeats the reassignment step and the update step in a loop until the membership of each cluster no longer changes. At that point every observation has found the cluster it belongs to, together with the other objects it is similar to.
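A compact Scala sketch of this loop, using Forgy initialisation and the Euclidean distance defined above; the structure and names are ours, and an empty cluster simply keeps its old centroid:

import scala.util.Random

object KMeansSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, c) => (x - c) * (x - c) }.sum)

  def kmeans(points: Seq[Point], k: Int, maxIter: Int = 100): Array[Point] = {
    // Forgy initialisation: pick k observations as the initial centroids.
    var centroids = Random.shuffle(points).take(k).toArray
    val assignment = Array.fill(points.size)(-1)
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      changed = false
      // Reassignment: each point goes to its nearest centroid.
      for ((p, i) <- points.zipWithIndex) {
        val nearest = centroids.indices.minBy(j => dist(p, centroids(j)))
        if (nearest != assignment(i)) { assignment(i) = nearest; changed = true }
      }
      // Update: recompute each centroid as the mean of its cluster members.
      centroids = centroids.indices.map { j =>
        val members = points.zipWithIndex.collect { case (p, i) if assignment(i) == j => p }
        if (members.isEmpty) centroids(j)
        else members.transpose.map(_.sum / members.size).toArray
      }.toArray
      iter += 1
    }
    centroids
  }
}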
Hierarchical clustering

Unlike the k-means algorithm, hierarchical clustering aims to build a hierarchy of clusters, and it does not require a predefined number of clusters. There are two ways (as shown in the dendrogram in figure 3.7) to build the hierarchy: an agglomerative way and a divisive way. The agglomerative ("bottom up") way starts from the individual observations at the bottom and gradually aggregates observations into clusters, and clusters into larger clusters, up to the top of the hierarchy. The divisive way is the complete opposite: a "top down" model that starts from the top and splits gradually down to the individual observations.

Figure 3.7: Hierarchical clustering

The biggest disadvantage of hierarchical clustering is its complexity. In general, the agglomerative method has a complexity of O(n^3) and the divisive method a complexity of O(2^n). Both are very expensive and thus require a lot of computation; when the number of observations is large enough, the agglomerative method is less computationally expensive than the divisive one.

In the agglomerative method, how do we decide which clusters to combine? To do this, we introduce a distance as a measurement of the similarity between two clusters. Different distance measures affect the shape of the clusters differently, because under different distance standards the same observations may be closer to or further from each other. Commonly used distances are:

• Euclidean distance: \sqrt{\sum_i (x_{1i} - x_{2i})^2};
• maximum distance: \max_i |x_{1i} - x_{2i}|;
• Manhattan distance: \sum_i |x_{1i} - x_{2i}|.

Besides the distance, the agglomeration (linkage) method is another factor determining how close two clusters (or observations) are. Some commonly used agglomeration methods are:

• complete linkage: \max \{ dist(c_1, c_2) : c_1 \in C_1, c_2 \in C_2 \};
• single linkage: \min \{ dist(c_1, c_2) : c_1 \in C_1, c_2 \in C_2 \};
• average linkage: \frac{1}{|C_1||C_2|} \sum_{c_1 \in C_1} \sum_{c_2 \in C_2} dist(c_1, c_2).
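As an illustration, the following sketch implements the agglomerative ("bottom up") variant with single linkage and the Euclidean distance; it is the naive O(n^3)-style formulation discussed above, with names of our choosing:

object HierarchicalSketch {
  type Point = Array[Double]

  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Single linkage: the distance between two clusters is the distance
  // between their closest pair of members.
  def linkage(c1: Seq[Point], c2: Seq[Point]): Double =
    (for (a <- c1; b <- c2) yield euclidean(a, b)).min

  // Repeatedly merge the two closest clusters until `target` clusters remain.
  def cluster(points: Seq[Point], target: Int): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = points.map(Seq(_))
    while (clusters.size > target) {
      val pairs = for (i <- clusters.indices; j <- clusters.indices if i < j)
                  yield (i, j, linkage(clusters(i), clusters(j)))
      val (i, j, _) = pairs.minBy(_._3)
      val merged = clusters(i) ++ clusters(j)
      clusters = clusters.zipWithIndex
        .collect { case (c, idx) if idx != i && idx != j => c } :+ merged
    }
    clusters
  }
}

Stopping the merging once a chosen number of clusters remains corresponds to cutting the dendrogram at a certain height, which is exactly the choice discussed next.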
Once the hierarchy is constructed, we can choose the number of clusters that we wish by cutting the dendrogram at the corresponding height, as shown in figure 3.8.

Figure 3.8: Hierarchical clustering with determined clusters
References

[1] Real Carbonneau, Kevin Laframboise, and Rustam Vahidov. Application of machine learning techniques for supply chain demand forecasting. European Journal of Operational Research, 184(3):1140–1154, 2008.

[2] F. L. Chen and T. Y. Ou. Sales forecasting system based on Gray extreme learning machine with Taguchi method in retail industry. Expert Systems with Applications, 38(3):1336–1345, 2011.

[3] Rainer Gemulla, Erik Nijkamp, Peter J. Haas, and Yannis Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '11, pages 69–77, 2011.

[4] Michael D. Geurts and J. Patrick Kelly. Forecasting retail sales using alternative models. International Journal of Forecasting, 2(3):261–272, January 1986.

[5] A. Gunasekaran. Supply chain management: Theory and applications. European Journal of Operational Research, 159(2):265–268, 2004.

[6] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

[7] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8):651–666, 2010.

[8] Ken-ichi Kamijo and Tetsuji Tanigawa. Stock price pattern recognition: a recurrent neural network approach. pages 215–221.

[9] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132:1–63, 1997.

[10] Chi-Jie Lu and Yuehjen E. Shao. Forecasting computer products sales by integrating ensemble empirical mode decomposition and extreme learning machine. Mathematical Problems in Engineering, 2012:1–15, 2012.

[11] James T. Luxhøj, Jens O. Riis, and Brian Stensballe. A hybrid econometric-neural network modeling approach for sales forecasting. International Journal of Production Economics, 43(2-3):175–192, June 1996.

[12] Paris A. Mastorocostas, John B. Theocharis, and Vassilios S. Petridis. A constrained orthogonal least-squares method for generating TSK fuzzy models: Application to short-term load forecasting. Fuzzy Sets and Systems, 118(2):215–233, March 2001.

[13] Sherri Rose. Big data and the future, 2012.

[14] Zhan-Li Sun, Tsan-Ming Choi, Kin-Fan Au, and Yong Yu. Sales forecasting using extreme learning machine with applications in fashion retailing. Decision Support Systems, 46(1):411–419, 2008.

[15] The Apache Software Foundation. Apache Hadoop. Accessed 17/05/2015.

[16] Sébastien Thomassey and Antonio Fiordaliso. A hybrid sales forecasting system based on clustering and decision trees. Decision Support Systems, 42(1):408–421, 2006.

[17] Sébastien Thomassey and Michel Happiette. A neural clustering and classification system for sales forecasting of new apparel items. Applied Soft Computing Journal, 7(4):1177–1187, 2007.

[18] John T. Mentzer and Mark A. Moon. Time Series Forecasting Techniques. 2004.

[19] Mircea Răducu Trifu and Mihaela Laura Ivan. Big data: present and future. Database Systems Journal, 5(1):32–41, May 2014.

[20] W. K. Wong and Z. X. Guo. A hybrid intelligent model for medium-term sales forecasting in fashion retail supply chains using extreme learning machine and harmony search algorithm. International Journal of Production Economics, 128(2):614–624, 2010.

[21] G. Peter Zhang. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50:159–175, January 2003.

[22] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, volume 6, page 116, 2004.