The Use of Neural Networks
for Tendency Prediction in
Financial Series
September 20, 2014
Estudio del uso de redes neuronales en la predicción de
tendencias en series de finanzas
Proyecto fin de carrera (final degree project)
Universidad Politécnica de Valencia
Escuela Técnica Superior de Ingeniería Informática
Author: Juan Francisco Muñoz Castro
Director: Salvador España Boquera
Co-director: Francisco Zamora Martínez
Abstract
In the present project, different types of artificial neural networks have been compared in order to analyze their behavior on noisy time series prediction, with the goal of maximizing the benefit obtainable by investing in them. To do so, a wide range of datasets has been used, containing stock market prices from January 2000 until September 2014. The starting experiment was a regular multilayer perceptron using a sliding window of the latest values as the input of the network and three outputs representing three possible actions: buy, sell or keep. Further experiments have been tested, such as replacing the three-output classifier by a single output, converting the system into a forecasting model; or using different averages of recent values instead of a simple sliding window as the network's input. The use of a single dataset has also been tested, where each sample is used first to test and validate the network, and later to train it in a new step, instead of the traditional training-validation-test splitting of the data. Finally, two new models that seize all the data have been tested, one with a specific data-validation period, and the other with an implicit one, skipped by pre-training the networks. After comprehensively applying these methods to the time series, certain predictability was found. Some networks were able to predict the direction of change for the next day with an error rate of around 40%, which in some optimistic cases decreases to about 30% when rejecting examples where the system has low confidence in its prediction. A practical simulation is also explained, showing an average gain close to 0.33% while acting half of the time.
1 Introduction
1.1 Motivation
Since its existence, the stock exchange has been one of the most important indicators, or even predictors, of the worldwide economy. With an average daily trading value of 169 billion dollars during 2013 in the New York Stock Exchange alone, this indicator shows how important it is for the economy. Because of this, many attempts to predict it have been made, some more successful than others, but never with outstanding results. In fact, the idea that the market is completely unpredictable is widely accepted, mainly because its value is driven by news, which is unpredictable by definition; this would make the following values of the stock market depend exclusively on the present and future, never on the past. This idea is asserted by the efficient-market hypothesis (EMH), which states that stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices.
In contradiction to the EMH, there are two main types of analysis: fundamental, which is the process of looking at a business at the basic financial level; and technical, which is the methodology for forecasting the direction of prices through the study of past market data.
Numerous articles have been published based on technical analysis, several of them using Artificial Neural Networks, which show certain predictability in these financial series in contrast to the previous statement, and this reinforces the initial motivation of this project. It is in this area that the present project has been developed, trying to predict the tendency of different recent stock market instruments, comparing the results for each technique as well as determining the predictability of the different instruments coming from different scopes.
Apart from merely technical reasons, there is the basic underlying motive of financial gain. A system able to determine the trends of the market with good reliability is an extraordinary tool that many investors and researchers are continuously looking for in the search for high returns on their investments.
1.2 Objectives
The objective of this project is to experiment with, analyze and explain how different types of Artificial Neural Networks can predict future values of financial series, based on technical analysis, which simply uses the historical prices.
Provided with datasets of daily market data, it will be assumed that one action can be carried out per day at the stock market opening time, and that it will be canceled at the end of the same day. With this premise, the main objective is to maximize the benefit obtained by investing a given amount of $100, measured as the ratio of benefit per movement or the percentage of success, from a financial angle. From a more technical perspective, we will analyze the behavior of the parameters that affect the evolution of the Neural Networks, both input parameters and output measurements.
1.3 Stock market basics
First of all, the definition of a stock exchange will be given, which, according to Wikipedia, is a form of exchange that provides services for stock brokers and traders to trade stocks, bonds, and other securities. There are two possible ways of taking part in the stock market:
• Buying stocks: the current price of the stock is paid, and whenever the stock is sold the money worth of that stock is simply given back to the investor; so if the stock has increased its price, this difference will be gained, and if it has decreased its price, the difference will be lost.
• Short selling stocks: in this case, the investor is lent stocks that are sold instantly, with the commitment to give these stocks back; therefore it will eventually be necessary to buy them again in order to return them to the lender. In colloquial terms, it can be said that this is a bet on the stock market going down; the lower it goes, the more benefit the investor gets, but also the higher it goes, the more money will be lost.
• Staying away from the market could be considered a third action, as there is no need to always participate actively in the market, and this is probably the most important part of investing: knowing when to stay away. This way the money is kept, so there is no risk as well as no possible benefit.
Any non-professional investor can freely buy and sell any kind of instrument using a broker as an intermediary, which is typically a piece of computer software. There are plenty of programs available online, and they mostly work with commissions, meaning that they keep a small amount of money for each transaction the client makes. This is one of the main obstacles found when someone wants to get hands-on with non-professional investing: the initial negative odds. To have an initial idea of the fees these programs operate with, a standard broker charges around 0.01% of each transaction, whether buying or short selling stocks. On the one hand, in the long term this becomes a large amount of money taken; on the other hand, a random investing strategy, when the market remains stable in the long term, is very prone to end up with losses.
1.4 Structure of this report
The present document is divided as follows: in section 1 a brief introduction has been presented, together with some basics of the stock market; section 2 will explain the basics of time series prediction, mainly regarding neural networks. The experimentation process will be explained in section 3, and the models used during this process in section 4, with their results in section 5. A combined model will be explained in section 6; an overview of the problems found will be shown in section 7; and at the end of the document the conclusions and some interesting future work will be presented, in sections 8 and 9.
2 Time series prediction
2.1 Artificial Neural Network basics
Before going through the background of the different approaches, a quick overview of Artificial Neural Networks (ANN) should be given, as they are one of the basic common tools of several approaches. An ANN is a computational model capable of machine learning, generally presented as a system of interconnected neurons which can compute values from inputs. These neurons harbor numerical values and are typically grouped in sets called layers. A minimum of two layers is needed to set up a neural network: one to read the inputs, with one neuron per input value; and another one to write the outputs, with one neuron per output as well.
One of the most popular types of network is the multilayer perceptron, where every neuron of a layer is connected in only one direction to every neuron of the following layer, so that each neuron is reached by all the neurons of the predecessor layer and reaches all the neurons of the following layer, if any. Every layer that is not the input or output one is called a hidden layer, and an ANN can consist of one or more hidden layers. Figure 1 shows the architecture of these multilayer artificial neural networks:
Figure 1: Basic Artificial Neural Network with one hidden layer.
The Figure shows an Artificial Neural Network with an input layer X of n neurons, a hidden layer Z with p neurons, and an output layer Y with m neurons. Each single connection contains a weight, shown in the graph as V or W with two subscripts representing the positions of the reached and the reaching neurons in their corresponding layers. To compute the output layer neurons' values, the following formula is applied to every neuron of every layer, sorted in order from the input to the output, updating each neuron's value to a after the formula is calculated:

a = f\Big(\sum_{i=1}^{m} p_i w_i + b\Big)

where the p_i are the values of the m predecessor neurons, the w_i the weights of their connections, and b the bias.
This way, each layer requires the completion of the predecessor layer's computations. Also, an activation function f is applied to the output value with the aim of reaching better or quicker learning. Typical choices are the linear or the sigmoid function [6], the latter emulating the behavior of the step function, which would provide a more aggressive learning, as its output is always either 1 or 0. The formula of the sigmoid function is as follows:

\sigma(t) = \frac{1}{1 + e^{-\beta t}}

where the greater the beta, the closer the sigmoid is to the ideal step function, but a too large beta will lead to longer computational times. In Figure 2, the difference between a sigmoid function with beta = 1 and the step function can be verified.
Figure 2: Sigmoid (left) and step (right) functions.
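As an illustration of the two formulas above, the computation of a single neuron's output can be sketched in Python (a minimal sketch; the weights, bias and input values are invented for the example):

```python
import math

def sigmoid(t, beta=1.0):
    # Sigmoid activation: approaches the ideal step function as beta grows.
    return 1.0 / (1.0 + math.exp(-beta * t))

def neuron_output(inputs, weights, bias, beta=1.0):
    # a = f(sum_i p_i * w_i + b), with f the sigmoid activation.
    s = sum(p * w for p, w in zip(inputs, weights)) + bias
    return sigmoid(s, beta)

# Example: two predecessor neurons feeding one neuron (arbitrary values).
a = neuron_output([0.5, -1.2], [0.8, 0.3], bias=0.1)
```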
As the training of the networks is a bit more complex and is not essential for the understanding of this document, we will not go into too much mathematical detail. Suffice it to mention that the most common way to train the network is with backpropagation of errors, proceeding from the output layer back to the input layer, where the gradient of a loss function is calculated with respect to all the weights in the network. This gradient is afterwards used to update the weights of the connections, together with some parameters such as the learning rate or the momentum, tuning the network with the aim of making it more accurate. Further information can be found in plenty of books and articles [4][10][11][13]. The learning rate is a ratio that is multiplied by the gradient to update the weights of the neurons. It influences the quality and speed of the training: the greater the learning rate, the quicker the network learns; the lower the ratio, the more accurate the training. In Figure 3, a small learning rate is shown on the left, where the problem converges very slowly, and a learning rate that is too big is shown on the right, where the problem diverges. Both learning rates are applied to the same problem, where the aim is to find the minimum error (x axis), with different results.
Figure 3: Effect of a small value of the learning rate (left) and of a too large one (right) on a training curve.
The momentum is a parameter that represents what could be called the inertia of the learning, extending the current update in a proportion given by this parameter. A momentum equal to zero does not affect the original learning of the net, and a greater momentum allows it to train faster and might prevent the network from getting stuck in local minima. On the other hand, a momentum that is too big means that the ANN will learn too fast and will probably miss the global minimum that the network is looking for.
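The weight update driven by the learning rate and the momentum can be sketched as follows (a toy gradient-descent loop on a one-dimensional quadratic error, not the actual APRIL-ANN training code; the learning-rate and momentum values are arbitrary):

```python
def train_step(w, grad, velocity, learning_rate=0.1, momentum=0.9):
    # The gradient is scaled by the learning rate; the momentum term
    # carries over a fraction of the previous update (the "inertia").
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# Toy problem: minimize E(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w, v = 0.0, 0.0
for _ in range(300):
    w, v = train_step(w, 2.0 * (w - 3.0), v)
```

With these values the iteration oscillates around the minimum before settling, which is exactly the behavior a too-large momentum would amplify.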
The utility of neural networks mainly resides in the fact that they can be used to infer a function from observations. The scopes where artificial neural networks can be applied are as wide as pattern recognition, game-playing, decision making, spam filtering, sequence recognition and many more.
2.2 Different techniques
For the problem at hand, many approaches have been tried in order to predict the tendency of markets. In terms of Artificial Neural Networks, most articles focus on Recurrent Neural Networks, which are a kind of network where connections between units form a directed cycle. This creates an internal state which allows the network to exhibit dynamic temporal behavior [7]. These kinds of networks are suitable for predicting time series, but their main drawback is the difficulty they have in converging, which becomes a bigger problem with highly noisy series such as stock market ones. Different processes have been applied to these networks to improve their results, like self-organizing maps or grammatical inference [9].
Other techniques that differ from ANN have been used as well, such as Support Vector Machines [12], Genetic Algorithms [8], or combinations of different models, techniques and approaches in order to maximize the results. Popular models in this area for combining results are boosting and bagging [15], which act as an add-on to the initial models to try to perform better.
2.3 Proposed approach
After reviewing several types of approaches together with their results and complexity, the decision made was to start the experiments with a simple regular multilayer perceptron (MLP), using backpropagation as its training method. A regular neural network is a relatively simple tool that has good predictive potential if the data is well organized, and this, together with the fact that none of the methods mentioned in the above section has shown outstanding results even though they are more complex, leads to the use of an initial MLP to perform this task. Afterwards, some modifications will be added to the basic model with the objective of improving its performance, which will be explained in further sections, and comprise things such as replacing the initial input layer of a list of values by different averages of the values, or the output layer of a binary vector by a single rational number. Additionally, modifications in the architecture will be considered, as well as an exhaustive scan of the different parameters that might affect the results obtained. Slightly more complex modifications will be made, like substituting the traditional way of splitting the data to train and test the network by a new model where an overlapping of samples is considered with the goal of seizing the data better, or a hybrid system in between the traditional model and the overlapping one.
3 Experimentation process
3.1 Tools used
To carry out all the experimentation of the project, many tools and elements have been considered, and several of them used. One of the most important parts, as mentioned in previous sections, has been the Yahoo Finance platform, which gathers historical data from the main stock markets and allows anyone to download it. Regarding the software used, the first attempt was to use Theano, a Python library, but after a few experiments it was decided to change to APRIL-ANN [1], which is based on the scripting language Lua [2]. It was mainly chosen because it is developed solely for working with Artificial Neural Networks, in order to improve in terms of efficiency. It was additionally chosen due to the fact that both the director and co-director of the present project take part in its development. All the pre-processing and post-processing of the data has been done with Python, for the mere reason of familiarity with it and it being a powerful scripting language. Different external Python libraries have been used for different purposes, such as the library urllib for downloading the data from the Yahoo platform, the library csv for working with such files, multiprocessing and threading to speed up the process, or typical handy Python libraries such as math or collections. These were all set up on an Intel Core 2 Duo (2.00 GHz) with 4 GB of RAM, running Ubuntu Linux 13.10.
3.2 Basic strategy and data used
The stock market offers many possibilities, permitting investors to buy, sell and keep whatever and whenever they want. For this reason, some boundaries need to be put on the system before establishing a forecasting model, so that it can be studied more easily. Given that one of the most popular sources of public historical stock market data is Yahoo Finance, this platform will be used, as it has daily data available since the early nineties. The daily data provided by Yahoo Finance contains, for each day, its date, opening value, maximum value, minimum value, closing value, its real volume and an adjusted closing value, which is the closing value modified when dividends are paid. Back to the system, the boundaries are set as follows:
• From the historical data, only the date and the percentage of change will be used, which is calculated as the relative difference between the adjusted closing value of one day with respect to the same value of the day before.
• The investing strategy will be to perform an action at the opening time and keep it until the closing time of the same day. This means that the adjusted closing value of the day before will be used as the initial stock value and the adjusted closing value of the current day as the last value.
• The focus will be put on trying to predict the direction of change of the market, instead of predicting the value itself, emphasizing the practical financial side more than the precision of the predictions, although both measurements are closely related.
• All the historical data up to one point is available to predict the direction of change of that point, meaning that if tomorrow's change is to be predicted, all the data until today would be available.
• The initial date used as the beginning of the data will start from different points in time for different experiments, but it will never be older than 1st January 2000.
• For the first experiments, a stock from the Spanish stock market index IBEX 35 has been used, arbitrarily chosen by alphabetical order as a regular stock, Abengoa Abertis. Other series will be analyzed below to get a better understanding of the series' predictability.
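The percentage of change mentioned in the first boundary can be computed as follows (a minimal sketch; the adjusted closing prices are invented for the example):

```python
def percent_changes(adjusted_closes):
    # Relative difference of each day's adjusted close with respect
    # to the previous day's, expressed as a percentage.
    return [100.0 * (curr - prev) / prev
            for prev, curr in zip(adjusted_closes, adjusted_closes[1:])]

# Four days of (invented) adjusted closing prices -> three daily changes.
changes = percent_changes([10.0, 10.5, 10.5, 9.45])
```

Note that the first day of the series is lost, since it has no previous value to compare against, as discussed later in the preprocessing section.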
3.3 Performance measurement
The first thing that has to be set is a common evaluation model for all the experiments. Regarding ANNs, the main measurement is the error obtained on the validation and test datasets, which represents how well the network has learned the samples of these datasets. Typical errors that have been used in this experiment include the Mean Squared Error (MSE), which is the square of the difference between the estimator and what is estimated; or the cross-entropy error, which gives an estimation of how similar two distributions are.
From a more financial point of view, different ways of measurement are needed, which go further than the purely mathematical ones. One of them is the percentage of success, which basically is how often the selected action is right. The main disadvantage of this method is that not all the actions have the same effect: for instance, assume four days when the market goes up 0.1% and a fifth when it goes down 3.4%. The success rate would be 80%, but more than 3% of the money would have been lost. This is not the most common of cases, but it is something worth considering.
Another way of measuring the effectiveness, which has been used in several articles regarding stock market prediction, is a simulation of the actions. Supposing an initial capital of $100, the actions predicted by the system are applied to this amount of money, which is modified according to the real series' fluctuations. This method gives a very simple idea of how the system performs. Its main disadvantage is that it does not consider the number of actions performed. For instance, a final amount of $115 can be fairly good if just 10 actions have been undertaken, but it is a terrible result when 400 actions have been performed, mainly because of the commission applied by the brokers, as explained previously, which would end up in a loss of money. As a solution to this disadvantage, it can be proposed to divide the difference between the initial $100 and the final amount by the number of actions undertaken, but the problem would then be that the more money you are moving, the more impact each action receives, which would not be fair either.
As a last practical error measurement, we can use the average rate obtained in the simulation. Each time an action is performed, the original difference is added to the rate if the action is right and subtracted if it is wrong, dividing this value by the total number of actions performed. This way, an average percentage of the gain per transaction is obtained, with the main disadvantage of not knowing the number of actions. For instance, a rate of 0.7% over 50 actions during a period of 3 months is better than a rate of 1.2% in the same period when just one action has been performed. Something to consider here is that a positive rate does not always result in benefits at the end. An extreme example would be: starting with $100, first gaining 60% (reaching $160) and then losing 40% (losing $64, down to $96) would mean a sublime total of +20% in moves (an average of +10% per action) but a loss of $4 at the end.
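The simulation and the average rate per action described above can be sketched together (a minimal sketch with invented actions and daily changes; broker commissions are ignored):

```python
def simulate(actions, daily_changes_pct, capital=100.0):
    # actions: +1 buy, -1 short sell, 0 stay out of the market, one per day.
    # A right action gains the day's move; a wrong one loses it.
    rate_sum, n_actions = 0.0, 0
    for action, change in zip(actions, daily_changes_pct):
        if action == 0:
            continue
        gained = action * change          # signed percentage gained that day
        capital *= 1.0 + gained / 100.0
        rate_sum += gained
        n_actions += 1
    avg_rate = rate_sum / n_actions if n_actions else 0.0
    return capital, avg_rate, n_actions

# Extreme case: a +60% move followed by a -40% move, both acted on with a buy.
# The signed moves sum to +20% (an average of +10% per action), yet the
# capital ends at $96, below the initial $100.
final, rate, n = simulate([1, 1], [60.0, -40.0])
```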
To sum up, there is no perfect error measurement for this problem, but there are several that, combined, can give a very good idea of how the system performs. All of them will be used in order to contrast the results obtained with each of them, mainly focusing on the last one, the ratio of benefits, but always keeping an eye on the number of movements carried out.
Lastly, the results will be compared with a few simple strategies, such as a Random Walk or the evolution of the market itself, which would be buying the stocks on the first day of the determined period and keeping them until the last.
3.4 Data preprocessing
Before starting with the experimentation itself, the data must be shaped in a way that can be easily read by APRIL-ANN. The first thing to do is to download the historical financial series, as mentioned above, of Abengoa Abertis (ABE.MC in Yahoo Finance), due to it being a regular share in the Spanish Stock Exchange; the data interval will be from 1st January 2000 to 1st April 2014. The period of time used only for prediction will start on 1st September 2013 onwards, a total of 7 months or 151 days of activity in the Madrid Stock Exchange, and will be the same for all series, so that the results can be compared afterwards.
The first step is to represent each single day of this more than 14-year series as its date plus a single number representing the relative difference with respect to the day before. With this, we lose the first and last elements of the series, because we do not know the difference between 1st January 2000 and the last day of 1999, and the same applies to April 2014, leaving us now with 150 days of activity. Nevertheless, it is still far better than using absolute prices of the shares, which can vary in terms of magnitude in a matter of days.
Once each single day's price difference of the series is calculated, the input and output of the network have to be generated from them. In the first and most basic experiment, a regular multilayer perceptron will be used, where the input of the network consists of a sliding window of length N along the series. With this method, the input of the network will be the values from time t-N to t for predicting the value of t+1, as Figure 4 shows:
Figure 4: Time line showing the sliding window used in order to predict the values of t+1 and t+2 respectively.
When the value at t is available, the window from t-N to t is used to predict t+1, and when t+1 is available, the window slides one position, from t-N+1 to t+1, in order to predict t+2.
The output used in this first model consists of a binary vector of three elements for each sample, representing the ideal action to perform on that day, according to the tendency of the series: down {1,0,0}, remains {0,1,0} or up {0,0,1}. As the aim is to maximize the benefits, the market going down will be understood as a sign to sell, the market remaining constant as a sign to not perform any action, and the market going up as a sign to buy. The threshold used to decide when to remain is Abs(value) < 0.65, meaning that the ideal action would be buying when the share increases its price by more than 0.65%, selling when the share's price change is -0.65% or lower, and remaining inactive otherwise. With this distribution, there will be approximately one third of each of the actions along the series.
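The preprocessing just described, sliding window plus three-class output, can be sketched as follows (a minimal sketch on an invented list of daily percentage changes; the 0.65 threshold is the one used in the text):

```python
def make_samples(changes, window_size, threshold=0.65):
    # Input: the window of values preceding day t; target: a one-hot
    # vector for the change at day t, ordered as {down, remain, up}.
    samples = []
    for t in range(window_size, len(changes)):
        window = changes[t - window_size:t]
        nxt = changes[t]
        if nxt <= -threshold:
            target = [1, 0, 0]        # down -> sell
        elif nxt >= threshold:
            target = [0, 0, 1]        # up -> buy
        else:
            target = [0, 1, 0]        # remains -> keep
        samples.append((window, target))
    return samples

# Six invented daily changes, window of length 3 -> three samples.
samples = make_samples([0.2, -1.0, 0.7, 0.1, -0.9, 1.3], 3)
```

The loop also shows why the window size is subtracted from the series length, as discussed below.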
Another matter to consider is the initial length of the series to analyze, as well as other parameters like the size N of the sliding window. Different values for both these parameters will be tested and analyzed in further sections, but the concern for the moment is the repercussion of these parameters on the final length of our series. A starting date for the series will be needed, meaning that no data prior to that date will be available for the experiments at all, and the window size will need some initial data before the first sample is available. For instance, if the starting date is 1st October 2012 and the window size is 4, the first sample available will be on 4th October, because the first 4 days will be used to generate this sample. The second sample will contemplate the values from the 2nd to the 5th of October, and so on. To sum up, it should be kept in mind that the window size has to be subtracted from the initial length of the series to obtain its final length, something that has the potential to cause some problems if it is not considered, mainly when using big window sizes and/or recent starting dates.
The last important part of the preprocessing is the splitting of the data. As previously mentioned, this first experiment will be a simple multilayer perceptron, so the data must be split into three datasets: one for the training of the network, another for a first validation of this trained network, and a third dataset for a second validation of the system, which comprises a fixed length from 1st September 2013 to 31st March 2014 in all the experiments, regardless of the size of the other datasets used. The remaining data, including all the samples older than the second validation period, will be split into training and validation 1, with a proportion of 0.75 for the first and the remaining 0.25 for validating the trained system. The data from April onwards will be used afterwards to test the network selected on the basis of its performance on the validation 2 dataset, as can be seen in Figure 5.
Figure 5: Time frame where the splitting of the data is shown for an undetermined starting date.
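The splitting shown in Figure 5 can be sketched as follows (a minimal sketch; the fixed 150-sample validation 2 length and the 0.75/0.25 proportion come from the text, while the sample list is invented):

```python
def split_datasets(samples, val2_length=150, train_fraction=0.75):
    # Samples are in chronological order: the newest val2_length ones form
    # the validation 2 set; the older remainder is split 0.75/0.25 into
    # training and validation 1.
    older, val2 = samples[:-val2_length], samples[-val2_length:]
    cut = int(len(older) * train_fraction)
    return older[:cut], older[cut:], val2

# 1000 invented, chronologically ordered samples.
train, val1, val2 = split_datasets(list(range(1000)))
```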
The problem mentioned previously can appear with this way of splitting the data, as depending on the starting date and the window size, the number of samples might not be enough to cover the whole set of dates needed for the second validation; or the data remaining for training and first validation might not be large enough after removing the validation 2 samples. In these cases, the experiments will simply not be considered. For example, it would not make sense to set up an experiment with data from 1st July 2013 and a window size of 30, basically because the dataset for training and validation 1 would have just 14 samples (10 training and 4 first validation), while the samples for the second validation would still be 150.
As a final comment, it should be remarked that the data classified as validation 2 is the typical testing dataset in the train-validation-test splitting, but a further test dataset will be used, and the best parameters will be chosen in order to maximize the results obtained within this validation 2 period. The real potential of the experiments will be shown on the test dataset, in a more recent period from 1st April 2014 onwards.
3.5 Post-processing of the data
Another highly important point of the experimentation is processing the data after the networks have been trained, a matter covered in this subsection. First, immediately after the training, the second validation dataset will be processed by the best network according to the first validation dataset, and its error will be output to a summary file created for each single configuration of parameters, where information on the evolution of the training is kept, such as the epoch where the best net was trained or the errors on both validation datasets.
The different performance measurements are calculated on the validation 2 dataset as well. First, for each sample of the dataset in sorted order, its predicted action is calculated and simulated on an amount of $100 from September 2013 to March 2014. It is taken into account whether the action was a success (buy
when the series goes up and sell when it goes down) or not, and the confidence of each action is stored as well, calculated from the ratio between the greatest neuron's output and the second greatest on a natural scale, after the activation function is applied. Assuming o1 as the greatest output and o2 as the second greatest, the ratio would be the exponential of their difference, and the confidence would be 1 - ratio. After every single sample of this dataset has been analyzed, summary information such as the number of actions, the rate per action or the success percentage is calculated, in order to have an outline of every different trained network.
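The confidence computation just described can be sketched as follows (a minimal sketch that assumes the three outputs are log-softmax values, as in the model used; the example numbers are invented):

```python
import math

def action_confidence(log_softmax_outputs):
    # With log-softmax outputs, exp(o2 - o1) is the ratio between the
    # second-greatest and the greatest probabilities on a natural scale;
    # the confidence is 1 minus that ratio.
    o1, o2 = sorted(log_softmax_outputs, reverse=True)[:2]
    return 1.0 - math.exp(o2 - o1)

# Invented log-softmax outputs for the three actions {sell, keep, buy}:
confident = action_confidence([math.log(0.8), math.log(0.15), math.log(0.05)])
doubtful = action_confidence([math.log(0.4), math.log(0.35), math.log(0.25)])
```

A clearly dominant output yields a confidence near 1, while two nearly tied outputs yield a confidence near 0.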
After the execution of each network, a trace of its behavior is saved in a corresponding file together with some interesting information, such as the confidence of each action performed. One of the problems faced after running the networks is that the number of actions might be too high, driving the results to a low performance. The first idea that comes up to solve this problem is to use a fixed threshold, so that all the actions with a confidence lower than this threshold are ignored. The problem that appears here is that in some executions no actions are performed at all, because the threshold is too restrictive, while in other executions the threshold does not bound any actions of the set. The solution is to use a variable threshold depending on the set of confidences of the series. A parameter indicating the percentage of actions to consider will be needed, and the action plan will be to sort the list of confidences and, with the parameter's help, choose the value that will act as the threshold. This way, different series with different confidences can be compared, because the task is done with relative numbers instead of absolute ones.
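The variable threshold based on the percentage of actions to consider can be sketched as follows (a minimal sketch; the confidence values and the fraction kept are invented for the example):

```python
def variable_threshold(confidences, keep_fraction):
    # Sort the confidences in descending order and take as threshold the
    # confidence of the last action inside the top keep_fraction, so only
    # the most confident actions pass it.
    ranked = sorted(confidences, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[n_keep - 1]

confidences = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 1.0]
threshold = variable_threshold(confidences, keep_fraction=0.3)
kept = [c for c in confidences if c >= threshold]
```

Because the threshold is relative to the set of confidences, the same fraction of actions is kept regardless of the absolute scale of the values.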
Further restrictions can leave some experiments out of consideration. One of them is a minimum number of movements required after the threshold is applied. A set of 150 samples is considered for the second validation, so, for instance, an experiment that ends up performing only one action out of these 150 possible ones cannot have very good odds, and it is discarded. Another constraint is the number of the best epoch, as networks that classify the data randomly are not desired. The starting weights of the network's connections are set randomly, and if after 200 epochs of training the first epoch is still the best one, something is going wrong: the training has not been able to improve on a random network, so it is discarded.
As an example, consider a results set where the predicted actions are 50 buy samples, 50 keep samples and 50 sell samples, which means a total of 100 proper actions. Assuming that the best epoch was high enough, a hypothetical top 5% of the results would be quite poor, because only one out of thirty actions would be performed, whereas a looser percentage such as 25% would probably be better, as one out of 6 actions would then be carried out. In practice, this parameter has to be examined as well in order to find the optimum percentage of samples to take into account. The minimum number of movements needed after thresholding is set to 8, the same value as the minimum best-epoch number required of the network's training.
4 Models used
4.1 The basic model
As mentioned before, the initial model is a regular multilayer perceptron with backpropagation as its training method. The first thing to do is to normalize all the data to zero mean and unit standard deviation, in order to equally distribute the data and facilitate learning. The neural network has one hidden layer with the logistic activation function in its neurons, and in the three neurons of the output layer the chosen function is the logarithmic softmax. The loss function used to train is the cross-entropy between the given input/target patterns, interpreting the ANN output as a multinomial distribution. A batch size equal to the number of training samples is used, meaning that all the samples are read before the network is actually updated, which means more computation time per step, but more accurate steps.
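The forward pass just described (logistic hidden layer, log-softmax over the three output neurons) can be sketched as below. This is an illustrative NumPy version, not the actual APRIL-ANN code, and the weight shapes are assumptions:

```python
import numpy as np

def normalize(series):
    """Zero-mean, unit-standard-deviation normalization of the data."""
    return (series - series.mean()) / series.std()

def log_softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def forward(x, w1, b1, w2, b2):
    """One logistic hidden layer, then a log-softmax output layer
    giving the log-probabilities of the three actions
    (buy, sell, keep)."""
    h = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))   # logistic activation
    return log_softmax(w2 @ h + b2)            # three log-probabilities
```

The exponentials of the three outputs sum to one, which is what lets them be read as a multinomial distribution over the actions.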
The initial weights are randomized between -0.1 and 0.1, with the purpose of having a neutral network before training. A pocket algorithm is used, meaning that the network with the best results is always kept, even if later training iterations worsen it. The network keeps training until the current epoch's error is twice as big as the error of the best epoch, with a minimum of 200 iterations and a maximum of 3000. The parameters needed to tune the training, namely the size of the hidden layer, the learning rate and the momentum, are given as arguments of the APRIL-ANN program, so that bash scripts can afterwards wrap the execution of the network together with its dependent scripts.
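The pocket algorithm and the stopping criterion can be sketched as below (a generic Python loop under stated assumptions: `train_epoch` performs one backpropagation epoch in place and `validation_error` scores the network; both stand in for the real APRIL-ANN calls):

```python
import copy

def train_with_pocket(net, train_epoch, validation_error,
                      min_epochs=200, max_epochs=3000):
    """Keep the best network seen so far (the "pocket"); stop once the
    current error doubles the best one, respecting the minimum and
    maximum iteration counts stated above."""
    best_net, best_err, best_epoch = copy.deepcopy(net), float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(net)
        err = validation_error(net)
        if err < best_err:
            # Pocket update: snapshot the improved network.
            best_net, best_err, best_epoch = copy.deepcopy(net), err, epoch
        if epoch >= min_epochs and err > 2.0 * best_err:
            break
    return best_net, best_err, best_epoch
```

Whatever happens after the best epoch, the pocketed copy is the one returned.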
Finally, the system needs to be tested, and a scan of parameters is done for this purpose. The first parameter is the starting date of the series. The series are available from January 2000 to August 2013, as September 2013 and later is part of the validation 2 dataset. Starting dates from January of all the odd years from 2000 to 2012, and from 2011 and 2013, are initially used, together with dates starting in July of the years following 2010. The size of the sliding window is another important parameter to scan, which also sets the size of the input layer. The initial set of values checked here goes from 5 to 200, in order to get an initial idea and proceed with more concrete values afterwards. The next interesting parameter is the size of the hidden layer, which affects the topology of the network. The values used here are the same as for the sliding window and, again, further experiments are performed for the values close to the best results. Another variable affecting the performance of the network is the learning rate, which is initially analyzed from 0.001 to 0.5. Given that this parameter strongly depends on the size of the network, which in this case is determined by the sliding window and hidden layer sizes, new scans have to be done once the range of these two parameters is narrowed. The last parameter to analyze is the momentum of the network. Not so many options are needed here, so the starting values are 0.0, 0.05, 0.2 and 0.4.
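The scan itself is a plain Cartesian product of the parameter ranges. A sketch of such a grid follows; the listed values are examples of the ranges described above, not the project's exhaustive sets:

```python
from itertools import product

# Illustrative values only, mirroring the ranges described in the text.
start_years    = [2001, 2003, 2005, 2007, 2009, 2011, 2013]
window_sizes   = [5, 10, 20, 50, 100, 200]   # also the input layer size
hidden_sizes   = [5, 10, 20, 50, 100, 200]
learning_rates = [0.001, 0.01, 0.1, 0.5]
momentums      = [0.0, 0.05, 0.2, 0.4]

grid = list(product(start_years, window_sizes, hidden_sizes,
                    learning_rates, momentums))
```

Each tuple of `grid` would correspond to one network training run.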
4.2 Variants of the model
Now that the basic architecture of the network is understood, some different modifications will be presented before going on with the pertinent results. First, a change in the input of the network is presented, then an alternative for the output, and then a modification of the training process is explained, changing the order in which the data is given to the model. Finally, in a new section, a hybrid model is presented as an attempt to put together the main advantages of both learning styles, with two slightly different alternatives.
4.2.1 Sliding window vs Averages as inputs
The easiest of the proposed modifications affects the input information passed to the network. In the basic model, the input was a sliding window taken directly from the original series. The main problem this method presents is that, in order to recognize a new pattern with high confidence, an almost identical one should have been used for training, which is very difficult given the noise present in financial series. Another way of seeing it is that, this way, the network is learning the data by heart, which makes it difficult to generalize afterwards.
The proposed alternative is to use averages instead of the raw values from the series, with the objective of learning the tendency of the series rather than the numbers themselves. Instead of having the window size as one of the variables of the system, this variable is a vector where each element represents the amount of values used to calculate each of the averages used as an input, always starting from the value right before the one to be predicted. For instance, the vector {9,6,3,1} means that the first element of the input layer is the average of the last 9 elements, the second the average of the last 6, the third the average of the last 3, and the last one the average of the last single element, in other words, the last element itself, as can be appreciated in Figure 6.
Figure 6: Gathering of information to generate four inputs in a {9,6,3,1} averages model.
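The construction of the averages input described above can be sketched as follows (illustrative Python; `spans` plays the role of the {9,6,3,1} vector):

```python
def averages_input(series, spans=(9, 6, 3, 1)):
    """One input per span: the average of the last `span` values of
    the series, taken right before the value to be predicted. The
    final span of 1 is simply the most recent value itself."""
    return [sum(series[-s:]) / s for s in spans]

history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
inputs = averages_input(history)   # averages of the last 9, 6, 3 and 1
```

With this nine-value history the four inputs are 5.0, 6.5, 8.0 and 9.0.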
In the previous picture it can be seen that the size of the input layer in this case would be four neurons, each of them denoted as i#, where the hash represents its number, containing the averages of the values encompassed in the figure. To get a better understanding of the difference between both methods, a series is represented with both of them in Figure 7, where the averages model uses a vector {20,15,10,5}.
Figure 7: Comparison of the fluctuations of the same data represented as raw values and as averages of the last values from t-20 to t.
The figure shows the values of a series' fluctuations over the last twenty days, both as raw values and as averages of 20, 15, 10 and 5. The representation using averages is more general, but contains less information, and is hence easier to learn.
4.2.2 Three-class classifier vs Forecasting model
The next interesting change to the basic model affects the output of the network: in the basic model, a binary vector represented whether the market went up, went down or kept its value the day after. The alternative approach consists of replacing this output layer of three neurons with a layer of a single neuron, which contains the real value provided by the financial series. The main benefit of this resides in the fact that, with only one output, there is no reduction of information in the model. In other words, the model with three outputs considers a rise of 0.7% and a rise of 5% as the same, when the real repercussion caused by the second is much higher than that caused by the first. Conversely, a slight difference between two similar values, such as 0.64% and 0.65%, which are pretty much the same, can be considered as two completely different outputs. An example of the different types of outputs can be seen in Table 1.
Date Current value Trend class Forecast
2013-01-31 -2.445 {1,0,0} -1.585
2013-02-01 -1.585 {1,0,0} -3.768
2013-02-04 -3.768 {0,0,1} 2.197
2013-02-05 2.197 {0,1,0} -0.462
2013-02-06 -0.462 {0,1,0} -0.516
2013-02-07 -0.516 {0,0,1} 2.000
2013-02-08 2.000 {1,0,0} -1.177
2013-02-11 -1.177 {0,0,1} 1.932
2013-02-12 1.932 {0,0,1} 0.868
2013-02-13 0.868 {1,0,0} -0.707
2013-02-14 -0.707 {1,0,0} -1.178
2013-02-15 -1.178 {0,1,0} -0.506
2013-02-18 -0.506 ? ?
Table 1: Example of the different outputs for the IBEX35 series over the period from 2013-01-31 to 2013-02-18.
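The trend classes in Table 1 encode the sign and magnitude of the next day's change. A sketch of such a mapping follows; the width of the keep band is an assumption chosen so that the function reproduces the table's rows, since the exact cutoff is not stated in this section:

```python
def trend_class(next_change, keep_band=0.6):
    """One-hot {down, keep, up} label from the next value of the
    series; changes inside the keep band count as "keep".
    The 0.6 band is illustrative, not the project's actual cutoff."""
    if next_change > keep_band:
        return [0, 0, 1]   # market goes up
    if next_change < -keep_band:
        return [1, 0, 0]   # market goes down
    return [0, 1, 0]       # market keeps its value
```

The forecasting variant simply drops this discretization and targets `next_change` directly.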
This modification entails two main changes in the neural network apart from the topology. One of them is the activation function of the output layer, which until now was a softmax; as the values no longer need to tend to a discretization but are continuous instead, the activation function is now linear. The other change regards the loss function, which for the classifier model was the cross-entropy; as there is only one value now, this function no longer makes much sense, so it is changed to the mean squared error (MSE).
With only one output the problem becomes a forecasting model instead of a classification into three classes as before. As mentioned previously, the principal advantage is the different importance of each value for training, which allows strong and weak tendencies to be distinguished, but there is also a negative side, predominantly concerning two problems. The first one is merely technical: a forecasting model is not as stable to train as a classification problem. A forecasting model is more likely to diverge, mostly when high learning rates are used, but it is also not guaranteed to converge when smaller rates are used. As lots of different experiments are run, it can sometimes be very difficult to know whether the network has converged enough, since with the highly noisy nature of the data a random network can easily provide decent results that lead to confusion. The second problem faced with this method is more practical, and resides in the fact that the highest short-term peaks of the stock market are normally caused by important news, which is in fact unpredictable. This means that the samples that have the greatest impact on the system are the ones that probably should not be learned by it, although they are not abundant.
4.2.3 Traditional MLP vs Model with data overlapping
This last modification regards the organization and the order in which the data is given to the system to train, validate and test. It arises from the idea of the different contexts that may hold throughout a series. Social, economic and historical features are very different nowadays than before 2005, for instance, mostly with an economic crisis in between, which makes markets behave differently. For this reason, the objective of this modification is to train with data chronologically closer to the data that is going to be predicted, in order to reduce the difference of contexts.
In the regular MLP model explained up to this point, the second validation data comprised from 1st September 2013 to 31st March 2014, the first validation data was the most recent 25% of the remaining samples, and the training data was the remaining, oldest 75%. When the series are long, and they can be as long as 14 years, a big gap exists between the data used to train the network and the data used for the second validation or the test. Concretely, starting in January 2000, the last data used for training is from the beginning of 2010, which leaves more than three years used for the first validation; that basically means predicting data of 2014 with a network trained on data older than 2010. This is an extreme example where the easy solution would be to simply reduce the size of the series, as so much data is probably not needed, but even with reduced data the same problem would appear on a smaller scale.
The proposed solution is to avoid the first validation dataset in order to bring the training and testing datasets closer. To do so, a model where only one dataset exists is proposed, and the network iterates over it in chronological order. A given sample is used to test the network, and in the next iteration it is used to train it, while the network is tested with the following one. With this method, each single sample is used first to test the network and afterwards to train it, so that no sample that has been used for training is ever tested. For instance, assuming a new iteration, the first thing to do is to use the current sample t-1 to train the network, and immediately after, sample t is used to test it. In the next iteration, t is used to train the network and t+1 to test it. This sequence continues until the last value of the series has been used for testing. The errors are calculated exactly the same way as before, with the only difference that in the data-overlapping model they are calculated while the training is being done. Also, the overall splitting of the data is kept, meaning that until 31st March 2014 the samples are used to train and validate, and from April 2014 onwards the samples are used to train and test. As one longer continuous dataset is used for the second validation and the testing of the data, the only difference is that the best results until March 2014 are picked, and there is no choice from April onwards.
A simple outline of this process can be seen in Figure 8, replacing the regular
method shown in Figure 5.
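The test-then-train walk over the single dataset can be sketched as follows (hedged Python; `train_on` and `predict` abstract the real network update and evaluation):

```python
def overlapping_walk(samples, train_on, predict):
    """At each step, train on sample t-1 and then test on the still
    unseen sample t, so no sample is ever tested after having been
    used for training."""
    predictions = []
    for t in range(1, len(samples)):
        train_on(samples[t - 1])              # learn the previous sample
        predictions.append(predict(samples[t]))  # test the next one
    return predictions
```

With a stub network that predicts the running mean of everything it has trained on, the first prediction uses one sample, the second two, and so on.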
Figure 8: Scheme showing the proposed method with overlapping of data in one single dataset.
The main advantage of this model is the absolute utilization of the data to train the network, plus the fact that, by learning all the samples in chronological order, the more recent a sample is, the more impact it has on the system, so that it forgets old samples by learning new ones and modifying the system accordingly. On the other hand, there is a big disadvantage: underfitting, which will be explained in detail in section 7.2 and can occur because the network uses each sample for training only once during the whole process. This can be patched up by increasing the learning rate or by using an adequate number of iterations determined by the length of the series, so that the network iterates the correct number of times. Both the learning rate and the series' length are parameters to scan and analyze afterwards, as will be seen in later sections.
4.3 Hybrid model with overlapping of data
Up to this point, different modifications with slight differences from the initial model, a traditional multilayer perceptron, have been explained. The most uncommon model is probably the one with the overlapping of data, which does not use the typical data splitting that neural networks normally employ, adding the
small advantage of seizing the data better than regular models at the expense of the huge disadvantage of data underfitting. In order to abate this, two new models will be presented with the purpose of avoiding the underfitting problem, but without losing the advantageous use of the data.
4.3.1 Explicit validation dataset
The first alternative is also the simplest one, based on the overlapping model: a full training of a network is done for each new available sample. Another way of understanding it is to start from the basic model and use only one sample as the second validation dataset, instead of the 150 samples used before, iterating over the whole old set. Once this prediction is performed, all the datasets advance one sample in time, so that the predicted sample is now used as the last one of the validation dataset, while the following sample is the one to be predicted next. With this model, a completely new artificial neural network is created for each sample to be predicted, each having a different number for the best training epoch, as this depends on the samples of the datasets, which are modified each time. The total number of samples used for the prediction of each tendency remains constant over the predictions, as Figure 9 shows. The splitting into training and first validation datasets is kept at 75% and 25%, as in previous models, and the starting date is considered a network input parameter as well.
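The per-sample retraining can be sketched as follows (illustrative Python; `train_and_predict` stands in for one complete training of a fresh network on the given history):

```python
def rolling_retrain(series, horizon, train_and_predict):
    """For each of the last `horizon` samples, retrain a brand-new
    network on everything before that sample and predict it, so the
    datasets advance one sample in time per prediction."""
    predictions = []
    for t in range(len(series) - horizon, len(series)):
        predictions.append(train_and_predict(series[:t], series[t]))
    return predictions
```

A naive "predict the last seen value" stub already illustrates the data flow: each prediction only ever sees strictly older samples.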
Figure 9: Training methodology of the model with validation dataset.
There is an obvious disadvantage in comparison with previous models, which is that the time spent by the model to predict the samples increases considerably, as now one network is trained and used to predict each sample. In this case, where 150 samples are available along the validation dataset, the time spent by previous models gets multiplied by 150. Due to the nature of the problem, where only one more sample is available per weekday, this is not a big issue, as between the closing time and the opening time of the following day there is plenty of time for training the new models and predicting the new tendencies. However, the process of looking for the correct parameters is very expensive in computational terms, taking around 150 times longer than in previous models, where the whole second validation dataset was predicted with the same trained network. A positive side of this problem is that, because this model is considerably similar to the previous ones, the scan of parameters does not need to be very wide, as the ideal parameters for the other models are already known. Hence, the scan of parameters shortens, with its corresponding reduction of time.
4.3.2 Implicit validation dataset
The last model to analyze is an evolution of the previous one: the hybrid model with validation dataset, with certain characteristics of the overlapping model. The main idea is to use a training dataset chronologically as close as possible to the sample to be predicted each time, by moving the validation 1 dataset used in previous models. If it were simply removed, the problem faced would be that the stopping criterion would be undefined, as it is set according to this validation dataset. The proposed solution is to set a fixed number of training iterations for each sample's prediction, determined by the best epochs obtained in previous full trainings with a validation 1 dataset.
The prediction of a given sample is performed as follows: first, a network is trained using both the training and validation 1 datasets in order to predict the sample immediately after the validation dataset, as was done with the previous model; then the number of the best epoch is kept and the trained network completely discarded; later, a training dataset of the same length as the one used before is taken from immediately before the sample to be predicted, and the training is performed with the same parameters during the stored number of epochs; finally, the resultant trained network is used to predict the sample. Figure 10 shows the process of this training method in detail, where the first part of each iteration is used to get the number of the best epoch and the second to train the actual network for a fixed number of training epochs.
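The two phases of each prediction can be sketched as follows (a hedged Python outline; both callables abstract full APRIL-ANN trainings and are not real functions of the toolkit):

```python
def implicit_validation_predict(series, t, best_epoch_of, train_for):
    """Predict sample `t` in two phases: first run a validated training
    only to obtain the best-epoch number, discarding that network;
    then retrain on the window right before `t` for exactly that many
    epochs and use the resulting network to predict."""
    n_epochs = best_epoch_of(series[:t])       # phase 1: validated run
    network = train_for(series[:t], n_epochs)  # phase 2: fixed epochs
    return network(series[t])
```

The point of the second phase is that the training window can end right before the sample, since no validation split is needed any more.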
Figure 10: Training methodology of the hybrid model with implicit validation dataset.
A problem that might appear with this method is that sometimes, when a network is trained, the number of the best epoch can be one, meaning that the training has not improved on the initial random model. If the parameters used are correct this is not a common problem, but it can still happen. The proposed solution is to use more than one previous training to determine the fixed number of epochs, by calculating their average. The number of old best epochs used to calculate the average is seven, the last consecutive ones.
When the best epoch of a training equals one, it drags the average down and can sometimes have a great impact on the resulting number of epochs. To solve this, the lowest of the seven epochs is removed from the average, removing the greatest as well so that the average does not become unbalanced. At the start of the series, the average of the first networks is used as the number of training epochs for the same number of networks, because no previous data is available. Table 2 shows an example of a series where the number used to calculate the average is five, three after the removal of the lowest and greatest best epochs.
Sample number 1 2 3 4 5 6 7 8 9 10 11 12 13
Training BE 18 25 20 19 17 24 29 14 25 1 21 19 20
Iterations 19 19 19 19 19 21 21 20 22 21 20 18 21
Table 2: Example of the number of iterations calculated out of the best epochs of the previous 5 samples.
The table shows the resulting number of iterations out of the last 5 samples. For instance, for sample number 10 the values 24, 29, 14, 25 and 1 are available. Removing the greatest and lowest, which are 29 and 1 respectively, the values 14, 24 and 25 remain. Calculating their average, the obtained number of iterations is 21, which is the value set for the training of the network used to predict sample number 10.
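The trimmed average of Table 2 can be computed as in the sketch below (plain Python; the function name is my own):

```python
def fixed_iterations(best_epochs):
    """Average of the recent best epochs after removing the lowest and
    the greatest, rounded to an integer number of training epochs."""
    trimmed = sorted(best_epochs)[1:-1]   # drop min and max
    return round(sum(trimmed) / len(trimmed))
```

For sample number 10 of Table 2 the available best epochs are 24, 29, 14, 25 and 1; after dropping 29 and 1, the average of 14, 24 and 25 gives 21 iterations, matching the table.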
It is important to mention that this model takes on average almost twice as long as the hybrid with a validation dataset, which was already taking 150 times longer than the initial models. It first needs to train the networks exactly the same way as the previous model did, and afterwards train a new network with different samples up to the best epoch of the previous training. When searching for the best epoch, the training keeps iterating even when the errors obtained are worse than the best one, for at least 50% more iterations than the current best epoch's number. This means, for instance, that in a training where the best epoch is reached at iteration number 600, the network iterates another 300 epochs, until epoch 900, and if the best epoch is still number 600, it then stops. During the training of the actual network, only 600 iterations would need to be performed, a considerable saving of time depending on the case. In general terms, it can be said that this second hybrid method is approximately 70% more expensive than the first one.
5 Results
In this section, a comparison of the different alternative models is shown, starting from the basic system's results. Note that, as mentioned in previous sections, the validation 2 dataset comprising September 2013 to March 2014 is used to measure each system and modification, and further analysis is done in order to check the networks against unknown future values of the series.
The first thing needed is a baseline for the results, so several random walks were generated for the series. For each single day of the validation 2 period, one action has been randomly picked with equal probability out of buy, sell or keep. Table 3 shows information on ten executed random walks, considering as the number of actions the sum of both buy and sell actions, excluding the keep ones.
# Final money Actions Benefit/action Success rate
1 $84.03 88 -0.192% 42%
2 $97.94 102 -0.015% 47.1%
3 $108.88 95 0.09% 50.5%
4 $113.07 92 0.14% 51.1%
5 $107.28 107 0.072% 56.1%
6 $90.85 103 -0.087% 44.7%
7 $108.97 100 0.091% 45%
8 $107.76 91 0.088% 54.9%
9 $93.84 109 -0.053% 46.8%
10 $100.26 94 0.009% 45.7%
Average $101.288 98.1 0.000143% 48.39%
Table 3: Execution of ten independent random walks showing the final amount of money, number of movements, ratio of benefit per action and success rate, together with the average of them all.
In terms of the money obtained, the table shows that on average, using a random walk strategy, the benefit after 98 actions would be $1.288, which is not very good. The best random walk (number 4) obtained a benefit of $13.07, with a rate of 0.14% per action. On the other hand, the worst was number 1, with a total loss of $15.97, meaning an average loss of 0.192% per action. The median execution, number 10, is also the closest to the average, remaining very close to the initial sum of money with $100.26. Figure 11 shows a comparison of the best, worst and median random walk executions together with the fluctuation of the original series itself.
Figure 11: Comparison of the best, worst and median random walks against the original series.
In order to get a better idea, another 100 random walks were executed, showing an average final amount of $99.08 with a standard deviation of 10.66. This reinforces the idea that the series is biased neither to win nor to lose money, but to maintain its value.
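The baseline just described can be reproduced with a short simulation (a sketch: daily percentage changes and a uniform action draw, as in the text; broker commissions are ignored):

```python
import random

def simulate(changes, actions, start_money=100.0):
    """Apply a sequence of buy/sell/keep actions to a series of daily
    percentage changes: buying earns the change, selling earns its
    negation, keeping does nothing. Returns the final money, the
    number of buy/sell movements, and the number of successful ones."""
    money, moves, hits = start_money, 0, 0
    for change, action in zip(changes, actions):
        if action == "keep":
            continue
        moves += 1
        gain = change if action == "buy" else -change
        money *= 1.0 + gain / 100.0
        hits += gain > 0
    return money, moves, hits

# A random walk simply draws the actions uniformly for each day:
rng = random.Random(0)
walk = [rng.choice(("buy", "sell", "keep")) for _ in range(150)]
```

Only buy and sell count as movements, matching how the number of actions is reported in Table 3.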
5.1 Basic model
As mentioned in previous sections, a scan of parameters is performed, generating a big amount of experiment runs. An easy and quick solution would be to choose the experiment that has made the maximum amount of money without any kind of bound on the output, which in the present experimentation consists of using the series starting in January 2004, a window size of 140, a hidden layer of 10 neurons, a learning rate of 0.35 and no momentum. This configuration managed to obtain $136.8 out of the initial $100 in 150 movements, with a success factor of 55.3%. The problem is the lack of stability of the results under similar parameters. For instance, modifying the momentum, which is probably the parameter that affects the system the least, from 0 to 0.1, the benefit of $36.8 turns into a loss of $22.2 of the initial money, dropping the quantity to $77.8. This means that the reliability of the result is very poor, and that it was obtained quite randomly, without learning much from the series. Analogously to the best result in terms of absolute money, a
maximum benefit rate of 1.1% per movement was obtained, as well as a success factor of 100% in other experiments, but none of these experiments are relevant, for the same reason as explained before.
The objective of the analysis is to find a cluster of experiments with similar parameters and decent results, in order to give some reliability to the parameters used. But before that, a simple postprocess has to be applied to the results, consisting of considering just the top x percent most confident actions of each experiment, as explained in the post-process section. In Table 4, different top percentages are compared for the same experiment (window size 140, hidden layer size 35, learning rate 0.05, and momentum 0) with good results, together with their final effect on the initial amount of $100, the average ratio of benefit per movement and the cross-entropy error of the set:
Top percentage Final money Actions Benefit/action Error
All actions $93.9 88 -0.06% 1.10
80% $98.2 74 -0.02% 1.09
70% $100.3 65 0.01% 1.09
60% $109.4 57 0.16% 1.08
50% $114.7 48 0.29% 1.08
40% $114.5 45 0.31% 1.06
30% $120.0 37 0.5% 1.00
20% $117.1 29 0.55% 0.95
15% $116.5 21 0.73% 0.87
10% $114.5 15 0.91% 0.81
5% $110.4 7 1.43% 0.61
Table 4: Final amount of money and average ratio of benefit per action for different filtered top percentages applied to the same experiment.
Table 4 illustrates that a greater final amount of money is not always a better result, nor is a higher ratio of benefit per action. Performing 100% of the predicted actions in this example (all the buy or sell actions out of the 150 days), there would be a loss of $6.1. Using the top 10%, the final amount of money would be the same as using the top 40%, but the average gains are different: 0.91% against 0.31%. Even though the amounts of money are the same, the top 10 percent is clearly more convenient considering the commission charged by the brokers, explained at the beginning of this document. Also, as a lower number of actions is required, a higher benefit per action is reached, meaning that less risk is taken. The highest ratio per action was reached by the top 5%, but not all the potential of the model would be seized; using the top 10% or 15% there is a lower ratio, but more actions are taken into account, generating a higher amount of money, which is probably worth, at least, consideration. If the decision were to invest uniquely in this series, a higher percentage of actions would be more advisable. For instance the top 30%, which has a good ratio of benefit per action using a decent number of actions that would
increase the money without investing too much or too little, and has managed to gain the highest amount of money among the tested top percentages, $120.0.
If more series are taken into account, tighter top percentages would be better options, as at every moment one action per series would be available, meaning that just by using the top actions of each series, a high number of actions would be performed over time among all the considered series. Figure 12 shows the effect of each top percentage over time on an initial amount of $100:
Figure 12: Timeline showing the behavior of applying different top percentages to the same results file through time.
Finally, as the best percentage for one single series was between 20 and 40%, it was decided to use the top 35% of the actions, and the first consequence is an increase of the average benefit per action, as expected. The experiment that obtained the highest amount of money with the top 35% got a total of $125.8 in 52 movements. This is the best result so far, as the best execution using the total of the actions managed to end up with $136.8 in 150 movements; around one third more money with three times more movements. The parameters used for the current best experiment are: starting date, January 2011; window size, 80; hidden layer size, 35; learning rate, 0.35; and a momentum of 0.005.
If, instead of picking the best experiment, the set of results is pruned little by little until just a few good results are left, the set ends up with the following constraints: starting date, only January 2012; window size,
between 90 and 100; hidden layer size, between 15 and 45; learning rate, lower than or equal to 0.05; and momentum, lower than 0.1. These constraints offer a set of more than 70 experiments, where even the poorest performance ended up with a gain, $104. The final parameters used are not the ones that performed the best, but the ones in the middle of the set of best performers, which are the following: window size, 100; hidden layer size, 15; learning rate, 0.05; momentum, 0.005. These parameters obtained a total of $109.1 in 11 movements, meaning a rate of 0.8% and a success rate of 81.8%; very good results.
Now the results need to be tested in order to know the real potential obtained, and for this purpose a test dataset is available, comprising 1st April 2014 to 31st July 2014. Also, the length of the series used for training is kept constant, meaning in this case that instead of starting the series on 1st January 2012, it starts on 1st August 2012, as there is a 7-month lag between both datasets. Another thing to take into account is that in the previous validation dataset 150 days were available in total, while in this test dataset the amount has shrunk to 87 days, as the dataset has shortened from 7 to 4 months.
When the parameters are applied to a network generated for the test dataset, only one action out of the 87 possible ones is carried out, on 29th April, where the predicted action was to buy and the market went up 0.44%, meaning a final amount of money of $100.44 and, obviously, a success rate of 100%. This result is rather disappointing, as more actions were expected, although the benefit per action is still fairly good. After this, no further experiments with the basic model will be done, as the hybrid systems are expected to be more powerful than the present model, so the effort will be put into them.
5.2 Averages as inputs
The first of the proposed modifications was the substitution of the sliding window
by averages of recent values, as explained in section 4.2.1. Coming
to the execution of these networks, the first striking aspect of these experiments
is the speed improvement when training. While with the sliding window the
input layer could have sizes of up to 140 neurons, with this model it will
rarely have more than ten neurons. This is a huge reduction
of the computational time, as every hidden layer neuron is connected to all
the input layer ones. On the other hand, the pre-processing of the data takes a
little longer, as the averages have to be calculated, but this time is far less
than the time saved during training. Furthermore, this does not have to be done
for each single training of the networks: one pre-processing of the data is needed
per input pattern, after which many networks can be trained with different
parameters.
The scan of parameters is pretty much the same as in the previous model,
but instead of using a single number for the window size, a list of numbers is
now used, with patterns such as 50-30-20-15-10-5-4-3-2-1, 20-15-10-5, 20-15-10-5-3-1, 100-
80-70-60-50-40-30-25-20-15-10-5 or 25-20-15-10-5. Other patterns checked were
50-49-48-...-3-2-1 with different lengths instead of 50. However, the best results
were obtained by the simplest patterns, like the multiples of 5 up to 20 or 25.
Multiples of other numbers were tried, but the results were not better than with
5, so most of the experiments stuck to this kind of pattern.
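As an illustrative sketch of how such inputs might be built, assuming (as section 4.2.1 suggests) that each number in a pattern denotes the length of a trailing average over the latest values of the series:

```python
def pattern_inputs(series, pattern=(25, 20, 15, 10, 5)):
    """Build the network input for the most recent day as trailing
    averages of the given lengths (assumed reading of patterns such
    as 25-20-15-10-5). Returns one input value per pattern entry."""
    inputs = []
    for length in pattern:
        window = series[-length:]            # last `length` values
        inputs.append(sum(window) / length)  # their average
    return inputs
```

With a pattern like 25-20-15-10-5, the input layer shrinks to just five neurons, which explains the large speed-up over a 140-value sliding window noted above.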
First of all, it is important to note that the best experiment using the total of
the actions used the following parameters: starting date, January 2011; pattern,
100-80-70-60-50-40-30-25-20-15-10-5; hidden layer size, 20; learning rate,
0.5; momentum, 0.005. It managed to get a total of $134.4 in 87 movements.
Again, performing the total of the actions is not desired, so only the 35%
most confident actions are taken into account.
Using the top 35% of the actions, the parameters that performed best
are: starting date, January 2008; pattern, 30-25-20-15-10-5; hidden layer
size, 100; learning rate, 0.5; momentum, 0.02. The final amount of money was
$138.8 in 56 movements. Note that, even though the top 35% of 150 actions is
considered, which would suppose a maximum of 53, 56 actions are performed. This
is due to the fact that different actions might have the exact same confidence,
and when this happens at the cut-off action that separates the ones considered
from the ones discarded, all the equally confident actions are considered as inside
the threshold.
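The tie-inclusive cut-off just described can be sketched as follows; the helper and its input format (a list of dictionaries with a `confidence` field) are illustrative assumptions:

```python
def top_confident(actions, fraction=0.35):
    """Keep the `fraction` most confident actions, extending past the
    nominal cut-off to include any action tied in confidence with the
    last one kept (hence 56 actions instead of the expected 53)."""
    ranked = sorted(actions, key=lambda a: a["confidence"], reverse=True)
    cutoff = max(1, round(len(ranked) * fraction))
    threshold = ranked[cutoff - 1]["confidence"]  # confidence at the cut
    return [a for a in ranked if a["confidence"] >= threshold]
```

If the action at the cut-off shares its confidence with actions just below it, all of them pass the filter, which is how more than the nominal 35% can end up being performed.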
Similarly to the basic model, the very best experiment will probably not be
the most convenient, so another pruning process is carried out over the present
set of results. Finally, a good bunch of results is obtained using the following
boundaries: starting date, January 2008; patterns, 25-20-15-10-5 and 40-39-38-
37-...-4-3-2-1; hidden layer sizes, 100 or lower; learning rates of 0.01 and 0.05;
momentums of 0.0, 0.005 and 0.02. Note how two extremely different input
patterns provide the best results, while patterns very similar to both of them
were discarded because their results were not as good. The remaining set is
formed by more than 50 experiments, and excluding two isolated results that
ended up with a loss of almost $10, the experiments' gains oscillate between
$102.1 and $119.9, with an average of $109.27 including them all.
The best result is generated by the following parameters: starting date,
January 2008; pattern, 40-39-38-37-...-4-3-2-1; hidden layer size, 100;
a learning rate of 0.05 and a momentum of 0.005. The experiment
ended up with a total of $115.1 out of the initial $100, obtained in 45 movements,
and with a cross-entropy error of 1.05. If the experiment is replicated with the
test dataset, the results obtained are: a final amount of money of $106.82 in 24
movements, a 0.28% benefit per action, a 66.7% success rate and a cross-entropy
error of 1.072, quite consistent with the results obtained during the validation
period.
5.3 Forecasting model
In this section the results obtained with the second modification will be
presented, consisting in the replacement of the three output neurons by a single
one, changing the model from a classifier to a forecasting model, as explained
in section 4.2.2. The input of the model is a sliding window, as the
modifications are made to the basic model.
The scan of parameters is done the same way as for the initial model, as the
only difference is the output, which cannot be changed during the experiment.
As this model is not as easy to train as the previous ones, being more prone to
diverge, smaller learning rates will be used. Instead of a smallest learning
rate of 0.001 as before, for these experiments the minimum value of this
parameter is ten times lower, the range going from 0.0001 to 0.1.
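Assuming the forecasting samples pair a sliding window of past daily changes with the next day's change as the single numeric target, the dataset construction might be sketched as below; this is an illustrative reconstruction of the setup referenced in section 4.2.2, not the project's code:

```python
def forecasting_samples(changes, window=35):
    """Turn a series of daily changes into (input, target) pairs for
    the one-output forecasting model: each input is a sliding window
    of past changes, each target the change of the following day."""
    samples = []
    for i in range(window, len(changes)):
        samples.append((changes[i - window:i], changes[i]))
    return samples
```

The network then regresses the single target value, and the sign of the prediction decides between buying and selling, which is why no keep action exists in this model.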
After all the experiments have been run, the first remarkable fact is that, to
the naked eye, the set of results is more prone to gain money, unlike the results
of previous models, which tended to remain around the initial amount
of money without any clear tendency. In this set of experiments, the parameters
that performed best in terms of final amount of money are the following:
starting date, January 2013; window size, 35; hidden layer size, 25 neurons;
learning rate of 0.1 and no momentum, with a total of $144.4 in the 150 movements,
an MSE of 0.5584 and a success rate of 58%. In the experiments of this
model, all the initial results perform the total of 150 actions, as there is
no keep action available, just the numerical prediction of the market
going up or down, increasing the number of actions performed in comparison to
classifier models.
Taking into account only the top 35% of the actions, the best result is
obtained by: starting date, January 2013; window size, 35; hidden layer size, 25;
learning rate, 0.1; and no momentum. With a total of $139.6 in 143 movements,
it can be said that this is a very poor result: firstly because the highest learning
rate is used together with the shortest period of data, making it unlikely to
perform best; and secondly because 143 actions out of 150 are performed
when taking just the top 35%, meaning that more than 100 predictions along
the series have the exact same confidence, which is not a good symptom at all.
When pruning the set of results, 88 good ones are obtained with the following
boundaries: starting date, January 2012; window sizes, from 20 to 35; hidden
layer sizes, from 20 to 80; learning rate, 0.0002; and momentums up to 0.1. The
set of results is excellent: except for one execution that, regardless of its
momentum, ended up with $98.4, the amount of money in the set goes from $103.5
in the worst of the cases to $127.5 in the best of them, slightly better than
the results obtained with the classifiers. On the downside of this selection of
results, it can be noted that the average number of movements in the set is
higher than in previous experiments, mostly 52 movements, and a little more for
a few experiments. An exception is that, for a window size of 35 and a hidden
layer of 50, depending on the momentum, the number of movements is 150, 147, 67
or 77, meaning that no good distinction has been made by using the top 35%.
When it comes to the election of the best parameters, a good option is as
follows: starting date, January 2012; window size, 20; hidden layer size, 20;
learning rate, 0.0002; and momentum, 0.005. These parameters managed to obtain
a total of $118 in 52 movements, meaning an average of 0.32% per action, a
success rate of 57.7%, and an MSE of 0.5.
In comparison to previous models, the number of movements is very high.
In order to mitigate this and try to increase the average benefit per action,
the top 35% will be reduced to the top 20%. Analyzing the same bounded set
of results, the average benefit rates have increased in general; they are all
still gaining money apart from the same one as before, which now is losing
slightly less, ending up with $98.5. The typical number of actions has gone down
from 52 to 30, and the result marked as best now shows: $114 in 30
movements, an MSE of 0.58, a benefit ratio of 0.44% per action and a success rate
of 60%; good results for such a high number of movements.
Applying the best parameters to the test dataset from 1st April to 31st July
2014, the results are rather disappointing, as with the top 20% the final amount
of money is $98.25 in 17 movements, meaning a loss of 0.1% per action. When
the rest of the parameters included in the set of good results are run on the
test dataset, the results do not seem to improve, as now there are more
experiments with losses than with gains. This is due, apart from the problem of
extrapolating results that will be explained in section 7.3, to the fact that
during the training of both models the series' tendency was bullish, strongly
affecting the forecasting model's training by biasing it towards buy actions,
but during the test period it was not. The series' high noise made the learning
very difficult, ending up in a very short range of confidences very close to the
average, which in these cases was positive (bullish series); hence the buy
predictions were abundant.
5.4 Overlapping of data
In the present section the results obtained from the model explained in section
4.2.3, where the regular MLP was replaced by a model with data overlapping,
will be shown and explained. One thing to mention is that the learning rates
used will be bigger than the ones used in previous models, as the maximum
number of training iterations is limited, while it was not for previous models.
Also, note that this modification is applied to the basic model, a classifier
with three outputs and a sliding window as the input of the network.
When analyzing the results, the parameters performing best in terms
of the final amount of money are the following: starting date, January 2013;
window size of 30 values; hidden layer size of 50 neurons; learning rate, 0.25;
and momentum of 0.02. They ended up with a total of $135.3 in 80 actions,
which is a 0.44% benefit rate with a success ratio of 58.8%. With an overall
picture of all the results, taking the top 35% of the actions, it is remarkable
to observe how the amount of money varies depending on the starting date of the
data. One of the problems anticipated a priori was the underfitting of the
data, which would have been overcome by the fact that longer series would
be used, but old data would have been forgotten through the learning of newer
data. The results have demonstrated this claim, as Figure 13 shows:
Figure 13: Fluctuation of the experiments' money obtained from September
2013 to March 2014 considering series starting at different points in time.
As can be seen in the above figure, no big differences can be appreciated with
the naked eye along the different tested starting dates. Networks with series
starting in January 2000 are trained for more than 3500 epochs before the
validation dataset is taken into account, whilst networks trained from January
2013 are trained for barely 150 epochs, their results not being that different.
This is because for older series the network updates its weights with the newer
samples, forgetting older samples. Also, it is proven that networks learn better
when starting from random weights in their connections [14], and due to the high
noise the training on this data is not far from a random initialization,
minimizing the contribution of the training carried out on older samples. Note
again that for all the experiments explained in this document, the networks'
connection weights have been initialized with random values between -0.1 and 0.1.
Using the top 35% most confident movements, the best parameters change
to: starting date, January 2008; window size, 25; hidden layer size of 60 neurons;
learning rate of 0.45 and a momentum of 0.02. These parameters obtained
a total of $121.2 in 27 movements, meaning a very good benefit rate of 0.79%
per movement. As in previous sections, when trying to prune the set of results
in order to minimize the loss, the remaining set is quite big, as 144 results
remain with the following constraints: starting dates of January 2006, 2008
and 2010; window sizes of 60, 80 and 100 values; hidden layer sizes of different
values between 20 and 50; learning rates between 0.08 and 0.12 and the absence
of a momentum. In this set the results are not so great, as they oscillate from a
maximum loss of $0.9 to a maximum benefit of $4.5.
Finally, the chosen best result comes from the following parameters:
starting date, January 2010; window size, 60 values; hidden layer size, 20
neurons; learning rate of 0.1 and a momentum of 0. The final amount of money
obtained is $103.1 in 6 movements, a very good benefit ratio of 0.52%
and a success rate of 83.3%, but with a very low number of actions performed,
only 4% of the 150 possible movements. With this group of parameters, the
network trained for the test set obtained $100.8 in two movements, a benefit
ratio of 0.4% and a success rate of 50%. As with the basic model, the results
are a little poor in terms of the number of actions carried out, although the
benefit ratio is still good. Again, no more experiments will be performed, as
the following hybrid systems are a priori more powerful models.
5.5 Summary
After analyzing the basic model, the two simple modifications and the model
with overlapping of data, some early conclusions can be drawn. First of all, in
Figure 14 a comparison of the best results obtained with the used top percentage
of each model can give an idea of the maximum potential of each experiment.
The top percentage in terms of actions' confidence is 35% for all the
models but the forecasting one, which uses 20%.
Figure 14: Comparison of each experiment's best result through time in
the second validation dataset.
About the previous figure, it must be mentioned that the forecasting model
has the advantage of performing more than 140 actions, as explained in section
5.3, while the basic and averages models perform around 55 movements
and the overlapping one only 6. This means that even though the forecasting
model has obtained the greatest amount of money, other models are probably
better, as their average benefits per action are greater. When it comes to
the chosen experiment of each model, a comparison of the four models can be
seen in Figure 15.
Figure 15: Comparison of the well-generalized results of each of the four basic
models in the second validation dataset.
Using the information shown in the present subsection together with the
previous results of the different models, the forecasting model can be discarded
in favor of the other basic models. Also, the overlapping model's results have
not been bad, a fact that motivated the evolution of the model towards the
explained hybrid models. Both the basic model and the averages-as-inputs model
behaved quite well and seemed to learn some useful patterns.
As a last important point, the comparison of the different models' results
on the test dataset is shown in Figure 16.
Figure 16: Comparison of the chosen parameter configurations of the different
models applied to the test dataset.
5.6 Hybrid models
In this subsection the basic results obtained with both hybrid models will be
explained. These models are obviously more powerful than the model using
overlapping of data, as that one is an extreme simplification of them, doing
only one training step per sample and using a batch size of one. They are also
more powerful than the basic model, as they should be able to perform at
least as well as it, due to the use of more recent data to predict each sample.
After the results obtained and explained in previous sections, it was decided to
discard the forecasting model and analyze these hybrid models with both a
sliding window and averages as inputs.
When running these networks, one of the first things to note is that, in
general, the number of actions performed by both models when short series are
used is very low, generally no more than 5 actions out of the possible 150. To
mitigate this, the threshold used to consider an action bullish or bearish was
decreased from ±0.65% to ±0.60%, and some more experiments were run. This way,
more actions are performed, as the number of keep actions is decreased; for
instance, an increase of 0.62% that would have been considered a keep action is
now considered a buy action. Thresholds lower than 0.6 were tested as well, but
they mostly led to more unstable networks where the total of the actions were
either buy or sell, but not a combination of both with a proper learning of
patterns. In Figure 17 the number of performed movements is shown for different
starting dates, where it is demonstrated how the number of performed movements
increases with older starting dates, tending to a maximum average of around 100,
which is exactly two thirds of all the possible actions.
Figure 17: Number of performed actions against different starting dates of the
series.
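The class assignment driven by the symmetric threshold described above can be sketched as follows; the function name and signature are illustrative, not the project's code:

```python
def label_action(change_pct, threshold=0.60):
    """Classify a daily percentage change into buy/keep/sell using the
    symmetric threshold discussed above (lowered from 0.65 to 0.60 so
    that fewer days fall into the keep class)."""
    if change_pct > threshold:
        return "buy"
    if change_pct < -threshold:
        return "sell"
    return "keep"
```

Under the ±0.60% threshold, a rise of 0.62% becomes a buy label, whereas with ±0.65% it would have remained a keep, exactly the effect used to increase the number of performed actions.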
A new issue that comes up with these models is that the confidence threshold
used in previous models is not so reliable now, as different networks are used to
predict each of the samples. Comparing different networks that have been trained
with different models and a different number of epochs is not that simple. In
general, all the actions will be performed without bounding the confidence, as
not many are normally carried out, and in section 6 a method for choosing when
to perform the action or stay away from the market will be shown, although not
from a technical point of view.
5.6.1 With explicit validation
The first of the hybrid models is explained in section 4.3.1; it is very
similar to the basic model, but generates a new network for each new sample,
using data as recent as possible. As mentioned earlier, the main disadvantage
is the time taken searching for the ideal parameters, one network per sample in
the second validation period. The experimentation was reduced a little by using
parameters not too different from the good ones obtained with the basic model,
in order to finish the experiments in a reasonable amount of time.
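The per-sample retraining loop of this hybrid model could be sketched as below, where `train_net` and `predict` are hypothetical stand-ins for the actual training and inference routines, and `window_len` is the amount of recent data used for each fresh network:

```python
def walk_forward(series, train_net, predict, window_len):
    """Sketch of the explicit-validation hybrid: for every day to be
    predicted, a brand-new network is trained on the most recent data
    only, so each prediction uses data as fresh as possible."""
    predictions = []
    for t in range(window_len, len(series)):
        recent = series[t - window_len:t]   # most recent available data
        net = train_net(recent)             # new network for this sample
        predictions.append(predict(net, recent))
    return predictions
```

The cost is visible in the loop itself: one full training per predicted sample, which is why the parameter scan of this model was deliberately kept close to the good configurations of the basic model.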
The range of parameters used was the following: the thresholds used for
separating the classes were ±0.60% and ±0.65%; the starting date oscillated
between July 2009 and July 2012; window sizes between 40 and 140 values, plus a
few of the best patterns from the averages model; hidden layer sizes between 10
and 100 neurons; learning rates between 0.003 and 0.011 and momentums lower
than or equal to 0.05. The parameters performing the best results in terms of the
final amount of money were the following: starting date, July 2010; window size
of 80 values and hidden layer of 25 neurons; learning rate of 0.003 and a
momentum of 0.005. The average best epoch of this model's executions was 791
epochs, making a total of $149 in 106 movements, a rate of 0.38% per movement
and a success rate of 63.2%.
In total, 794 experiments were run, with pretty good results, as 577
configurations managed to earn money, while 217 ended up with a balance lower
than $100, meaning that 72.77% of the experiments were positive, while
in previous experiments this percentage was closer to 50%. One of the most
influential parameters is the starting date, and according to it the results set
was bounded to keep just the experiments with starting dates of November 2011,
January 2012, and February 2012. Setting the class threshold to ±0.60% as well,
the number of experiments gets reduced to 288, with only 10 of them ending up
with a negative economic balance.
When choosing the best result, a group of experiments got outstanding
results, formed by a window size of 140, hidden layer of 20, momentum of 0.05
and different starting dates and learning rates. Six experiments are in this set,
with the same results: $104.5 in 5 movements, a very good ratio of 0.88% per
action. The ratio is outstanding, but taking a configuration with such a small
number of movements might be risky, mainly when a priori better options show up
in the rest of the set.
Finally, the configuration tagged as best was the one where each parameter
performed its best in the total set. These parameters were as follows: starting
date, January 2012; window size, 80 values; hidden layer size, 60 neurons;
learning rate of 0.07 and a momentum of 0.05. This execution managed to earn a
total of $13.6 over the initial $100 in 32 movements, meaning a benefit rate of
0.4% and a success rate of 65.6%. The average best epoch of the 150 networks
trained for the prediction was 391.6 epochs.
When applying these ideal parameters to the test dataset, the starting date
is moved to August 2012 in order to keep the series length constant, and the
results obtained are as follows: average best epoch, 152.24; a total of $102.52,
meaning a benefit of $2.52 in 5 movements; a benefit rate of 0.505% and a
success rate of 80%, as 4 out of 5 actions were right.
5.6.2 With implicit validation
The last of the experiments corresponds to the second hybrid model, where
the first validation dataset is completely skipped when training the networks
used to predict new values, as explained in section 4.3.2. Apart from the
already explained advantage of using training data which is chronologically
closer to the values being predicted, another positive aspect of the present
model is that the results obtained on the second validation dataset should be
more reliable. This is because the network that minimizes the cross-entropy
error is not used directly; instead, its features are applied to a different set
of data, which increases the importance of both the parameters and the number of
training epochs. Theoretically, this is a good first step to mitigate the
problem of the extrapolation of results present in other models, which will be
explained in section 7.3, as something like a pre-testing phase is being done
before the actual test of the results.
With the current model, the ranges of parameters used for scanning are
pretty much the same as in the hybrid model with explicit validation, as
theoretically the results should not be too different. Similarly to the previous
model's results, very few actions are performed when ±0.65% is the threshold
used to separate the three classes, so again most of the experiments will be
performed using ±0.60% as the divider threshold for deciding the actions.
Coming to the actual results, this model shows fairly good general
results, as 182 out of 204 experiments obtained a positive rate, meaning that
89.2% of the experiments managed to gain some money, while 10.8% ended
up with less than the initial amount. Again, the results show a certain
reliability of this method, as the differences between similar parameter
configurations are very smooth. Also, the average cross-entropy error is low,
the maximum being no more than 1.08, while in other models it was not uncommon
to have samples with extremely high errors, meaning that no convergence was
present.
The results improve when only experiments using the 1st November 2011 as
their starting date are taken into account. In this new set of 50 results, the
worst of the cases managed to get $103.9 in 32 movements, meaning a rate of
0.12% per movement, with a cross-entropy error of 1.0602. In this set, obviously,
100% of the experiments ended up with a money gain, as the minimum gain
was $3.9, while the maximum amount of money was obtained with the following
parameters: starting date, November 2011; window size, 90 values; hidden layer
size, 40 neurons; learning rate of 0.09 and a momentum of 0.05. With these
parameters the cross-entropy error went down to 1.057, and the amount of
money obtained was $112.6 in 22 movements, meaning a benefit ratio of
0.55%. The success rate was 63.6%, and the average best epoch while
training was 294 iterations.
As in previous sections, these parameters need to be used on the test dataset,
with the only difference being the starting date, which moves from November
2011 to June 2012 in order to keep the length of the series. When testing the
series, there is a big drop in the results, as the money obtained was $100.3,
but using only one movement out of the 90 possible actions, which means that
the ratio of benefits per action is not too bad, 0.3% per movement. In this case,
the cross-entropy error went up to 1.063, which is still good, while the average
best epoch was 78 iterations per network trained. The results are not as good
as expected after the optimism generated in the validation period, mainly
because the number of actions is too low, as in some previous cases.
Finally, after all the models have been considered and the results shown,
and as a continuation of the summary presented in section 5.5, it can be said
that in general terms the hybrid models have been more reliable than the basic
ones. Simpler models, such as the basic one or the one modified to use averages
as inputs, have managed to get more money in both the validation and test
datasets, but the hybrid models have managed to keep the cross-entropy error
lower, avoiding irregularities. Also, more networks were taken into account for
each series execution, meaning that isolated good experiments might be obtained
by luck, but this is not likely to happen when considering the average of
150 networks. Lastly, and because of the problem's nature, reliability is
something extremely critical, meaning that the models with the most practical
application would be the hybrid ones; concretely, the one using an implicit
validation dataset, as its results might be more easily trusted due to the use
of an extra series of values for training, which gives something similar to an
extra test phase.
6 Combination of models
Previous sections have shown how one single financial series can be predicted
using different methods, with their corresponding results. It has been
demonstrated that, normally, when not many actions are performed in a series,
the results tend to be better than when performing a lot of them, whether
choosing them by confidence or by narrowing the decision boundaries of the
buy/keep/sell classes. Table 4 clearly illustrates this. But what if more
actions are wanted without losing performance? One of the solutions applicable
in a real case would be the use of more than one series, and this is what is
going to be explained in this section.
The series used to evaluate all the methods of the present document was
Abertis (ABE.MC in Yahoo Finance), as mentioned at its beginning. In order to
expand the experiments, the rest of the series included in the IBEX35 will be
used, as they have close behaviors due to their strong dependence on the Spanish
economy. The idea of this section is to show a simple demonstration of how to
combine different series in a practical way, so the training will not be as deep
as in previous sections, and the parameters used for the networks' training of
the different series will be the same for them all. The list of series
considered, represented by their Yahoo Finance codes, is the following:
ABE.MC BME.MC GAS.MC MAP.MC SAN.MC
ACS.MC CABK.MC GRF.MC MTS.MC SCYR.MC
AMS.MC DIA.MC IAG.MC OHL.MC TEF.MC
ANA.MC ENG.MC IBE.MC POP.MC TL5.MC
BBVA.MC FCC.MC IDR.MC REE.MC TRE.MC
BKIA.MC FER.MC ITX.MC REP.MC VIS.MC
BKT.MC GAM.MC JAZ.MC SAB.MC
Table 5: List of the stock market series' codes considered in this model.
The IBEX35 index is composed of 35 different stocks, but as ABG-P.MC
stocks have not been trading on the Spanish exchange market for more than two
years, the easiest solution was to exclude them, as the corpus of 34 series
is big enough.
When starting the technical part, the first issue to appear comes from the
splitting of data. For the initial series, different thresholds such as ±0.60 or
±0.65 were tested manually, but now problems appear when using the same
threshold for all the series, as a good threshold for one series might mean
something completely different for another. The solution applied for this issue
consists in automatically moving the threshold for each starting date of each
series in order to minimize the standard deviation of the sizes of the three
classes (buy, keep, sell). This is done by a small algorithm that takes the
greatest of the series' differences in absolute value as the starting threshold
and moves it down repeatedly until the standard deviation is minimal. For
instance, assume the simple series shown in Table 6:
Day   Difference
 1       1.5%
 2      -0.6%
 3       0.0%
 4      -1.1%
 5       0.7%
 6       1.9%
 7      -3.6%
 8      -0.9%
 9       4.0%
10       1.2%
Table 6: Example series with daily variations through 10 days.
In the example above a series of ten values is considered, so that is the
number of iterations needed. The algorithm proceeds as Table 7 shows:
Iteration Threshold Buy Keep Sell StD Best StD Best Threshold
0 ±∞ 0 10 0 5.77 5.77 ±∞
1 ±4.0 1 9 0 4.93 4.93 ±4.0
2 ±3.6 1 8 1 4.04 4.04 ±3.6
3 ±1.9 2 7 1 3.21 3.21 ±1.9
4 ±1.5 3 6 1 2.52 2.52 ±1.5
5 ±1.2 4 5 1 2.08 2.08 ±1.2
6 ±1.1 4 4 2 1.15 1.15 ±1.1
7 ±0.9 4 3 3 0.58 0.58 ±0.9
8 ±0.7 5 2 3 1.53 0.58 ±0.9
9 ±0.6 5 1 4 2.08 0.58 ±0.9
10 ±0.0 6 0 4 3.06 0.58 ±0.9
Table 7: Example execution of the algorithm for choosing the classification
threshold of a series according to the standard deviation of the classified
samples.
In the example it can be seen that the threshold producing the best
distribution of data is ±0.9, which brings the standard deviation down to 0.58,
and it would be the one used for this ten-value series.
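The threshold-selection procedure illustrated in Table 7 can be sketched as follows; this is an illustrative reconstruction that assumes a difference whose absolute value equals the threshold counts as buy or sell (which reproduces the counts in the table), with ties resolved in favor of the larger threshold:

```python
import statistics

def best_threshold(diffs):
    """Sweep candidate symmetric thresholds from the largest absolute
    daily difference downwards and keep the one that minimizes the
    sample standard deviation of the buy/keep/sell class sizes."""
    best_std, best_t = float("inf"), None
    for t in sorted({abs(d) for d in diffs}, reverse=True):
        buy = keep = sell = 0
        for d in diffs:
            if d >= t:            # |d| at the threshold counts as buy
                buy += 1
            elif d <= -t:         # ...or sell, on the negative side
                sell += 1
            else:
                keep += 1
        std = statistics.stdev([buy, keep, sell])
        if std < best_std:        # strict '<' keeps the larger threshold on ties
            best_std, best_t = std, t
    return best_t, round(best_std, 2)
```

Running it on the series of Table 6 reproduces the outcome of Table 7: a best threshold of ±0.9 with a standard deviation of 0.58 over the class counts (4, 3, 3).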
Once the splitting of classes is understood, the methodology for choosing the
action to perform will be considered. A few different formulas will be tested,
starting from simple ones, such as just taking the action with the maximum
confidence, to more complex ones, such as using the last 50 actions and
calculating the final money by simulating the series, or the average benefit
ratio. These different approaches will be explained more in depth in the
following subsection.
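One of these selection formulas, simulating the last performed actions of each candidate series and picking the series with the highest simulated final money, might be sketched as below; the series names and the input format (a mapping from ticker to its list of recent per-action returns) are illustrative assumptions:

```python
def pick_series(histories, last_n=50):
    """Choose, among several candidate series, the one whose last
    `last_n` performed actions would have yielded the most money when
    simulated from an initial $100."""
    def simulated_money(returns):
        money = 100.0
        for r in returns[-last_n:]:   # only the most recent actions
            money *= 1.0 + r
        return money
    return max(histories, key=lambda name: simulated_money(histories[name]))
```

The simpler alternative mentioned above, taking the action with the maximum confidence across all series, needs no simulation at all; the formulas differ only in how each candidate series is scored.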
As a last concern of this model's preparation, the training method comes into
play. The same methods used for previous models will be applied here: given a
set of parameters, all the samples forming the validation dataset will be
predicted and their results summarized. The average of these results' summaries
will be used as a measurement for the given set of parameters, in order to
choose the most suitable set of them.