The Use of Neural Networks 
for Tendency Prediction in 
Financial Series 
September 20, 2014 
Estudio del uso de redes neuronales en la predicción de 
tendencias en series de finanzas
Proyecto fin de carrera
Universidad Politécnica de Valencia 
Escuela Técnica Superior de Ingeniería Informática 
Author: Juan Francisco Muñoz Castro 
Director: Salvador España Boquera 
Co-director: Francisco Zamora Martínez 
Abstract 
In the present project, different types of artificial neural networks have been compared in order to analyze their behavior on noisy time series prediction, with the goal of maximizing the benefit obtainable by investing in those series. To do so, a wide range of datasets has been used, containing stock market prices from January 2000 up to September 2014. The starting experiment has been a regular multilayer perceptron using a sliding window of the latest values as the input of the network and three outputs representing the three possible actions: buy, sell or keep. Further experiments have been tested, such as replacing the three-output classifier by a single output, turning the system into a forecasting model, or using different averages of recent values instead of a simple sliding window as the network's input. The use of a single dataset has also been tested, where each sample is first used to test and validate the network and only later, in a new step, to train it, instead of the traditional training-validation-test split of the data. Finally, two new models that exploit all the data have been tested, one with a specific data validation period and the other with an implicit one, which is skipped by pre-training the networks. After comprehensively applying these methods to the time series, a certain degree of predictability was found. Some networks were able to predict the direction of change for the next day with an error rate of around 40%, which in some optimistic cases decreases to about 30% when rejecting examples where the system has low confidence in its prediction. A practical simulation is also described, showing an average gain close to 0.33% while acting only half of the time.
Contents 
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Stock market basics
  1.4 Structure of this report
2 Time series prediction
  2.1 Artificial Neural Network basics
  2.2 Different techniques
  2.3 Proposed approach
3 Experimentation process
  3.1 Tools used
  3.2 Basic strategy and data used
  3.3 Performance measurement
  3.4 Data preprocessing
  3.5 Post-process of the data
4 Models used
  4.1 The basic model
  4.2 Variants of the model
    4.2.1 Sliding window vs Averages as inputs
    4.2.2 Three-class classifier vs Forecasting model
    4.2.3 Traditional MLP vs Model with data overlapping
  4.3 Hybrid model with overlapping of data
    4.3.1 Explicit validation dataset
    4.3.2 Implicit validation dataset
5 Results
  5.1 Basic model
  5.2 Averages as inputs
  5.3 Forecasting model
  5.4 Overlapping of data
  5.5 Summary
  5.6 Hybrid models
    5.6.1 With explicit validation
    5.6.2 With implicit validation
6 Combination of models
  6.1 Results
7 Problems encountered
  7.1 High noise
  7.2 Overfitting and Underfitting
  7.3 Extrapolation of the results
8 Conclusions
9 Future work
1 Introduction 
1.1 Motivation 
Since stock exchanges have existed, they have been one of the most important indicators, or even predictors, of the worldwide economy. With an average daily trading value of 169 billion dollars during 2013 in the New York Stock Exchange alone, this figure shows how important it is for the economy. Because of this, many attempts to predict it have been made, some more successful than others, but never with outstanding results. In fact, the idea that the market is completely unpredictable is widely accepted, mainly because its value is driven by news, which is unpredictable by definition, and this would make future values of the stock market depend exclusively on the present and future, never on the past. This idea is asserted by the efficient-market hypothesis (EMH), which states that stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices.

In contradiction to the EMH, there are two main types of analysis: fundamental analysis, which is the process of looking at a business at the basic financial level; and technical analysis, which is the methodology for forecasting the direction of prices through the study of past market data.

Numerous articles based on technical analysis have been published, several of them using Artificial Neural Networks, which show a certain predictability in these financial series in contrast to the previous statement, and this reinforces the initial motivation of this project. It is in this area that the present project has been developed, trying to predict the tendency of different recent stock market instruments, comparing the results of each technique as well as determining the predictability of the different instruments coming from different scopes.

Apart from merely technical reasons, there is the basic underlying motive of financial gain. A system able to determine the trends of the market with good reliability is an extraordinary tool that many investors and researchers are continuously looking for in the search for high returns on their investments.
1.2 Objectives 
The objective of this project is to experiment with, analyze and explain how different types of Artificial Neural Networks can predict future values of financial series, based on technical analysis, which simply uses the historical prices.

Provided with datasets of daily market data, it will be assumed that one action can be carried out per day at the stock market opening time and that it will be closed at the end of the same day. With this premise, the main objective is to maximize the benefit obtained by investing a given amount of $100, using information such as the ratio of benefit per movement or the percentage of success from a financial angle. From a more technical perspective, the behavior of the parameters that affect the evolution of the Neural Networks will be analyzed, both input parameters and output measurements.
1.3 Stock market basics 
First of all, the definition of a stock exchange will be given, which according to Wikipedia is a form of exchange that provides services for stock brokers and traders to trade stocks, bonds, and other securities. There are two possible ways of taking part in the stock market:

• Buying stocks: the current price of the stock is paid and, whenever this stock is sold, the money worth of that stock is simply given back to the investor; so if the stock has increased its price this difference will be gained, and if it has decreased its price, the difference will be lost.

• Short selling stocks: in this case, the investor is lent stocks that are sold instantly, with the commitment to give these stocks back; therefore it will be necessary to eventually buy them again in order to return them to the lender. In colloquial terms, it can be said that this is a bet that the stock market will go down; the lower it goes, the more benefit the investor gets, but also the higher it goes, the more money will be lost.

• Staying out of the market could be considered a third action, as there is no need to always be actively participating in the market, and this is probably the most important part of investing: knowing when to stay away. This way the money is kept, so there is no risk, but also no possible benefit.

Any non-professional investor can freely buy and sell any kind of instrument using a broker as an intermediary, which nowadays is typically a piece of computer software. There are plenty of programs available online, and they mostly work with commissions, meaning that they keep a small amount of money for each transaction the client makes. This is one of the main obstacles found when someone wants to get hands-on with non-professional investing: the initial negative odds. To have an initial idea of the fees these programs operate with, a standard broker charges around 0.01% of each transaction, whether buying or short selling stocks. On the one hand, in the long term this becomes a large amount of money taken; on the other, a random investing strategy in a market that stays stable in the long term will be very prone to end up with losses.
1.4 Structure of this report 
The present document is divided as follows: in section 1 a brief introduction has been given, together with some basics of the stock market; section 2 explains the basics of time series prediction, mainly regarding neural networks. The experimentation process is explained in section 3 and the models used during this process in section 4, with their results in section 5. A combined model is explained in section 6; an overview of the problems found is given in section 7; and at the end of the document the conclusions and some interesting future work are presented, in sections 8 and 9.
2 Time series prediction 
2.1 Articial Neural Network basics 
Before going through the background of the different approaches, a quick overview of Artificial Neural Networks (ANN) should be given, as they are one of the basic tools common to several approaches. An ANN is a computational model capable of machine learning, generally presented as a system of interconnected neurons which can compute values from inputs. These neurons harbor numerical values and are typically grouped into sets called layers. A minimum of two layers is needed to set up a neural network: one to read the inputs, with one neuron per input value, and another to write the outputs, with one neuron per output as well.

One of the most popular types of network is the multilayer perceptron, where every neuron of a layer is connected in only one direction to every neuron of the following layer, so that each neuron is reached by all the neurons of the predecessor layer and reaches all the neurons of the following layer, if any. Every layer that is not the input or the output one is called a hidden layer, and an ANN can consist of one or more hidden layers. Figure 1 shows the architecture of these multilayer artificial neural networks.

Figure 1: Basic Artificial Neural Network with one hidden layer.
The figure shows an Artificial Neural Network with an input layer X of n neurons, a hidden layer Z with p neurons, and an output layer Y with m neurons. Each single connection carries a weight, shown in the graph as V or W with two subscripts representing the positions of the reached and reaching neurons in their corresponding layers. To compute the neurons' values, the following formula applies to every neuron of every layer, processed in order from the input to the output layers, updating each neuron's value to a after the formula is calculated:

a = f\left(\sum_{i=1}^{m} p_i \cdot w_i + b\right)

where the sum runs over the m neurons of the predecessor layer, p_i are their values, w_i the corresponding connection weights and b a bias term.
In this way, each layer requires the completion of the predecessor layer's computations. Additionally, an activation function f is applied to the weighted sum with the aim of reaching a better or quicker learning. Typical functions are the linear and the sigmoid [6], the latter emulating the behavior of the step function, which would provide a more aggressive learning, as its output is always either 1 or 0. The formula of the sigmoid function is as follows:

\sigma(t) = \frac{1}{1 + e^{-\beta t}}

The greater beta is, the closer the sigmoid is to the ideal step function, but a too large beta will lead to longer computation times. Figure 2 shows the difference between a sigmoid function with beta = 1 and the step function.
Figure 2: Sigmoid (left) and step (right) functions. 
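As a minimal illustration of the formulas above, the following Python sketch (not part of the original APRIL-ANN code; function names and values are made up for the example) computes the output of a single neuron with a sigmoid activation:

```python
import math

def sigmoid(t, beta=1.0):
    # Sigmoid activation: approaches the step function as beta grows.
    return 1.0 / (1.0 + math.exp(-beta * t))

def neuron_output(inputs, weights, bias, beta=1.0):
    # a = f(sum_i p_i * w_i + b), with f being the sigmoid defined above.
    weighted_sum = sum(p * w for p, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum, beta)

# Example: three predecessor neurons feeding one neuron.
print(neuron_output([0.5, -0.2, 0.1], [0.4, 0.3, -0.6], bias=0.05))
```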
As the training of the networks is a bit more complex and is not essential for the understanding of this document, we will not go into too much mathematical detail. Suffice it to say that the most common way to train the network is with backpropagation of errors, starting from the output layer and going back to the input layer, where the gradient of a loss function is calculated with respect to all the weights in the network. This gradient is then used to update the weights of the connections, together with some parameters such as the learning rate or the momentum, which tune the network with the aim of making it more accurate. Further information can be found in plenty of books and articles [4][10][11][13]. The learning rate is a ratio that is multiplied by the gradient to update the weights of the connections. It influences the quality and speed of the training: the greater the learning rate, the quicker the network will learn, but the lower the ratio, the more accurate the training. Figure 3 shows a small learning rate on the left, where the problem converges very slowly, and a learning rate that is too big on the right, where the problem diverges. Both learning rates are applied to the same problem, where the aim is to find the minimum error (x axis), with different results.
Figure 3: Repercussion of a small value for the learning rate (left) and a too 
large one (right) over a training curve. 
The momentum is a parameter that represents what could be called the inertia of the learning, extending the current update in a proportion given by this parameter. A momentum equal to zero does not affect the original learning of the net, while a greater momentum allows it to train faster and might prevent the network from getting stuck in local minima. On the other hand, a momentum that is too big means that the ANN will learn too fast and will probably miss the global minimum that the network is looking for.
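To make the role of both parameters concrete, a minimal sketch of a standard gradient-descent weight update with momentum is shown below; the gradient values are invented for the example, and the update rule is the textbook one, not necessarily the exact variant implemented in APRIL-ANN:

```python
def update_weights(weights, gradients, velocities, learning_rate=0.01, momentum=0.2):
    """One gradient-descent step with momentum:
    velocity = momentum * previous_velocity - learning_rate * gradient
    weight   = weight + velocity
    """
    new_weights, new_velocities = [], []
    for w, g, v in zip(weights, gradients, velocities):
        v = momentum * v - learning_rate * g
        new_velocities.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocities

# Toy example with made-up gradients over two iterations.
w, v = [0.1, -0.3], [0.0, 0.0]
for grad in ([0.5, -0.2], [0.4, -0.1]):
    w, v = update_weights(w, grad, v)
print(w)
```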
The utility of neural networks mainly resides in the fact that they can be used to infer a function from observations. The fields where artificial neural networks can be applied are as varied as pattern recognition, game-playing decision making, spam filtering, sequence recognition and many more.
2.2 Dierent techniques 
For the problem at hand, many approaches have been proposed in order to predict the tendency of markets. In terms of Artificial Neural Networks, most articles focus on Recurrent Neural Networks, a kind of network where connections between units form a directed cycle. This creates an internal state which allows the network to exhibit dynamic temporal behavior [7]. These kinds of networks are suitable for predicting time series, but their main drawback is the difficulty they have in converging, which becomes a bigger problem with highly noisy series such as stock market ones. Different processes have been applied to these networks to improve their results, such as self-organizing maps or grammatical inference [9].
Other techniques that differ from ANNs have been used as well, such as Support Vector Machines [12], Genetic Algorithms [8], or combinations of different models, techniques and approaches in order to maximize the results. Popular models in this area for combining results are boosting and bagging [15], which act as an add-on to the initial models in order to try to perform better.
2.3 Proposed approach 
After reviewing several types of approaches together with their results and complexity, the decision was to start the experiments with a simple regular multilayer perceptron (MLP), using backpropagation as its training method. A regular neural network is a relatively simple tool with good predictive potential if the data is well organized; this, together with the fact that none of the methods mentioned in the section above have shown outstanding results despite being more complex, leads to the use of an initial MLP to perform this task. Afterwards, some modifications will be added to the basic model with the objective of improving its performance, which will be explained in further sections, and comprise changes such as replacing the input layer, initially a list of raw values, by different averages of those values, or the output layer, initially a binary vector, by a single real number. Additionally, modifications to the architecture will be considered, as well as an exhaustive scan of the different parameters that might affect the results obtained. Slightly more complex modifications will also be made, like substituting the traditional way of splitting the data to train and test the network by a new model where an overlapping of samples is considered, with the goal of exploiting the data better, or a hybrid system in between the traditional model and the overlapping one.
3 Experimentation process 
3.1 Tools used 
To carry out all the experimentation of the project, many tools and elements have been considered and several of them used. One of the most important, as mentioned in previous sections, has been the Yahoo Finance platform, which gathers historical data from the main stock markets and allows anyone to download it. Regarding the software used, the first attempt was Theano, a Python library, but after a few experiments it was decided to change to APRIL-ANN [1], which is based on the scripting language Lua [2]. It was mainly chosen because it is developed solely for working with Artificial Neural Networks, with efficiency in mind. It was additionally chosen because both the director and the co-director of the present project take part in its development. All the pre-processing and post-processing of the data has been done with Python, for the mere reason of familiarity with it and it being a powerful scripting language. Different external Python libraries have been used for different purposes, such as urllib for downloading the data from the Yahoo platform, csv for working with such files, multiprocessing and threading to speed up the process, or typical handy Python libraries such as math or collections. Everything ran on an Intel Core 2 Duo (2.00 GHz) with 4 GB of RAM, running Ubuntu Linux 13.10.
3.2 Basic strategy and data used 
The stock market offers many possibilities, permitting investors to buy, sell and keep whatever and whenever they want. For this reason, some boundaries need to be put on the system before establishing a forecasting model, so that it can be studied more easily. Given that one of the most popular sources of public historical stock market data is Yahoo Finance, this platform will be used, as it has daily data available since the early nineties. The daily data provided by Yahoo Finance contains, for each day, its date, opening value, maximum value, minimum value, closing value, real volume and an adjusted closing value, which is the closing value modified when dividends are paid. Back to the system, the boundaries will be set as follows:

• From the historical data, only the date and the percentage of change will be used, the latter calculated as the relative difference between the adjusted closing value of one day with respect to the same value of the day before.

• The investing strategy will be to perform an action at the opening time and keep it until the closing time of the same day. This means that the adjusted closing value of the day before will be used as the initial stock value and the adjusted closing value of the current day as the last value.

• The focus will be put on trying to predict the direction of change of the market, instead of predicting the value itself, emphasizing the practical financial side more than the precision of the predictions, although both measurements are closely related.

• All the historical data up to one point is available to predict the direction of change of that point, meaning that if tomorrow's change is to be predicted, all the data until today would be available.

• The initial date used as the beginning of the data will start from different points in time for different experiments, but it will never be older than 1st January 2000.

• For the first experiments a stock from the Spanish stock market IBEX 35 has been used, arbitrarily chosen by alphabetical order as a regular stock, Abengoa Abertis. Other series are analyzed later to get a better understanding of the series' predictability.
3.3 Performance measurement 
The first thing that has to be set is a common evaluation model for all the experiments. Regarding ANNs, the main measurement is the error obtained on the validation and test datasets, which represents how well the network has learned the samples of these datasets. Typical errors that have been used in this experiment include the Mean Squared Error (MSE), the mean of the squared differences between the estimator and what is estimated, and the cross-entropy error, which gives an estimation of how similar two distributions are.

From a more financial point of view, different measurements are needed, which go further than purely mathematical ones. One of them is the percentage of success, which basically is how often the selected action is right. The main disadvantage of this method is that not all the actions have the same effect; for instance, assume four days when the market goes up 0.1% and a fifth when it goes down 3.4%. The success rate would be 80%, but more than 3% of the money would have been lost. This is not the most common of cases, but it is something worth considering.
Another way of measuring the effectiveness, which has been used in several articles regarding stock market prediction, is a simulation of the actions. Supposing an initial capital of $100, the actions predicted by the system are applied to this amount of money, which is modified according to the real series' fluctuations. This method gives a very simple idea of how the system performs. Its main disadvantage is that it does not consider the number of actions performed. For instance, a final amount of $115 can be fairly good if just 10 actions have been undertaken, but it is a terrible result when 400 actions have been performed, mainly because of the fees applied by the brokers, as explained previously, which would end up in a loss of money. A solution to this disadvantage could be to divide the difference between the initial $100 and the final amount by the number of actions undertaken, but the problem would then be that the more money being moved, the more impact each action receives, which would not be fair either.
As a last practical error measurement, the average rate obtained in the simulation can be used. Each time an action is performed, the original difference is added to the rate if the action is right and subtracted if it is wrong, and this value is divided by the total number of actions performed. This way an average percentage of gain per transaction is obtained, with the main disadvantage of not knowing the number of actions. For instance, a rate of 0.7% over 50 actions in a period of 3 months is better than a rate of 1.2% in the same period when just one action has been performed. Something to consider here is that a positive rate does not always result in benefits at the end. An extreme example: starting with $100, first gaining 60% ($160) and then losing 40% (down to $96) would mean a superb net rate of +20%, or +10% per action, but a loss of $4 at the end.
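The three practical measurements described above (success rate, $100 simulation and average rate per action) can be computed with a short sketch like the following; the daily changes and predicted actions in the example are hypothetical and only illustrate how the measures are obtained:

```python
def evaluate(changes, actions, initial_money=100.0):
    """changes: real daily variations in percent; actions: +1 buy, -1 sell, 0 keep."""
    money, hits, moves, rate_sum = initial_money, 0, 0, 0.0
    for change, action in zip(changes, actions):
        if action == 0:
            continue  # staying out of the market: no gain, no loss
        moves += 1
        gained = change if action == 1 else -change  # short selling wins when the market drops
        money *= 1.0 + gained / 100.0
        rate_sum += gained
        hits += gained > 0
    return {"final_money": round(money, 2),
            "success_rate": hits / moves if moves else 0.0,
            "avg_rate_per_action": rate_sum / moves if moves else 0.0,
            "actions": moves}

# Hypothetical week of daily changes (%) and predicted actions.
print(evaluate([0.4, -1.2, 0.1, 2.0, -0.3], [1, -1, 0, 1, 0]))
```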
To sum up, there is no perfect error measurement for this problem, but there are several measures that, combined, can give a very good idea of how the system performs. All of them will be used in order to contrast the results obtained with each of them, mainly focusing on the last one, the ratio of benefits, while always keeping an eye on the number of movements carried out.

Lastly, the results will be compared with a few simple strategies such as a Random Walk or the evolution of the market itself, which corresponds to buying the stocks on the first day of the given period and keeping them until the last.
3.4 Data preprocessing 
Before starting with the experimentation itself, the data must be shaped in a way that can be easily read by APRIL-ANN. The first thing to do is to download the historical financial series, as mentioned above, of Abengoa Abertis (ABE.MC in Yahoo Finance), as it is a regular share on the Spanish Stock Exchange; the data interval will be from 1st January 2000 to 1st April 2014. The period of time used only for prediction will start on 1st September 2013, a total of 7 months or 151 days of activity on the Madrid Stock Exchange, and will be the same for all series, so that the results can be compared afterwards.
The next step is to represent each single day of this more than 14-year series as its date plus a single number representing the relative difference with respect to the day before. With this, the first and last elements of the series are lost, because the difference between 1st January 2000 and the last day of 1999 is unknown, and the same applies to April 2014, leaving 150 days of activity in the prediction period. Nevertheless, this is still far better than using the absolute prices of the shares, which can vary in terms of magnitude in a matter of days.
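A minimal sketch of this step is shown below, assuming a CSV file with the columns Yahoo Finance used to provide at the time (the column names "Date" and "Adj Close" and the ISO date format are assumptions of this example, not a description of the original scripts):

```python
import csv

def daily_changes(csv_path):
    """Return (date, percent_change) pairs computed from the adjusted close column."""
    with open(csv_path) as f:
        # Assumes ISO dates (YYYY-MM-DD) so lexicographic order is chronological.
        rows = sorted(csv.DictReader(f), key=lambda r: r["Date"])
    changes = []
    for prev, curr in zip(rows, rows[1:]):
        prev_close = float(prev["Adj Close"])
        curr_close = float(curr["Adj Close"])
        pct = 100.0 * (curr_close - prev_close) / prev_close
        changes.append((curr["Date"], pct))
    return changes  # the first day of the series is lost, as explained above
```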
Once each single day's price difference of the series has been calculated, the input and output of the network have to be generated from them. In the first and most basic experiment, a regular multilayer perceptron will be used, where the input of the network consists of a sliding window of length N along the series. With this method, the input of the network will be the values from time t-N to t for predicting the value of t+1, as Figure 4 shows:
Figure 4: Time line showing the sliding window used in order to predict the 
values of t+1 and t+2 respectively 
When the value t is available, the window from t-N to t is used to calculate 
t+1, and when t+1 is available, the window slides one position, from t-N+1 to 
t+1 in order to calculate t+2. 
The output used in this first model consists of a binary vector of three elements for each sample, representing the ideal action to perform on that day, according to the tendency of the series: down {1,0,0}, remains {0,1,0} or up {0,0,1}. As the aim is to maximize the benefits, the market going down will be understood as a sign to sell, the market remaining constant as not performing any action, and the market going up as a sign to buy. The threshold used to decide when to remain is Abs(value) ≤ 0.65, meaning that the ideal action would be buying when the share increases its price by more than 0.65%, selling when the share's price change is -0.65% or lower, and remaining inactive otherwise. With this distribution there will be approximately one third of each of the actions along the series.
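A sketch of how such samples could be generated from the list of daily changes, using the 0.65% threshold described above (the exact encoding in the original scripts may differ), is the following:

```python
def make_samples(changes, window_size, threshold=0.65):
    """Build (input_window, target_vector) pairs for the three-class model.

    changes: list of daily percentage variations, oldest first.
    Each input is the window of the last `window_size` values and the target
    encodes the next day's move: down {1,0,0}, remain {0,1,0}, up {0,0,1}.
    """
    samples = []
    for t in range(window_size, len(changes)):
        window = changes[t - window_size:t]
        nxt = changes[t]
        if nxt > threshold:
            target = [0, 0, 1]   # buy
        elif nxt < -threshold:
            target = [1, 0, 0]   # sell
        else:
            target = [0, 1, 0]   # keep
        samples.append((window, target))
    return samples
```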
Another matter to consider is the initial length of the series to analyze, as well as other parameters like the size N of the sliding window. Different values for both parameters will be tested and analyzed in further sections, but the concern for the moment is the repercussion of these parameters on the final length of our series. A starting date for the series will be needed, meaning that no data prior to that date will be available for the experiments at all, and the window will need some initial data before the first sample is available. For instance, if the starting date is 1st October 2012 and the window size is 4, the first sample available will be on 4th October, because the first 4 days are used to generate this sample. The second sample will contemplate the values from the 2nd to the 5th of October, and so on. To sum up, it should be kept in mind that the window size has to be subtracted from the initial length of the series to obtain its final length, something that can cause problems if it is not considered, mainly when using big window sizes and/or recent starting dates.
The last important part of the preprocessing is the splitting of the data. As previously mentioned, this first experiment will be a simple multilayer perceptron, so the data must be split into three datasets: one for the training of the network, another for a first validation of this trained network, and a third dataset for a second validation of the system, which comprises a fixed period from 1st September 2013 to 31st March 2014 in all the experiments, regardless of the size of the other datasets used. The remaining data, including all the samples older than the second validation period, will be split into training and validation 1, with a proportion of 0.75 for the first and the remaining 0.25 for validating the trained system. The data from April onwards will be used afterwards to test the network selected according to its performance on the validation 2 dataset, as can be seen in Figure 5.
Figure 5: Time frame where the splitting of the data is shown for an undetermined starting date.
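In code form, this chronological split could be sketched as follows; the date boundaries are the ones given in the text, and the function and variable names are illustrative rather than taken from the project's scripts:

```python
def split_datasets(samples, dates, val2_start="2013-09-01", test_start="2014-04-01"):
    """samples and dates are parallel lists in chronological order (ISO date strings)."""
    older = [s for s, d in zip(samples, dates) if d < val2_start]
    val2 = [s for s, d in zip(samples, dates) if val2_start <= d < test_start]
    test = [s for s, d in zip(samples, dates) if d >= test_start]
    cut = int(len(older) * 0.75)            # oldest 75% for training
    train, val1 = older[:cut], older[cut:]  # most recent 25% for validation 1
    return train, val1, val2, test
```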
The problem mentioned previously can appear with this way of splitting the data: depending on the starting date and the window size, the number of samples might not be enough to cover the whole set of dates needed for the second validation, or the data remaining for training and first validation might not be large enough after removing the validation 2 samples. In these cases, the experiments will simply not be considered. For example, it would not make sense to set up an experiment with data from 1st July 2013 and a window size of 30, basically because the dataset for training and validation 1 would contain just 14 samples (10 for training and 4 for the first validation), while the samples for the second validation would still be 150.
As a final comment, it should be remarked that the data classified as validation 2 is the typical testing dataset of the train-validation-test split, but a further test dataset will be used, and the best parameters will be chosen in order to maximize the results obtained within this validation 2 period. The real potential of the experiments will be shown on the test dataset, a more recent period from 1st April 2014 onwards.
3.5 Post-process of the data 
Another highly important point of the experimentation is processing the data after the networks have been trained, a matter covered in this subsection. First, immediately after the training, the second validation dataset is processed by the best network according to the first validation dataset, and its error is written to a summary file created for each single configuration of parameters, where information about the evolution of the training is kept, such as the epoch where the best net was obtained or the errors on both validation datasets.
The different performance measurements are calculated on the validation 2 dataset as well. First, for each sample of the dataset in chronological order, its predicted action is calculated and simulated on an amount of $100 from September 2013 to March 2014. Whether the action was a success (buy when the series goes up and sell when it goes down) or not is taken into account, and the confidence of each action is stored as well, calculated from the ratio between the greatest neuron's output and the second greatest on a natural scale after the activation function is applied. Assuming o1 is the greatest output and o2 the second greatest, the ratio is the exponential of their difference and the confidence is 1 - ratio, so the larger the gap between o1 and o2, the closer the confidence is to 1. After every single sample of this dataset has been analyzed, summarized information such as the number of actions, the ratio per action or the success percentage is calculated, in order to have an outline of every trained network.
After the execution of each network, a trace of its behavior is saved in a corresponding file, together with some interesting information such as the confidence of each action performed. One of the problems faced after the execution of the networks is that the number of actions might be too high, driving the results to a low performance. The first idea to solve this problem is to use a fixed threshold, so that all the actions with a confidence lower than this threshold are ignored. The problem that appears here is that in some executions no actions are performed at all because the threshold is too restrictive, while in other executions the threshold does not filter out any action of the set. The solution is to use a variable threshold that depends on the set of confidences of the series. A parameter indicating the percentage of actions to consider will be needed, and the plan is to sort the list of confidences and, with the help of this parameter, choose the one that will act as the threshold. This way different series with different confidences can be compared, because the task is done with relative numbers instead of absolute ones.
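A sketch of this post-processing step is shown below. It assumes the three outputs are log-softmax values, so that exp(o2 - o1) is the ratio between the two largest class probabilities on a natural scale, and it keeps only a given fraction of the most confident actions; function and variable names are illustrative, not taken from the original scripts:

```python
import math

def confidence(log_outputs):
    """1 - exp(o2 - o1), with o1 >= o2 the two greatest log-softmax outputs."""
    o1, o2 = sorted(log_outputs, reverse=True)[:2]
    return 1.0 - math.exp(o2 - o1)

def filter_by_confidence(predictions, keep_fraction=0.25):
    """predictions: list of (action, log_outputs) pairs.

    Keep only the most confident fraction of actions; the rest become 'keep'.
    The threshold is relative to the confidences observed, as described above.
    """
    if not predictions:
        return []
    scored = [(confidence(outputs), action) for action, outputs in predictions]
    kept = max(1, int(len(scored) * keep_fraction))
    threshold = sorted((c for c, _ in scored), reverse=True)[kept - 1]
    return [action if conf >= threshold else "keep" for conf, action in scored]
```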
Further restrictions can leave some experiments out of consideration. One of them is a minimum number of movements required after the threshold is applied. A set of 150 samples is considered for the second validation, so, for instance, an experiment that ends up performing only one action out of these 150 possible ones cannot have very good odds, and it will be discarded. Another constraint is the number of the best epoch, as networks that classify the data randomly are not desired. The starting weights of the network's connections are set randomly, and if after 200 epochs of training the first epoch is still the best one, something is not going well, as the training has not been able to improve a random network, so the experiment would be discarded.

As an example, consider a result set where the predicted actions are: 50 samples buy, 50 samples keep and 50 samples sell, which means a total of 100 proper actions. Assuming that the best epoch was high enough, a hypothetical top 5% would be quite poor, because only one action would be performed for every thirty samples, whereas a higher percentage such as 25% would probably be better, as now one action out of every 6 samples would be carried out. In practice, this parameter will have to be examined as well in order to find the optimum percentage of samples to take into account. The minimum number of movements required will be set to 8, the same value as the minimum number required for the best epoch of the network's training.
4 Models used 
4.1 The basic model 
As mentioned before, the initial model will be a regular multilayer perceptron with backpropagation as its training method. The first step is to normalize all the data to a standard deviation of one and a mean of zero, in order to equally distribute the data and facilitate the learning. When creating the neural network, one hidden layer with the logistic activation function in its neurons will be used, and in the three neurons of the output layer the function chosen will be the logarithmic softmax. The loss function used to train will be cross-entropy, computed between the given input/target patterns, interpreting the ANN output as a multinomial distribution. A batch size equal to the number of training samples will be used, meaning that all the samples are read before the network is actually updated, which implies more computation time to process each step, but more accurate steps.

The initial weights will be randomized between the values -0.1 and 0.1, with the purpose of having a neutral network before the training. A pocket algorithm will be used, meaning that the network with the best results will always be available, even if later training iterations worsen it. The network will keep training until the current epoch's error is twice as big as the error of the best epoch, with a minimum of 200 iterations and a maximum of 3000. The parameters needed to tune the training of the network, namely the size of the hidden layer, the learning rate and the momentum, will be given as arguments to the APRIL-ANN program, so that bash scripts can later wrap the execution of the network together with the other dependent scripts.
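The actual training runs inside APRIL-ANN, but the pocket algorithm and the stopping criterion described above can be sketched in generic Python; `train_one_epoch` and `validation_error` are placeholders for whatever the toolkit provides, assumptions of this sketch rather than real APRIL-ANN calls:

```python
import copy

def train_with_pocket(net, train_one_epoch, validation_error,
                      min_epochs=200, max_epochs=3000):
    """Pocket algorithm with the stopping criterion described in the text:
    stop once the current validation error doubles the best one, after a
    minimum of 200 epochs and never exceeding 3000."""
    best_net, best_error, best_epoch = copy.deepcopy(net), float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(net)
        error = validation_error(net)
        if error < best_error:
            # Keep a copy of the best network seen so far (the "pocket").
            best_net, best_error, best_epoch = copy.deepcopy(net), error, epoch
        if epoch >= min_epochs and error > 2.0 * best_error:
            break
    return best_net, best_epoch
```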
Finally, the system needs to be tested, and a scan of parameters will be done for this purpose. The first parameter is the starting date of the series. The series are available from January 2000 to August 2013, as September 2013 onwards is part of the validation 2 dataset. Starting dates of January of the odd years from 2000 to 2012, together with 2011 and 2013, will initially be used, plus dates starting in July of the years after 2010. The size of the sliding window is another important parameter to scan, which also sets the size of the input layer. The initial set of values to check here goes from 5 to 200, in order to have an initial idea and proceed with more concrete values afterwards. The next interesting parameter is the size of the hidden layer, affecting the topology of the network. The values used here are the same as for the sliding window and, again, further experiments will be performed for the values that are close to the best results. Another variable for the performance of the network is the learning rate, which will initially be analyzed from 0.001 to 0.5. Given that this parameter strongly depends on the size of the network, which in this case is determined by the sliding window and hidden layer sizes, new scans will have to be done once the range of these two parameters is narrowed down. The last parameter to analyze is the momentum of the network. Not so many options are needed here, so the starting values will be 0.0, 0.05, 0.2 and 0.4.
4.2 Variants of the model 
Now that the basic architecture of the network is understood, some different modifications will be presented before going on to the pertinent results. First, a change in the input of the network will be presented; then an alternative for the output; and after that a modification of the training process, changing the order in which the data is given to the model. Finally, in a new section, a hybrid model will be presented as an attempt to put together the main advantages of both learning schemes, with two slightly different alternatives.
4.2.1 Sliding window vs Averages as inputs 
The simplest of the proposed modifications affects the input information that is passed to the network. In the basic model, a sliding window taken directly from the original series was the input. The main problem this method presents is that, in order to recognize a new pattern with high confidence, an almost identical one should have been used for training, which is very difficult given the noise present in the financial series. Another way of seeing it is that, this way, the network is learning the data by heart, which makes it difficult to generalize afterwards.

The proposed alternative is to use averages instead of the raw values from the series, with the objective of learning the tendency of the series more than the numbers themselves. Instead of having the window size as one of the variables of the system, this variable will be a vector where each element represents the amount of values used to calculate each of the averages that will be used as an input, always counting back from the value right before the one to be predicted. For instance, the vector {9,6,3,1} would mean that the first element of the input layer is an average of the 9 last elements, the second the average of the last 6, the third the average of the last 3, and the last one the average of the last single element, in other words, the last element itself, as can be appreciated in Figure 6.

Figure 6: Gathering of information to generate four inputs in a {9,6,3,1} averages model.
In the previous picture it can be seen that the size of the input layer would in this case be four neurons, each of them denoted as i#, where the hash stands for its number, containing the averages of the values encompassed in the figure. To get a better understanding of the difference between both methods, the same series is represented with both of them in Figure 7, where the averages model uses the vector {20,15,10,5}.

Figure 7: Comparison of the fluctuations of the same data represented as raw values and as averages of the last values from t-20 to t.

The figure shows the fluctuations of a series over the last twenty days, both as raw values and as averages of 20, 15, 10 and 5. The averaged representation is more general, but contains less information, hence it is easier to learn.
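A sketch of how such an input could be built for an arbitrary averages vector (the {9,6,3,1} of the example), always counting back from the value right before the one to be predicted, is shown below; the code is illustrative and not taken from the project's scripts:

```python
def averages_input(changes, t, spans=(9, 6, 3, 1)):
    """Input vector for predicting changes[t]: one average per span,
    each computed over the `span` values immediately before position t."""
    assert t >= max(spans), "not enough history before position t"
    return [sum(changes[t - span:t]) / span for span in spans]

# Example: with spans (9, 6, 3, 1) the last input is just the previous value.
series = [0.4, -1.2, 0.1, 2.0, -0.3, 0.7, -0.5, 1.1, 0.2, -0.8, 0.6]
print(averages_input(series, t=10))
```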
4.2.2 Three-class classier vs Forecasting model 
The next interesting change to the basic model affects the output of the network: in the basic model a binary vector was used, representing whether the market went up, went down or kept its value the following day. The alternative approach consists of replacing this output layer of three neurons by a layer with a single neuron, which contains the real value provided by the financial series. The main benefit of this resides in the fact that, with only one output, there is no reduction of information in the model. In other words, the model with three outputs considers a rise of 0.7% and a rise of 5% as the same thing, when the real repercussion caused by the second is much higher than that caused by the first; or a slight difference between two similar values, such as 0.64% and 0.65%, which are pretty much the same but are considered completely different outputs. An example of the different types of outputs can be seen in Table 1.
Date Current value Trend class Forecast 
2013-01-31 -2.445 {1,0,0} -1.585 
2013-02-01 -1.585 {1,0,0} -3.768 
2013-02-04 -3.768 {0,0,1} 2.197 
2013-02-05 2.197 {0,1,0} -0.462 
2013-02-06 -0.462 {0,1,0} -0.516 
2013-02-07 -0.516 {0,0,1} 2.000 
2013-02-08 2.000 {1,0,0} -1.177 
2013-02-11 -1.177 {0,0,1} 1.932 
2013-02-12 1.932 {0,0,1} 0.868 
2013-02-13 0.868 {1,0,0} -0.707 
2013-02-14 -0.707 {1,0,0} -1.178 
2013-02-15 -1.178 {0,1,0} -0.506 
2013-02-18 -0.506 ? ? 
Table 1: Example of the different outputs for the IBEX 35 series in the period from 2013-01-31 to 2013-02-18.
This modification entails two main changes in the neural network apart from the topology. One of them is the activation function of the output layer, which until now was a softmax; since the values no longer need to tend towards a discretization but are continuous instead, the activation function will now be linear. The other change regards the loss function, which for the classifier model was the cross-entropy; as there is only one value now, this function would no longer make much sense, so it is changed to the mean squared error (MSE).

With only one output the problem becomes a forecasting model instead of a classification into three classes, as it was before. As mentioned previously, the principal advantage is the different importance of each value for training, which makes it possible to distinguish strong and weak tendencies, but there is also a negative side, mainly concerning two problems. The first one is merely technical: a forecasting model is not as stable to train as a classification problem. A forecasting model is more likely to diverge, mostly when high learning rates are used, but it is also not guaranteed to converge when smaller rates are used. As lots of different experiments are run, sometimes it can be very difficult to know whether the network has converged enough or not, since with the highly noisy nature of the data a random network can easily provide decent results that can lead to confusion. The second problem faced with this method is more practical, and resides in the fact that the sharpest short-term peaks of the stock market are normally caused by important news, which is in fact unpredictable. This means that the samples that have a greater impact on the system are the ones that probably should not be learned by it, although they are not abundant.
4.2.3 Traditional MLP vs Model with data overlapping 
This last modification regards the organization and the order in which the data is given to the system to train, validate and test. It arises from the idea that different contexts might have an effect throughout a series. Social, economic and historical features are very different nowadays than before 2005, for instance, with an economic crisis in between, which makes markets behave differently. For this reason, the objective of this modification is to train with data chronologically closer to the data that is going to be predicted, in order to reduce the difference of contexts.

In the regular MLP model explained up to this point, the second validation data comprised the period from 1st September 2013 to 31st March 2014, the first validation data was the most recent 25% of the remaining samples, and the training data was the oldest 75%. When the series are long, and they can be as long as 14 years, a big gap exists between the data used to train the network and the data used for a second validation and/or a test. Concretely, starting in January 2000, the last data used for training is from the beginning of 2010, which leaves more than three years for the first validation; that basically means predicting data of 2014 using a network trained with data older than 2010. This is an extreme example where the easy solution would be to simply reduce the length of the series, as so much data will probably not be needed, but even if the data were reduced, the same problem would appear on a smaller scale.
The proposed solution is to avoid the first validation dataset in order to bring the training and testing datasets closer together. To do so, a model where only one dataset exists is proposed, and the network iterates over it in chronological order. A given sample is used to test the network and, in the next iteration, to train it, while the network is tested with the following one. With this method, each single sample is used first to test the network and only afterwards to train it, so that no sample is tested after having been used for training. For instance, at the start of a new iteration, the first thing to do is to use the current sample t-1 to train the network, and immediately after, sample t is used to test it. In the next iteration, t is used to train the network and t+1 to test it. This sequence continues until the last value of the series has been used for testing. The errors are calculated exactly the same way as before, with the only difference that in the overlapping model they are calculated while the training is being done. Also, the overall splitting of the data is kept, meaning that until 31st March 2014 the samples are used to train and validate, and from April 2014 onwards the samples are used to train and test. As one longer continuous dataset is used for the second validation and testing of the data, the only difference is that the best results until March 2014 will be picked, while there is no such choice from April onwards.
A simple outline of this process can be seen in Figure 8, replacing the regular 
method shown in Figure 5. 
Figure 8: Scheme showing the proposed method with overlapping of data in one 
single dataset. 
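In schematic form, the overlapping scheme amounts to the following loop; `predict` and `train_on` stand for the toolkit-specific calls and are assumptions of this sketch, not APRIL-ANN functions:

```python
def overlapping_run(net, samples, predict, train_on, start=1):
    """Chronological test-then-train loop over a single dataset.

    samples: (input, target) pairs in chronological order.
    Each sample is first used to test the current network and only
    afterwards incorporated into its training, so no sample is ever
    tested after having been learned.
    """
    predictions = []
    for t in range(start, len(samples)):
        x_prev, y_prev = samples[t - 1]
        train_on(net, x_prev, y_prev)             # learn the sample that just became known
        x_curr, _ = samples[t]
        predictions.append(predict(net, x_curr))  # then predict the next, unseen one
    return predictions
```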
The main advantage of this model is the full utilization of the data to train the network, plus the fact that, by learning all the samples in chronological order, the more recent a sample is, the more impact it has on the system, so that it forgets old samples by learning new ones and modifying the system according to them. On the other hand, there is a big disadvantage: underfitting, which will be explained in detail in section 7.2 and can occur because the network uses each sample for training only once during the whole process. This can be patched up by increasing the learning rate or by using an adequate number of iterations determined by the length of the series, so that the network iterates the correct number of times. Both the learning rate and the series' length will be parameters to scan and analyze, as will be seen in later sections.
4.3 Hybrid model with overlapping of data 
Up to this point, different modifications with slight differences from the initial model, a traditional multilayer perceptron, have been explained. The most uncommon model is probably the one with overlapping of data, which does not use the typical data split that neural networks normally rely on, adding the small advantage of exploiting the data better than regular models at the expense of the big disadvantage of data underfitting. In order to mitigate this, two new models will be presented with the purpose of avoiding the underfitting problem without losing the advantageous use of all the data.
4.3.1 Explicit validation dataset 
The first of the alternatives is also the simplest one, based on the overlapping model: a full training of a network is done for each new available sample. Another way of understanding it is to start from the basic model and use only one sample as the second validation dataset, instead of the 150 samples used before, iterating over the whole old set. Once this prediction is performed, all the datasets advance one sample in time, so that the predicted sample is now used as the last one of the validation dataset, while the following sample is the one to be predicted. With this model, a completely new artificial neural network is created for each sample that needs to be predicted, each having a different number for the best training epoch, as this depends on the samples of the datasets, which change each time. The total number of samples used for the prediction of each tendency remains constant for each of the predictions, as Figure 9 shows. The split between the training and the first validation dataset is kept at 75% and 25%, the same as in previous models, and the starting date is considered a network input parameter as well.
Figure 9: Training methodology of the model with validation dataset. 
There is an obvious disadvantage in comparison with previous models, which is that the time spent by the model to predict the samples increases considerably, as now one network is trained and used to predict each sample. In this case, where 150 samples are available along the validation dataset, the time spent by previous models gets multiplied by 150. Due to the nature of the problem, where only one more sample becomes available per weekday, this is not a big issue, as between the closing time and the opening time of the following day there is plenty of time to train the new models and predict the new tendencies. However, the process of looking for the correct parameters is very expensive in computational terms, taking around 150 times longer than in previous models, where the whole second validation dataset was predicted with the same trained network. A positive side is that, because this model is considerably similar to the previous ones, the scanning of parameters does not need to be very wide, as the ideal parameters for the other models are already known. Hence, the parameter scan shortens, with its corresponding reduction in time.
4.3.2 Implicit validation dataset 
The last of the models to analyze is an evolution of the previous one, the hybrid model with a validation dataset, with certain characteristics of the overlapping model. The main idea is to use a training dataset chronologically as close as possible to the sample to be predicted each time, by removing the validation 1 dataset used in previous models. If it were simply removed, the problem faced would be that the stopping criterion would be undefined, as it is set according to this validation dataset. The proposed solution is to set a fixed number of training iterations for each sample's prediction, determined by the best epoch obtained in previous full trainings with a validation 1 dataset.

The prediction of a given sample is performed as follows: first, a network is trained using both the training and validation 1 datasets in order to predict the sample immediately after the validation dataset, as was done with the previous model; then the number of the best epoch is kept and the trained network completely discarded; next, a training dataset of the same length as the one used before is taken from immediately before the sample to be predicted, and the training is performed with the same parameters during the stored number of epochs; finally, the resulting trained network is used to predict the sample. Figure 10 shows the process of the training method in detail, where the first part of each iteration is used to get the number of the best epoch and the second to train the actual network with a fixed number of training epochs.
Figure 10: Training methodology of the hybrid model with implicit validation 
dataset. 
A problem that might appear with this method is that sometimes, when a network is trained, the number of the best epoch could be one, meaning that the training has not improved the initial random model. If the parameters used are correct this will not be a common problem, but it can still happen. The proposed solution is to use more than one previous training to determine the fixed number of epochs, by calculating their average. The number of old best epochs used to calculate the average will be seven, the last consecutive ones. When the best epoch of a training equals one, the average drops, and this can sometimes have a great impact on the average number of epochs. The solution is to remove the lowest of the seven epochs from the average, removing the greatest as well so that the average does not become unbalanced. At the start of the series, the average of the first networks is used as the number of training epochs for that same number of networks, because no previous data is available. Table 2 shows an example of a series where the number used to calculate the average is five, or three after the removal of the lowest and greatest best epochs.
Sample number 1 2 3 4 5 6 7 8 9 10 11 12 13 
Training BE 18 25 20 19 17 24 29 14 25 1 21 19 20 
Iterations 19 19 19 19 19 21 21 20 22 21 20 18 21 
Table 2: Example of the number of iterations calculated out of the best epochs 
from the previous 5 samples. 
The table shows the resulting number of iterations computed out of the last 5 samples. For instance, for sample number 10 the values 24, 29, 14, 25 and 1 are available. Removing the greatest and the lowest, which are 29 and 1 respectively, the values 14, 24 and 25 remain. Their average gives 21 iterations, which is the number set for the training of the network used to predict sample number 10.
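The rule used to fix the number of training iterations can be written as a short helper: take the most recent best epochs, drop the lowest and the highest, and round the average of the rest. The sketch below is illustrative only; it reproduces the Table 2 example for sample number 10:

```python
def fixed_iterations(previous_best_epochs, history=7):
    """Trimmed average of the last `history` best epochs (drop min and max)."""
    recent = previous_best_epochs[-history:]
    if len(recent) > 2:
        recent = sorted(recent)[1:-1]   # remove the lowest and the greatest
    return round(sum(recent) / len(recent))

# Reproducing the example of Table 2 for sample number 10 (history of 5).
print(fixed_iterations([24, 29, 14, 25, 1], history=5))  # -> 21
```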
It is important to mention that this model takes on average almost twice as long as the hybrid model with a validation dataset, which was already taking 150 times longer than the initial models. It first needs to train the networks exactly the same way the previous model did, and afterwards train a new network with different samples up to the best epoch of the previous training. When searching for the best epoch, the training keeps iterating even when the errors obtained are worse than the best one, for at least 50% more iterations than the current best epoch's number. This means, for instance, that in a training where the best epoch is reached at iteration number 600, the network will iterate another 300 epochs, until epoch 900, and if the best epoch is still number 600, it will then stop. During the training of the actual network, only 600 iterations would need to be performed, a considerable saving of time depending on the case. In general terms, it can be said that this second hybrid method is approximately 70% more expensive than the first one.
5 Results 
In this section a comparison of the different alternative models will be shown,
starting from the basic system's results. Note that, as mentioned in previous
sections, the validation 2 dataset, spanning from September 2013 until March
2014, will be used to measure each system and modification, and further analysis
will be done in order to check the networks against unknown future values of the
series.
The first thing needed is a baseline for the results, so several random walks
were generated for the series. For each single day of the validation 2 period, one
action has been picked uniformly at random out of buy, sell or
keep. Table 3 shows some information on ten executed random walks, counting
as the number of actions the sum of both buy and sell actions, excluding the
keep ones.
# Final money Actions Benefit/action Success rate
1 $84.03 88 -0.192% 42% 
2 $97.94 102 -0.015% 47.1% 
3 $108.88 95 0.09% 50.5% 
4 $113.07 92 0.14% 51.1% 
5 $107.28 107 0.072% 56.1% 
6 $90.85 103 -0.087% 44.7% 
7 $108.97 100 0.091% 45% 
8 $107.76 91 0.088% 54.9% 
9 $93.84 109 -0.053% 46.8% 
10 $100.26 94 0.009% 45.7% 
Average $101.288 98.1 0.0143% 48.39%
Table 3: Execution of ten independent random walks showing the final amount
of money, number of movements, ratio of benefit per action and success rate,
together with the average of them all.
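For reference, the figures in this table can be reproduced with a simulation along the following lines. This is only a sketch under the assumptions of section 3 (an initial $100, at most one action per day, a buy gaining the day's percentage change and a short sell gaining its negative, commissions ignored); the function and variable names are illustrative.

```python
def simulate(actions, daily_changes, initial_money=100.0):
    """actions: 'buy', 'sell' or 'keep' per day; daily_changes: fractions (0.01 = 1%)."""
    money, moves, hits, total_gain = initial_money, 0, 0, 0.0
    for action, change in zip(actions, daily_changes):
        if action == 'keep':
            continue                      # stay away from the market that day
        gain = change if action == 'buy' else -change
        money *= 1.0 + gain               # the whole amount is (re)invested each day
        moves += 1
        hits += gain > 0
        total_gain += gain
    return {'final_money': round(money, 2),
            'actions': moves,
            'benefit_per_action': total_gain / moves if moves else 0.0,
            'success_rate': hits / moves if moves else 0.0}

print(simulate(['buy', 'keep', 'sell'], [0.0044, -0.0020, -0.0030]))
```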
In terms of the money obtained, the table shows that on average, using a
random walk strategy, the benefit after 98 actions would be $1.288, which is not very
good. The best random walk (number 4) obtained a
benefit of $13.07, with a rate of 0.14% per action. On the other hand, the worst
was number 1, with a total loss of $15.97, meaning an average loss of 0.192%
per action. The median execution, number 10, is also the closest to the average,
remaining very close to the initial sum of money with $100.26. Figure 11
compares the best, worst and median random walk executions
together with the fluctuation of the original series itself.
Figure 11: Comparison of the best, worst and median random walks against the
original series.
In order to get a better idea, another 100 random walks have been executed,
showing an average final amount of $99.08 with a standard deviation of 10.66. This
reinforces the idea that the series is biased neither to win nor to lose money,
but to maintain its value.
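The random walk baseline itself is straightforward to reproduce; the sketch below draws one of the three actions uniformly at random for every day of the period and reuses the hypothetical simulate() helper sketched after Table 3.

```python
import random

def random_walk(daily_changes, seed=None):
    """One random-walk execution over the given period (uses simulate() from above)."""
    rng = random.Random(seed)
    actions = [rng.choice(['buy', 'sell', 'keep']) for _ in daily_changes]
    return simulate(actions, daily_changes)

# e.g. ten independent walks over the 150-day validation 2 period:
# summaries = [random_walk(validation2_changes, seed=i) for i in range(10)]
```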
5.1 Basic model 
As mentioned in previous sections, a scan of parameters is performed, generating
a large number of experiments. An easy and quick solution would be to
choose the experiment that made the maximum amount of money without
any kind of boundaries in the output, which in the present experimentation
corresponds to using the series starting in January 2004, a window size of 140, a
hidden layer of 10 neurons, a learning rate of 0.35 and no momentum.
This configuration managed to obtain $136.8 out of the initial $100 in 150
movements, with a success rate of 55.3%. The problem is the lack of
stability of the results with similar parameters. For instance, modifying the
momentum, which is probably the parameter that affects the system the least,
from 0 to 0.1, the benefit of $36.8 turns into a loss of $22.2 of the initial money,
dropping the amount to $77.8. This means that the reliability of the result is
very poor and that it was obtained quite randomly, without learning much
from the series. Analogously to the best result in terms of absolute money, a
maximum benefit rate of 1.1% per movement as well as a success rate of 100%
have been obtained in other experiments, but none of these experiments
are relevant, for the same reason as explained before.
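The scan of parameters mentioned above can be organized as a plain grid search. The sketch below is only illustrative: the listed values are examples taken from this section, and run_experiment() stands in for the training and simulation of one configuration, assumed to return a summary dictionary like the simulate() sketch shown earlier.

```python
from itertools import product

START_DATES    = ['2004-01', '2008-01', '2011-01', '2012-01', '2013-01']
WINDOW_SIZES   = [20, 35, 80, 100, 140]
HIDDEN_SIZES   = [10, 15, 25, 35, 45]
LEARNING_RATES = [0.001, 0.05, 0.35]
MOMENTA        = [0.0, 0.005, 0.02, 0.1]

def scan(run_experiment):
    """Run every configuration and sort the summaries by final money."""
    results = []
    for cfg in product(START_DATES, WINDOW_SIZES, HIDDEN_SIZES,
                       LEARNING_RATES, MOMENTA):
        results.append((cfg, run_experiment(*cfg)))
    return sorted(results, key=lambda r: r[1]['final_money'], reverse=True)
```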
The objective of the analysis is to find a cluster of experiments with similar
parameters and decent results in order to give some reliability to the parameters
used. But before that, a simple post-process has to be applied to the results, consisting
in considering only the top x percent most confident actions of each experiment,
as explained in the post-process section (a small sketch of this filter is shown after
Table 4). In Table 4 different top percentages are
compared for the same experiment (window size 140, hidden layer size 35, learning
rate 0.05, and momentum 0), which has good results, together with their final effect
on the initial amount of $100, the average ratio of benefit per movement and
the cross-entropy error of the set:
Top percentage Final money Actions Benefit/Action Error
All actions $93.9 88 -0.06% 1.10 
80% $98.2 74 -0.02% 1.09 
70% $100.3 65 0.01% 1.09 
60% $109.4 57 0.16% 1.08 
50% $114.7 48 0.29% 1.08 
40% $114.5 45 0.31% 1.06 
30% $120.0 37 0.5% 1.00 
20% $117.1 29 0.55% 0.95 
15% $116.5 21 0.73% 0.87 
10% $114.5 15 0.91% 0.81 
5% $110.4 7 1.43% 0.61 
Table 4: Final amount of money and average ratio of benefit per action for
different filtered top percentages applied to the same experiment.
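The post-process itself can be sketched as below. It assumes, as suggested by the discussion in section 5.2, that the percentage is taken over all the predicted days and that ties at the cut-off confidence are kept, which is why slightly more actions than the nominal percentage can remain.

```python
def keep_top_percent(predictions, top_percent):
    """predictions: one (action, confidence) pair per day; the rest become 'keep'."""
    n_keep = max(1, round(len(predictions) * top_percent / 100.0))
    cutoff = sorted((c for _, c in predictions), reverse=True)[n_keep - 1]
    return [(action if confidence >= cutoff else 'keep', confidence)
            for action, confidence in predictions]

# e.g. filtered = keep_top_percent(daily_predictions, 35)
```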
Table 4 illustrates that a greater final amount of money is not always a
better result, nor is a higher ratio of benefit per action. Performing 100% of
the predicted actions in this example (all the buy or sell actions out of the
150 days), there would be a loss of $6.1. Using the top 10%, the final amount of
money would be the same as using the top 40%, but the average gains are different:
0.91% against 0.31%. Even though the amounts of money are the same, the top
10 percent is clearly more convenient considering the commission charged by the
brokers, explained at the beginning of this document. Also, as a lower number
of actions is required, a higher benefit per action is reached, meaning that less
risk is taken. The highest ratio per action has been reached by the top 5%, but
not all the potential of the model would be seized, since using the top 10% or
15% gives a lower ratio but takes more actions into account, generating
a higher amount of money, which is probably worth, at least, consideration.
If the decision were to invest only in this series, a higher percentage of
actions would be recommended, for instance the top 30%, which has a
good ratio of benefit per action with a decent number of actions that would
increase the money without investing too much or too little, and which has managed
to gain the highest amount of money among the tested top percentages, $120.0.
If more series are taken into account, tighter top percentages would be better
options, as at every moment one action per series would be available, meaning
that just using the top actions of each series, a high number of actions would be
performed over time among all the considered series. Figure 12 shows the
behavior of each top percentage's effect through time on an initial amount of
$100:
Figure 12: Timeline showing the behavior of applying different top percentages
to the same results file through time.
Finally, as the best percentage for one single series was between 20 and 40%,
it was decided to use the top 35% of the actions, and the first consequence is
an increase in the average benefit per action, as expected. The experiment that
obtained the highest amount of money with the top 35% got a total of $125.8 in 52
movements. This is the best result so far, as the best execution using the total
of actions managed to end up with $136.8 out of 150 movements: around one
third more money with three times more movements. The parameters used for
the current best experiment are: starting date, January 2011; window size, 80;
hidden layer size, 35; learning rate, 0.35; and a momentum of 0.005.
If instead of picking the best experiment the set of results is pruned little
by little until just a few good results are left, the set ends up
with the following constraints: starting date, only January 2012; window size,
between 90 and 100; hidden layer size, between 15 and 45; learning rate, lower
than or equal to 0.05; and momentums lower than 0.1. These constraints leave a set
of more than 70 experiments, where the poorest performance still ended up with a
gain, $104. The final parameters used are not the ones that performed the best,
but the ones in the middle of a set of best performers, which are the
following: window size, 100; hidden layer size, 15; learning rate, 0.05; momentum,
0.005. These parameters obtained a total of $109.1 in 11 movements, meaning
a rate of 0.8% per action and a success rate of 81.8%; very good results.
Now the results need to be tested in order to know the real potential obtained,
and for this purpose a test dataset is available, spanning from 1st April
2014 to 31st July 2014. Also, the length of the series used for training will be kept
constant, meaning in this case that instead of starting the series on 1st January
2012, it starts on 1st August 2012, as there is a 7-month lag between
both datasets. Another thing to take into account is that in the previous validation
dataset 150 days were available in total, whereas in this test dataset the
number of days has shrunk to 87, as the dataset has shortened from 7 to 4
months.
When the parameters are applied to a network generated for the test dataset,
only one action out of the 87 possible ones is carried out, on the 29th April, where
the predicted action was to buy and the market went up by 0.44%, meaning a
final amount of money of $100.44 and, obviously, a success rate of 100%.
This result is rather disappointing, as more actions were expected, although the
benefit per action is still fairly good. After this, no further experiments with
the basic model will be done, as the hybrid systems are expected to be more
powerful than the present model, so the effort will be put into them.
5.2 Averages as inputs 
The first of the proposed modifications was the substitution of the sliding window
by the use of averages of recent values, as explained in section 4.2.1. Coming
to the execution of these networks, the first striking aspect of these experiments
is the speed improvement when training. While with the sliding window the
input layer could have sizes of up to 140 neurons, with this model it will
rarely have more than ten neurons. This is a huge reduction
of the computational time, as every hidden layer neuron is connected to all
the input layer ones. On the other hand, the pre-processing of the data takes a
little longer, as the averages have to be calculated, but this time is far less
than the time saved during training. Furthermore, this does not have to be done
for each single training of the networks: one pre-processing of the data is needed per
input pattern, covering a lot of networks that can be trained with different
parameters.
The scan of parameters is pretty much the same as in the previous model,
but instead of using a single number for the window size, a list of numbers is now
used, with patterns such as 50-30-20-15-10-5-4-3-2-1, 20-15-10-5, 20-15-10-5-3-1, 100-
80-70-60-50-40-30-25-20-15-10-5 or 25-20-15-10-5. Other patterns checked were
50-49-48-...-3-2-1 with different lengths instead of 50. However, the best results
were obtained by the simplest patterns, like the multiples of 5 up to 20 or 25.
Multiples of other numbers were tried, but the results were not better than with
5, so most of the experiments stuck to this kind of pattern.
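The construction of the network input from one of these patterns can be sketched as follows, assuming the averages are taken over the most recent (already preprocessed) values of the series, as described in section 4.2.1.

```python
def averaged_inputs(values, pattern=(25, 20, 15, 10, 5)):
    """Turn the latest values of the series into one input per pattern entry."""
    inputs = []
    for n in pattern:
        window = values[-n:]              # the n most recent values
        inputs.append(sum(window) / len(window))
    return inputs

# e.g. averaged_inputs(recent_values) yields a 5-neuron input instead of a
# 25-value sliding window, which is what makes these networks faster to train.
```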
First of all, it is important to note that the best experiment using the total of
the actions used the following parameters: starting date, January 2011; pattern,
100-80-70-60-50-40-30-25-20-15-10-5; hidden layer size, 20; learning rate,
0.5; momentum, 0.005. It managed to get a total of $134.4 in 87 movements.
Again, it is not desirable to perform the total of the actions, so only the top 35%
most confident actions will be taken into account.
Using the top 35% of the actions, the parameters that produced the best
result are: starting date, January 2008; pattern, 30-25-20-15-10-5; hidden layer
size, 100; learning rate, 0.5; momentum, 0.02. The final amount of money was
$138.8 in 56 movements. Note that even though only 35% of the 150 actions are
considered, which should mean a maximum of 53, 56 actions are performed. This
is due to the fact that different actions might have the exact same confidence,
and when this happens at the cut-off action that separates the ones considered
from the ones discarded, all the actions with equal confidence are considered to be inside
the threshold.
As with the basic model, the very best experiment will probably not be
the most convenient one, so another pruning process is carried out over the present
set of results. Finally, a good group of results is obtained using the following
boundaries: starting date, January 2008; patterns, 25-20-15-10-5 and 40-39-38-
37-...-4-3-2-1; hidden layer sizes, 100 or lower; learning rates of 0.01 and 0.05;
momentums of 0.0, 0.005 and 0.02. Note how two extremely different input
patterns provide the best results, while patterns very similar to both of them
were discarded because their results were not as good. The remaining set is
formed by more than 50 experiments and, excluding two isolated results that
ended up with a loss of almost $10, the experiments' gains oscillate between
$119.9 and $102.1, with an average of $109.27 including them all.
When it comes to picking the best result, it is generated by the following
parameters: starting date, January 2008; pattern, 40-39-38-37-...-4-3-2-1; hidden
layer size, 100; a learning rate of 0.05 and a momentum of 0.005. The experiment
ended up with a total of $115.1 out of the initial $100, obtained in 45 movements,
and with a cross-entropy error of 1.05. If the experiment is replicated with the
test dataset, the results obtained are: a final amount of money of $106.82 in 24
movements, a 0.28% benefit per action, a 66.7% success rate and a cross-entropy
error of 1.072, quite consistent with the results obtained during the validation
period.
5.3 Forecasting model 
In this section the results obtained with the second modification will be presented,
consisting in the replacement of the three-neuron output by a single
one, changing the model from a classifier to a forecasting model, as explained
in section 4.2.2. The input of the model used is a sliding window, as the
modifications are made to the basic model.
The scan of parameters is done in the same way as for the initial model, as the
only difference is the output, and it cannot be changed during the experiment.
Since this model is not as easy to train as the previous ones, because it is more prone to
diverge, smaller learning rates will be used. Instead of a smallest learning
rate of 0.001 as before, for these experiments the minimum value of this
parameter is ten times lower, the range going from 0.0001 to 0.1.
After all the experiments have been run, the first remarkable fact is that, to
the naked eye, the set of results is more prone to gain money, unlike the results of
previous models, which tended to remain around the initial amount
of money without any clear tendency. In this set of experiments, the parameters
that performed best in terms of final amount of money are the following:
starting date, January 2013; window size, 35; hidden layer size, 25 neurons;
learning rate of 0.1 and no momentum, with a total of $144.4 in the 150 movements,
an MSE of 0.5584 and a success rate of 58%. In the experiments of this
model, all the initial results perform the total of 150 actions, as there is
no keep action available, just the numerical prediction of the market
going up or down, which increases the number of actions performed in comparison to
classifier models.
Taking into account only the top 35% of the actions, the best result is obtained
by: starting date, January 2013; window size, 35; hidden layer size, 25;
learning rate, 0.1; and no momentum. With a total of $139.6 in 143 movements,
it can be said that this is a very poor result: firstly because the highest learning
rate is used together with the shortest period of data, making it not very likely to
be the best performer; and secondly because 143 actions out of 150 are performed
when taking just the top 35%, meaning that more than 100 predictions along
the series have the exact same confidence, which is not a good sign at all.
When pruning the set of results, 88 good ones are obtained with the following
boundaries: starting date, January 2012; window sizes, from 20 to 35; hidden
layer sizes, from 20 to 80; learning rate, 0.0002; and momentums up to 0.1. The
set of results is excellent: except for one execution that, regardless of its momentum,
ended up with $98.4, the amount of money in the set goes from $103.5 in the
worst of the cases to $127.5 in the best of them, slightly better than the results
obtained with the classifiers. On the down side of this selection of results, it
can be noted that the average number of movements in the set is higher than
in previous experiments, mostly 52 movements, and a little more for a few
experiments. An exception is that, for a window size of 35 and a hidden layer of 50,
depending on the momentum the number of movements is 150, 147, 67 or 77,
meaning that no good distinction has been made by using the top 35%. When it
comes to the selection of the best parameters, a good option is as follows: starting
date, January 2012; window size, 20; hidden layer size, 20; learning rate, 0.0002;
and momentum, 0.005. These parameters managed to obtain a total of $118 in
52 movements, meaning an average of 0.32% per action, a success rate of 57.7%,
and an MSE of 0.5.
In comparison to previous models, the number of movements is very high.
In order to mitigate this and try to increase the average benefit per action,
the top 35% will be reduced to the top 20%. Analyzing the same bounded set
of results, the average benefit rates have increased in general; all of them are still
gaining money apart from the same one as before, which is now losing slightly
less, ending up with $98.5. The typical number of actions has gone down from
52 to 30, and the result marked as best now shows: $114 in 30
movements, an MSE of 0.58, a benefit ratio of 0.44% per action and a success rate
of 60%; good results for such a high number of movements.
Applying the best parameters to the test dataset from 1st April to 31st July
2014, the results are rather disappointing, as with the top 20% the final amount
of money is $98.25 in 17 movements, meaning a loss of 0.1% per action. When
the rest of the parameters included in the set of good results are run on the
test dataset, the results do not seem to improve, as there are now more experiments
with losses than with gains. This is because, apart from the problem of
extrapolating results that will be explained in section 7.3, during the training of
both models the series' tendency was bullish, strongly affecting the forecasting
model's training by biasing it towards buy actions, whereas during the test period
it was not. The series' high noise made the learning very difficult, ending up in
a very short range of confidences very close to the average, which in these cases
was positive (bullish series), hence the abundance of buy predictions.
5.4 Overlapping of data 
In the present section the results obtained with the model explained in section
4.2.3, where the regular MLP was replaced by a model with data overlapping,
will be shown and explained. One thing to mention is that the learning rates
used will be larger than those used in previous models, as the maximum
number of training iterations is limited here, while it was not for previous models.
Also, note that this modification is applied to the basic model, a classifier
with three outputs and a sliding window as the input of the network.
When analyzing the results, the parameters performing best in terms
of the final amount of money are the following: starting date, January 2013;
window size of 30 values; hidden layer size of 50 neurons; learning rate, 0.25;
and momentum of 0.02. They ended up with a total of $135.3 in 80 actions,
which is a 0.44% benefit rate with a success ratio of 58.8%. Looking at an overall
picture of all the results and taking the top 35% of the actions, it is worth analyzing
how the amount of money varies depending on the starting date of the data.
One of the problems discussed a priori was the underfitting of the
data, which was expected to be overcome by using longer series, although old
data would be forgotten as newer data is learned.
The results have confirmed this, as Figure 13 shows:
Figure 13: Fluctuation of the experiments' money obtained from September
2013 to March 2014 considering series starting at different points in time.
As can be seen in the figure above, no big differences are apparent to the
naked eye across the different tested starting dates. Networks with series starting
in January 2000 are trained for more than 3500 epochs before the validation
dataset is taken into account, whilst networks trained from January 2013 are
trained for barely 150 epochs, and their results are not that different. This is because,
for older series, the network updates its weights with the newer samples,
forgetting the older ones. Also, it has been shown that networks learn better when starting
with random weights in their connections [14], and due to the high noise
the training on this data is not far from a random initialization, minimizing
the effect of the training carried out on older samples. Note again that for all the
experiments explained in this document, the networks' connection weights have been
initialized with random values between -0.1 and 0.1.
Using the top 35% most confident movements, the best parameters change
to: starting date, January 2008; window size, 25; hidden layer size of 60 neurons;
learning rate of 0.45 and a momentum of 0.02. These parameters obtained
a total of $121.2 in 27 movements, meaning a very good benefit rate of 0.79%
per movement. As in previous sections, when trying to prune the set of results
in order to minimize the loss, the remaining set is quite big, as 144 results
remain with the following constraints: starting dates of January 2006, 2008
and 2010; window sizes of 60, 80 and 100 values; hidden layer sizes of different
values between 20 and 50; learning rates between 0.08 and 0.12; and the absence
of a momentum. In this set the results are not so great, as they oscillate from a
maximum loss of $0.9 to a maximum gain of $4.5.
Finally, the chosen best result comes from the following parameters:
starting date, January 2010; window size, 60 values; hidden layer size, 20 neurons;
learning rate of 0.1 and a momentum of 0. The final amount of money
obtained is $103.1 in 6 movements, a very good benefit ratio of 0.52%
and a success rate of 83.3%, but with very few actions performed,
only 4% of the 150 possible movements. With this group of parameters, the
network trained for the test set obtained $100.8 in two movements, a benefit
ratio of 0.4% and a success rate of 50%. As with the basic model, the results are a
little poor in terms of the number of actions carried out, although the benefit ratio
is still good. Again, no more experiments will be performed, as the following
hybrid systems are a priori more powerful models.
5.5 Summary 
After analyzing the basic model, the two simple modifications and the model
with overlapping of data, some early conclusions can be drawn. First of all, the
comparison in Figure 14 of the best results obtained with the top percentage used
for each model can give an idea of the maximum potential of each experiment.
The top percentages in terms of action confidence are 35% for all the
models except the forecasting one, which uses 20%.
Figure 14: Comparison of each of the experiments' best result through time in 
the second validation dataset. 
Regarding the previous figure, it must be mentioned that the forecasting model
has the advantage of performing more than 140 actions, as explained in section
5.3, while the basic and averages models perform around 55 movements
and the overlapping one only 6. This means that even though the forecasting
model has obtained the greatest amount of money, the other models are probably
better, as their average benefits per action are greater. When it comes to
the chosen experiment of each model, a comparison of the four models can be
seen in Figure 15.
Figure 15: Comparison of the well-generalized results of each of the four basic 
models in the second validation dataset. 
Using the information shown in the present subsection together with the previous
results of the different models, the forecasting model can be discarded in
favor of the other basic models. Also, the overlapping model's results have not
been bad, a fact that motivated the evolution of the model towards the hybrid models
explained earlier. Both the basic and the averages-as-inputs models behaved quite
well and seemed to learn some useful patterns.
As a last important point, the comparison of the different models' results
on the test dataset is shown in Figure 16.
Figure 16: Comparison of the chosen parameter configurations of the different
models applied to the test dataset.
5.6 Hybrid models 
In this subsection the basic results obtained with both hybrid models will be
explained. These models should be more powerful than the model using
overlapping of data, which is an extreme simplification of them, performing
only one training step per sample with a batch size of one. They should also
perform at least as well as the basic model, due to the use of more recent data
to predict each sample. After the results obtained and explained in previous
sections, it was decided to discard the forecasting model and to analyze these
hybrid models with both a sliding window and averages as inputs.
When running these networks, one of the first things to note is that, in general,
the number of actions performed by both models when short series are used
is very low, generally not more than 5 actions out of the possible 150. To mitigate
this, the threshold used to consider an action bullish or bearish, which was
±0.65%, has been decreased to ±0.60%, and some more experiments have been run.
This way more actions are performed, as the number of keep
actions decreases; for instance, an increase of 0.62% that would have been
considered a keep action is now considered a buy action (a short sketch of this
labelling is given after Figure 17). Thresholds lower
than 0.6 were tested as well, but they mostly led to more unstable
networks where all of the actions were either buy or sell, but not a combination
of both with a proper learning of patterns. Figure 17 shows the number of
performed movements for different starting dates, demonstrating
how the number of performed movements increases with older
starting dates, tending to a maximum average of around 100, which is exactly
two thirds of all the possible actions.
Figure 17: Number of performed actions against different starting dates of the
series.
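For clarity, the class labelling with the lowered divider threshold works as in the following sketch; the 0.62% case is the example mentioned above.

```python
def label(daily_change_percent, threshold=0.60):
    """Turn a daily variation (in percent) into one of the three classes."""
    if daily_change_percent > threshold:
        return 'buy'
    if daily_change_percent < -threshold:
        return 'sell'
    return 'keep'

print(label(0.62))        # -> 'buy'  (with the lowered threshold of 0.60%)
print(label(0.62, 0.65))  # -> 'keep' (with the original threshold of 0.65%)
```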
A new issue that comes up with these models is that the confidence threshold
used in previous models is not as reliable now, since a different network is used to
predict each of the samples. Comparing networks that have been trained
with different models and a different number of epochs is not that simple. In
general, all the actions will be performed without bounding the confidence, as not
many are normally carried out, and in section 6 a method for choosing when
to perform the action or to stay away from the market will be shown, although not
from a technical point of view.
5.6.1 With explicit validation 
The first of the hybrid models is explained in section 4.3.1; it is very
similar to the basic model, but generates a new network for each new sample,
using data as recent as possible. As mentioned earlier, the main disadvantage
is the time taken when searching for the ideal parameters, as one network has to be
trained per sample of the second validation period. The experimentation was reduced a
little by using parameters not too different from the good ones obtained in the basic model,
in order to finish the experiments in a reasonable amount of time.
The range of parameters used was the following: the thresholds used for
separating the classes were ±0.60% and ±0.65%; the starting date oscillated
between July 2009 and July 2012; window sizes were between 40 and 140 values,
plus a few of the best patterns from the averages model; hidden layer sizes were between 10
and 100 neurons; learning rates were between 0.003 and 0.011 and momentums lower
than or equal to 0.05. The parameters giving the best results in terms of the
final amount of money were the following: starting date, July 2010; window size
of 80 values and hidden layer of 25 neurons; learning rate of 0.003 and a momentum
of 0.005. The average best epoch of this model's executions was 791 epochs,
making a total of $149 in 106 movements, a rate of 0.38% per movement and a
success rate of 63.2%.
In total, 794 experiments were run, with pretty good results, as 577 configurations
managed to earn money while 217 ended up with a balance lower
than $100, meaning that 72.77% of the experiments were positive, whereas
in previous experiments this percentage was closer to 50%. One of the most
influential parameters is the starting date, and according to it the result set was
bounded to keep only experiments with starting dates of November 2011,
January 2012, and February 2012. Setting the class threshold to ±0.60% as well,
the number of experiments gets reduced to 288, with only 10 of them ending up
with a negative economic balance.
When choosing the best result, a group of experiments obtained outstanding
results, formed by a window size of 140, a hidden layer of 20, a momentum of 0.05
and different starting dates and learning rates. Six experiments are in this set,
all with the same results: $104.5 in 5 movements, a very good ratio of 0.88% per
action. The ratio is outstanding, but taking a configuration with such a small number
of movements might be risky, mainly when a priori better options show up in
the rest of the set.
Finally, the experiment tagged as best was the one where each parameter
performed its best within the total set. These parameters were as follows: starting date,
January 2012; window size, 80 values; hidden layer size, 60 neurons; learning
rate of 0.07 and a momentum of 0.05. This execution managed to earn a total
of $13.6 on top of the initial $100 in 32 movements, meaning a benefit rate of
0.4% and a success rate of 65.6%. The average best epoch of the 150 networks
trained for the prediction was 391.6 epochs.
When applying these ideal parameters to the test dataset, the starting date
is moved to August 2012 in order to keep the series length constant, and the
results obtained are as follows: average best epoch, 152.24; a total of $102.52,
meaning a benefit of $2.52 in 5 movements; a benefit rate of 0.505% and a
success rate of 80%, as 4 out of 5 actions were right.
5.6.2 With implicit validation 
The last set of experiments corresponds to the second hybrid model, where
the use of the validation 1 dataset is completely skipped when training the networks
used to predict new values, as explained in section 4.3.2. Apart from the
already explained advantage of training on data that is chronologically
closer to the values being predicted, another positive aspect of the present model
is that the results obtained on the second validation dataset should be more reliable.
This is because the network that minimizes the cross-entropy error is not
used directly; instead, its settings are applied to a different set of data, which increases
the importance of both the parameters and the number of training epochs. Theoretically,
this is a good first step to mitigate the problem of the extrapolation
of results present in other models, which will be explained in section 7.3, as
something like a pre-testing phase is being done before the actual test of the
results.
With the current model, the ranges of parameters used for scanning are
pretty much the same as in the hybrid model with explicit validation, as theoretically
the results should not be too different. Similarly to the previous model's
results, very few actions are performed when ±0.65% is the threshold used to
separate the three classes, so again most of the experiments will be
performed using ±0.60% as the divider threshold for deciding the actions.
Coming to the actual results, this model shows fairly good general
results, as 182 out of 204 experiments obtained a positive rate, meaning that
89.2% of the experiments managed to gain some money, while the other 10.8% ended
up with less than the initial amount. Again, the results give this method a certain
reliability, as the differences between similar parameter configurations are
very smooth. Also, the average cross-entropy error is low, the maximum being
no more than 1.08, while in other models it was not uncommon to have samples
with extremely high errors, meaning that no convergence was achieved.
The results improve when only experiments using 1st November 2011 as their
starting date are taken into account. In this new set of 50 results, the worst
case managed to get $103.9 in 32 movements, meaning a rate of 0.12%
per movement, and its cross-entropy error was 1.0602. In this set, obviously,
100% of the experiments ended up with a money gain, as the minimum gain
was $3.9, while the maximum amount of money was obtained with the following
parameters: starting date, November 2011; window size, 90 values; hidden layer
size, 40 neurons; learning rate of 0.09 and a momentum of 0.05. With these
parameters the cross-entropy error went down to 1.057, and the amount of
money obtained was $112.6 in 22 movements, meaning a benefit ratio of
0.55%. The success rate was 63.6%, and the average best epoch while
training was 294 iterations.
As in previous sections, these parameters need to be applied to the test dataset,
with the only difference being the starting date, which moves from November
2011 to June 2012 in order to keep the length of the series. When testing the
series there is a big drop in the results, as the money obtained was $100.3,
but using only one movement out of the 90 possible actions, which means that
the ratio of benefit per action is not too bad, 0.3% per movement. In this case
the cross-entropy error went up to 1.063, which is still good, while the average
best epoch was 78 iterations per trained network. The results are not as good
as expected after the optimism generated in the validation period, mainly because
the number of actions is too low, as in some previous cases.
Finally, after all the models have been considered and their results shown,
and as a continuation of the summary in section 5.5, it can be said that
in general terms the hybrid models have been more reliable than the basic ones.
Simpler models such as the basic one or the one modified to use averages as inputs
have managed to get more money in both the validation and test datasets,
but the hybrid models have managed to keep the cross-entropy error lower,
avoiding irregularities. Also, more networks were taken into account for each
series execution, meaning that isolated good experiments might be obtained
by luck, but this is not likely to happen when considering the average of
150 networks. Lastly, because of the problem's nature, reliability is
extremely critical, meaning that the models with a more practical
application would be the hybrid ones, specifically the one using an implicit
validation dataset, as its results can be trusted more easily due to the use
of an extra series of values for training, which gives something similar to an
extra test phase.
6 Combination of models 
Previous sections have shown how one single financial series can be predicted
using different methods, with their corresponding results. It has been shown
that normally, when not many actions are performed in a series, the
results tend to be better than when using a lot of them, either by choosing
actions according to their confidence or by reducing the decision boundaries for the
buy/keep/sell classes; Table 4 clearly illustrates this. But what if more actions are to be
performed without losing performance? One of the solutions applicable in a real
case would be the use of more than one series, and this is what is going to be
explained in this section.
The series used to evaluate all the methods of the present document was
Abengoa Abertis (ABE.MC in Yahoo Finance), as mentioned at the beginning
of it. In order to expand the experiments, the rest of the series included in the
IBEX35 will be used, as they have similar behaviors due to their strong influence
by the Spanish economy. The idea of this section is to give a simple demonstration
of how to combine different series in a practical way, so the training will not
be as deep as in previous sections, and the parameters used for training the networks
of the different series will be the same for all of them. The list of series
considered, represented by their Yahoo Finance codes, is the following:
ABE.MC BME.MC GAS.MC MAP.MC SAN.MC 
ACS.MC CABK.MC GRF.MC MTS.MC SCYR.MC 
AMS.MC DIA.MC IAG.MC OHL.MC TEF.MC 
ANA.MC ENG.MC IBE.MC POP.MC TL5.MC 
BBVA.MC FCC.MC IDR.MC REE.MC TRE.MC 
BKIA.MC FER.MC ITX.MC REP.MC VIS.MC 
BKT.MC GAM.MC JAZ.MC SAB.MC 
Table 5: List of the stock market series' codes considered in this model. 
The IBEX35 index is composed of 35 different stocks, but as the ABG-P.MC
stock has not been trading on the Spanish exchange market for more than two
years, the easiest solution was to stop considering it, as the corpus of 34 series
is big enough.
When starting the technical part, the first issue to appear comes from the
splitting of the data. For the initial series, different thresholds such as ±0.60 or ±0.65 were
tested manually, but problems now appear when using the same threshold
for all the series, as a threshold that is good for one series might mean something completely
different for another. The solution applied for this issue consists in automatically moving
the threshold for each starting date of each series in order to
minimize the standard deviation of the sizes of the three classes (buy, keep, sell). This is
done by a small algorithm that takes the greatest of the series' differences
in absolute value as the starting threshold and moves it down repeatedly, keeping the
threshold for which the standard deviation is minimum. For instance, consider the following simple
series shown in Table 6:
Day Difference
1 1.5%
2 -0.6%
3 0.0%
4 -1.1%
5 0.7%
6 1.9%
7 -3.6%
8 -0.9%
9 4.0%
10 1.2%
Table 6: Example series with daily variations through 10 days. 
In the example above a series of ten values is considered, so ten iterations
would be needed. The algorithm would proceed as Table 7 shows:
Iteration Threshold Buy Keep Sell StD Best StD Best Threshold 
0 ±∞ 0 10 0 5.77 5.77 ±∞
1 ±4.0 1 9 0 4.93 4.93 ±4.0 
2 ±3.6 1 8 1 4.04 4.04 ±3.6 
3 ±1.9 2 7 1 3.21 3.21 ±1.9 
4 ±1.5 3 6 1 2.52 2.52 ±1.5 
5 ±1.2 4 5 1 2.08 2.08 ±1.2 
6 ±1.1 4 4 2 1.15 1.15 ±1.1 
7 ±0.9 4 3 3 0.58 0.58 ±0.9 
8 ±0.7 5 2 3 1.53 0.58 ±0.9 
9 ±0.6 5 1 4 2.08 0.58 ±0.9 
10 ±0.0 6 0 4 3.06 0.58 ±0.9 
Table 7: Example execution of the algorithm for choosing the classification
threshold of a series according to the standard deviation of the classified samples.
In the example it can be seen that the threshold giving the best distribution
of the data is ±0.9, which brings the standard deviation down to 0.58, and
it would be the one used for this ten-value series.
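A runnable sketch of this selection rule is shown below; it assumes that values exactly at the threshold count as buy or sell, which reproduces the counts of Table 7.

```python
from statistics import stdev

def class_counts(differences, threshold):
    buy = keep = sell = 0
    for d in differences:
        if d >= threshold:
            buy += 1
        elif d <= -threshold:
            sell += 1
        else:
            keep += 1
    return buy, keep, sell

def choose_threshold(differences):
    """Pick the threshold whose class sizes have the lowest standard deviation."""
    best_t = float('inf')                          # iteration 0: everything is 'keep'
    best_s = stdev([0, len(differences), 0])
    for t in sorted({abs(d) for d in differences}, reverse=True):
        s = stdev(class_counts(differences, t))
        if s < best_s:
            best_t, best_s = t, s
    return best_t, best_s

diffs = [1.5, -0.6, 0.0, -1.1, 0.7, 1.9, -3.6, -0.9, 4.0, 1.2]
print(choose_threshold(diffs))   # -> (0.9, 0.577...), as in Table 7
```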
Once the splitting into classes is understood, the methodology for choosing the
action to perform each day is considered. A few different formulas will be tested, starting
from simple ones, such as just taking the action with the maximum confidence, to
more complex ones, such as using the last 50 actions and calculating the final money by
simulating the series, or using the average benefit ratio. These different approaches
will be explained in more depth in the following subsection.
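The simplest of these selection rules can be sketched as follows: each day, among the predictions of the 34 series, only the buy or sell action with the highest confidence is performed; the more elaborate rules replace the selection key but not the overall scheme. The names below are illustrative.

```python
def pick_action(candidates):
    """candidates: (series_code, action, confidence) triples predicted for one day."""
    active = [c for c in candidates if c[1] != 'keep']
    if not active:
        return None                       # stay away from the market that day
    return max(active, key=lambda c: c[2])

# e.g. pick_action([('ABE.MC', 'buy', 0.71), ('TEF.MC', 'keep', 0.55),
#                   ('SAN.MC', 'sell', 0.64)])  ->  ('ABE.MC', 'buy', 0.71)
```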
As a last concern of this model's preparation, the training method comes
into play. The same methods used for previous models will be applied here:
given a set of parameters, all the samples forming the validation dataset will be
predicted and their results summarized. The average of these result summaries
will be used as a measurement for the given set of parameters, in order to choose
the most suitable set.
  • 1. The Use of Neural Networks for Tendency Prediction in Financial Series September 20, 2014 Estudio del uso de redes neuronales en la predicción de tendencias en series de nanzas Proyecto n de carrera Universidad Politécnica de Valencia Escuela Técnica Superior de Ingeniería Informática Author: Juan Francisco Muñoz Castro Director: Salvador España Boquera Co-director: Francisco Zamora Martínez i
  • 2. Abstract In the present project, a comparison of dierent types of articial neu- ral networks has been used to analyze their behavior with noisy time series prediction, with the goal of maximizing the benet obtainable by investing in them. To do so, a wide range of datasets has been used, containing stock market prices until September 2014 and starting from January 2000 on- wards. The starting experiment has been a regular multilayer perceptron using a sliding window of the latest values as the input of the network and three outputs representing three possible actions as buy, sell or keep. Fur- ther experiments have been tested, such as the replacement of the three outputs classier by a single one, converting the system in a forecasting model with only one output, or the use of dierent averages of recent val- ues instead of a simple sliding window as the network's input. Also, it has been tested the use of a single dataset from where each sample is used rst to test and validate, and to train the network later on in a new step instead of the traditional way of training-validation-test splitting of data. Finally, two new models that seize all the data have been tested, one with a specic period of data validation, and the other one with an implicit period, as it has been skipped by doing some networks pre-training. After a comprehensive applying of these methods to the time series, certain pre- dictability was found. Some networks were able to predict the direction of change for the next day with an error rate of around the 40%, which in some optimistic cases decreases to about 30% when rejecting examples where the system has low condence in its prediction. A practical simu- lation has been explained, showing an average gain close to the 0.33% by acting the half of the times. ii
  • 3. Contents 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Stock market basics . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Structure of this report . . . . . . . . . . . . . . . . . . . . . . . 2 2 Time series prediction 3 2.1 Articial Neural Network basics . . . . . . . . . . . . . . . . . . . 3 2.2 Dierent techniques . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Proposed approach . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 Experimentation process 7 3.1 Tools used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Basic strategy and data used . . . . . . . . . . . . . . . . . . . . 7 3.3 Performance measurement . . . . . . . . . . . . . . . . . . . . . . 8 3.4 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.5 Post-process of the data . . . . . . . . . . . . . . . . . . . . . . . 11 4 Models used 13 4.1 The basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.2 Variants of the model . . . . . . . . . . . . . . . . . . . . . . . . 14 4.2.1 Sliding window vs Averages as inputs . . . . . . . . . . . . 14 4.2.2 Three-class classier vs Forecasting model . . . . . . . . . 15 4.2.3 Traditional MLP vs Model with data overlapping . . . . . 17 4.3 Hybrid model with overlapping of data . . . . . . . . . . . . . . . 18 4.3.1 Explicit validation dataset . . . . . . . . . . . . . . . . . . 19 4.3.2 Implicit validation dataset . . . . . . . . . . . . . . . . . . 20 5 Results 23 5.1 Basic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 5.2 Averages as inputs . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5.3 Forecasting model . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5.4 Overlapping of data . . . . . . . . . . . . . . . . . . . . . . . . . 30 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.6 Hybrid models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.6.1 With explicit validation . . . . . . . . . . . . . . . . . . . 36 5.6.2 With implicit validation . . . . . . . . . . . . . . . . . . . 37 6 Combination of models 39 6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 7 Problems encountered 46 7.1 High noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 7.2 Overtting and Undertting . . . . . . . . . . . . . . . . . . . . . 47 7.3 Extrapolation of the results . . . . . . . . . . . . . . . . . . . . . 50 8 Conclusions 52 9 Future work 54 iii
  • 4. 1 Introduction 1.1 Motivation Since the existence of a stock market exchange, this has been one the most important indicators or even predictors of the economy in worldwide terms. With an average daily trading value of 169 billion dollars during 2013 just in the New York Stock Exchange, this indicator shows how important for the economy is. Because of this, so many attempts to predict it have been made, some more successful than others, but never with outstanding results. In fact, the idea that the market is completely unpredictable is widely accepted, mainly because its value is driven by news, which is unpredictable by denition, and would make following values of the stock market depend exclusively on the present and future, never on the past. This idea is asserted as the ecient-market hypothesis (EMH), which states that stocks always trade their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks for inated prices. In contradiction to the EMH, there are two main types of analysis: funda-mental, which is the process of looking at a business at the basic nancial level; and technical, which is the methodology for forecasting the direction of prices through the study of past market data. Numerous articles have been published, based in the technical analysis, sev-eral of them using Articial Neural Networks, which show certain predictability in these nancial series in contrast to the previous statement, which enhances the initial motivation of this project. In this area is where the present project has been developed, trying to predict the tendency of dierent recent stock mar-ket instruments, comparing the results for each technique as well as determining the predictability of the dierent instruments coming from dierent scopes. Apart from merely technical reasons, there is the basic root motif consistent on the nancial gain. A system able to determine the trends of the market with good reliability is an extraordinary tool that many investors and researchers are continuously looking for in the search for the high benets of their investments. 1.2 Objectives The objective of this project is to experiment, analyze and explain how dierent types of Articial Neural Networks can predict future values of nancial series, based on the technical analysis that simply uses the historical prices. Provided with datasets of daily market data, will be assumed that one action can be carried out per day at the stock market opening time that will be canceled at the end of the same day. With this premise, the main objective is to maximize the benet by investing a given amount of $100, in terms of the information used as the ratio of benet per movement or the percentage of success from a nancial angle. From a more technical perspective we will analyze the behavior of the parameters that aect the evolution of the Neural Networks, both input parameters and output measurements. 1
  • 5. 1.3 Stock market basics First of all the denition of a stock exchange will be given, which according to Wikipedia is a form of exchange that provides services for stock brokers and traders to trade stocks, bonds, and other securities. There are two possible ways of taking part in the stock market: ˆ Buying stocks: the current price of the stock is paid and whenever this stock is sold the money worth of that stock is simply given to the investor, so if the stock has increased its price this dierence will be gained, if it has decreased its price, the dierence will be lost. ˆ Short selling stocks: in this case, the investor is lent stocks that are sold instantly, with the commitment that he will have to give these stocks back; therefore it will be necessary to eventually buy them again in order to return them to the lender. In colloquial terms, it can be said that this is a betting for the stock market to go down; the lower down it goes, the more benets the investor gets, but also the further up it goes, the more money will be lost. ˆ It could be considered as a third action to stay away of the market as there is no need to always be actively participative in the market, and this is probably the most important part of investing; knowing when to stay away. This way money is kept, so there is no risk as well as no possible benet. Any non-professional investor can freely buy and sell any kind of instruments using a broker as an intermediary, which typically is a computer software. There are plenty of available programs online, and they mostly work with commissions, meaning that they keep a small amount of money for each transaction the client makes. This is one of the main obstacles found if someone wants to get hands on with non-professional investing, the initial negative odds. A standard broker charges around the 0.01% of each transaction, either if buying or short selling stocks, to have an initial idea of the taxes these programs operate with. On one hand, in the long term this will become a large amount of money taken, and on the other the fact that a random investing strategy when the market keeps stable in a long term, will be very prone to end up with loses. 1.4 Structure of this report The present document will be divided as follows: In section 1 a brief introduction has been exposed, together with some basics of the stock market; section 2 will explain the basics of the time series prediction, mainly regarding neural networks. The experimentation process will be explained in section 3 and the models used during this process in section 4, with their results in section 5. A combined model will be explained in section 6; an overview of the problems found will be shown in section 7; and at the end of the document the conclusions and some future interesting work will be exposed, in sections 8 and 9. 2
  • 6. 2 Time series prediction 2.1 Articial Neural Network basics Before starting to go through the background of the dierent approaches, a quick overview of Articial Neural Networks (ANN) should be given, due to it being one of the basic common tools of several approaches. An ANN is a com-putational model capable of machine learning generally presented as systems of interconnected neurons which can compute values from inputs. These neurons harbor numerical values and are typically grouped in sets called layers. A min-imum of two layers is needed to set a neural network, one to read the inputs, with one neuron per input value; and another one to write the outputs, with one neuron per output as well. One of the most popular types of network is the multilayer perceptron, where every neuron of a layer is connected in only one direction to every neuron of the following layer, so that each neuron is reached by the all neurons of the predecessor layer and reaches all the neurons of the following layer if any. Every layer that is not the input or output one is called a hidden layer, and an ANN can consist of one or more hidden layers. Figure 1 shows the architecture of these multilayer articial neuron networks: Figure 1: Basic Articial Neural Network with one hidden layer. In the Figure is shown an Articial Neural Network with an input layer, X of n neurons; a hidden layer Z with p neurons; and an output layer Y with m neurons. Each single connection contains a weight, shown in the graph as V or W and two subscripts representing the reached and the reacher neurons' 3
  • 7. positions in their corresponding layers. To compute the output layer neurons' values, the following formula is applied to every neuron of every layer, in order from the input layer to the output layer, updating each neuron's value p_i to a after the formula is calculated: a = f( Σ_{i=1}^{m} p_i · w_i + b ) This way each layer requires the completion of the predecessor layer's computations. Also, a final function f is applied to the output value with the aim of reaching a better or quicker learning. Typical functions are the linear or the sigmoid one [6], the latter emulating the behavior of the step function, which would provide a more aggressive learning, as its output would always be either 1 or 0. The formula of the sigmoid function is as follows: σ(t) = 1 / (1 + e^{-βt})
  • 8. Where the greater the beta is, the closer the sigmoid is to the ideal step function, but a too large beta will lead to a longer computational time. In Figure 2 the difference between a sigmoid function with beta=1 and the step function can be verified. Figure 2: Sigmoid (left) and step (right) functions. As the training of the networks is a bit more complex and is not essential for the understanding of this document, we will not go into too much mathematical detail. Suffice it to mention that the most common way to train the network is with backpropagation of errors, starting from the output layer and moving towards the input layer, where the gradient of a loss function is calculated with respect to all the weights in the network. This gradient is afterwards used to update the weights of the connections, together with some parameters such as the learning rate or the momentum, which tune the network with the aim of making it more accurate. Further information can be found in plenty of books and articles [4][10][11][13]. The learning rate is a ratio that is multiplied by the gradient to update the weights of the neurons. It influences the quality and speed of the 4
  • 9. training: the greater the learning rate is, the quicker the network will learn, but the lower the ratio, the more accurate training. In Figure 3 a small learning rate is shown in the left, where the problem converges very slowly, and a learning rate too big is shown in the right image, where the problem diverges. Both dierent learning rates are applied to a same problem where the aim is to nd the minimum error (x axis) with dierent results. Figure 3: Repercussion of a small value for the learning rate (left) and a too large one (right) over a training curve. The momentum is a parameter that represents what could be called the inertia of the learning, extending the actual learning in a proportion given by this parameter. A momentum equal to zero does not aect the original learning of the net, and a greater momentum allows it to train faster and might avoid the network getting stuck in local minimums. On the other hand, a momentum too big means that the ANN will learn too fast and will probably miss the global minimum that the network is looking for. The utility of the neural networks mainly resides in the fact that they can be used to infer a function from observations. Dierent scopes where articial neural networks can be applied are as wide as pattern recognition, game-playing decision making, spam ltering, sequence recognition and many more. 2.2 Dierent techniques For the concerning problem, a lot of approaches have been made in order to predict the tendency of markets. In terms of Articial Neural Networks, most articles focus on Recurrent Neural Networks, which are a kind of network where connections between units form a directed cycle. This creates an internal state which allows the network to exhibit dynamical temporal behavior [7]. These kinds of networks are suitable for predicting time series, but their main with-drawal is the diculty they have in converging, which becomes a bigger problem with high noisy series such as stock market ones. Dierent processes have been applied to these networks to improve their results like self-organizing maps or grammatical inference [9]. 5
  • 10. Other techniques that dier from ANN have been used as well, such as Sup-port Vector Machines [12], Genetic Algorithms [8], or combinations of dierent models, techniques and approaches in order to maximize the results. Popular models in this area to combine results are boosting and bagging [15], which are like an add-on of the initial models to try to perform better. 2.3 Proposed approach After verifying several types of approaches together with their results and com-plexity, the decision made was to start the experiments with a simple regular multilayer perceptron (MLP), using backpropagation as its training method. Just a regular neural network is a relatively simple tool that has a good pre-dictive potential if the data is well organized, and this together with the fact none of the methods mentioned in the above section have shown outstanding results even though they are more complex, leads to the use of an initial MLP to perform this task. Afterwards, some modications will be added to the basic model with the objective of improving its performance, which will be explained in further sections, and comprise things as replacing the initial input layer from a list of values by dierent averages of the values, or the output layer from a binary vector to a single rational number. Additionally, modications in the architecture will be considered, as well as an exhaustive scan of the dierent parameters that might aect the results obtained. Slightly more complex mod-i cations will be made, like the substitution of the traditional way of splitting the data to train and test the network by a new model where an overlapping of samples is considered with the goal of seizing the data better, or a hybrid system in between the traditional model and the overlapping one. 6
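As a reference for the models discussed in the following sections, the sketch below illustrates in Python with NumPy the forward pass and the backpropagation update of a one-hidden-layer perceptron as summarised in section 2.1. It is only an illustrative toy, not the software actually used for the experiments (described in section 3.1); the layer sizes, data, loss function (plain MSE here for brevity) and parameter values are arbitrary assumptions made for the example.

    import numpy as np

    def sigmoid(t, beta=1.0):
        # Logistic activation; a larger beta brings it closer to the step function.
        return 1.0 / (1.0 + np.exp(-beta * t))

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 4, 8, 3                 # arbitrary sizes for the example
    W1 = rng.uniform(-0.1, 0.1, (n_in, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-0.1, 0.1, (n_hidden, n_out))
    b2 = np.zeros(n_out)

    x = rng.normal(size=n_in)                       # one input pattern
    y = np.array([0.0, 1.0, 0.0])                   # one three-class target pattern

    lr, momentum = 0.05, 0.005                      # learning rate and momentum
    prev = [np.zeros_like(W1), np.zeros_like(W2)]

    for epoch in range(100):
        # Forward pass: a = f(sum_i p_i * w_i + b) at every layer.
        h = sigmoid(x @ W1 + b1)
        o = sigmoid(h @ W2 + b2)
        # Backward pass (MSE loss for simplicity): gradients w.r.t. all weights.
        delta_o = (o - y) * o * (1.0 - o)
        delta_h = (delta_o @ W2.T) * h * (1.0 - h)
        g2 = np.outer(h, delta_o)
        g1 = np.outer(x, delta_h)
        # Update with the learning rate plus a momentum term (the "inertia" above).
        upd2 = -lr * g2 + momentum * prev[1]
        upd1 = -lr * g1 + momentum * prev[0]
        W2 += upd2; W1 += upd1
        b2 += -lr * delta_o; b1 += -lr * delta_h
        prev = [upd1, upd2]

A too large learning rate makes the updates overshoot (divergence), while a very small one makes the loop above converge slowly, which is exactly the trade-off shown in Figure 3.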
  • 11. 3 Experimentation process 3.1 Tools used To carry out all the experimentation of the project, many tools and elements have been considered and several of them used. One of the most important parts, as mentioned in previous sections, has been the platform of Yahoo Finance, which gathers historical data from the main stock markets and allows anyone to download it. About the software used, the rst attempt was to use Theano, a Python library, but after a few experiments it was decided to change to APRIL-ANN [1], which is based in the scripting language LUA [2]. It was mainly chosen because it is developed solely for working with Articial Neural Networks, in order to improve in terms of eciency. It was additionally chosen due to the fact that both director and co-director of the present project are taking part in its development. All the pre-process and post-process of the data has been done with Python, for the mere reason of familiarity with it and it being a powerful scripting language. Dierent external Python libraries have been used for dierent purposes, such as the library urllib for downloading the data from the yahoo platform, the library csv for working with such les, multiprocessing and threading to speed up the process or typical handy Python libraries as math or collections. These were all set in a Intel Core 2 Duo (2,00 GHz) with 4 GB RAM, running Ubuntu Linux 13.10. 3.2 Basic strategy and data used The stock market oers many possibilities, permitting investors to buy, sell and keep whatever and whenever he wants. It is for this reason that is needed to put some boundaries in the system before establishing a forecasting model, so that it can be studied more easily. Given that one of the most popular sources of public historical stock market data is Yahoo Finance, this platform will be used, as it has daily data available since the early nineties. The daily data provided by Yahoo Finance has for each day its day, opening value, max value, min value, closing value, its real volume and an adjusted closed value, which is the closing value modied when dividends are paid. Back to the system, the boundaries will set as follows: ˆ From the historical data, only the date and the percentage of change will be used, which is calculated as the relative dierence between the adjusted closing value of one day in respect to the same value of the day before. ˆ The investing strategy will be to perform an action at the opening time and keep it until the closing time of the same day. This means that the adjusted closing value of the day before will be used as the initial stock value and the adjusted closing value of the current day as the last value. ˆ The focus will be put in trying to predict the direction of change of the market, instead of predicting the value itself. Empathizing on the practical 7
  • 12. nancial way more than in the precision of the predictions, although both measurements are closely related. ˆ All the historical data until one point is available to predict the direction of change of that point, meaning that if tomorrow's change wants to be predicted, all the data until today would be available. ˆ The initial date used as the beginning of the data will start from dierent points in time for dierent experiments, but it will never be older than 1st January 2000. ˆ For the rst experiments a stock from the Spanish stock market IBEX 35 has been used, arbitrarily chosen by alphabetical order as a regular stock, Abengoa Abertis. Other series have been analyzed below to get a better understanding of the series' predictability. 3.3 Performance measurement The rst thing that has to be set is a common evaluating model for all the ex-periments. Talking about ANN, the main measurement is the error obtained in validation and test datasets, which represents how well the network has learned the samples of these datasets. Typical errors that have been used in this ex-periment include the Mean Squared Error (MSE) which is the square of the dierence between the estimator and what is estimated; or the cross-entropy error method, which gives an estimation of how similar two distributions are. From a more nancial point of view, dierent ways of measurement are needed, which go further than purely mathematical ones. One of them is the percentage of success, which basically is how often the selected action is right. The main disadvantage of this method is that not all the actions have the same eect, for instance, assuming four days when the market goes up 0.1% and a fth when it goes down 3.4%. The success rate would be of the 80%, but more than the 3% of the money would have been lost. This is not the most common of the cases, but it is something worthy of being considered. Another way of measuring the eectiveness, which has been used in several articles regarding stock market prediction is a simulation of the actions. Sup-posing an initial capital of $100, the actions predicted by the system are applied to this amount of money, which will be modied according to the real series' uctuations. This method gives a very simple idea of how the system performs. The main disadvantage of this method is that it does not consider the number of actions performed. For instance, a nal amount of $115 can be fairly good if just 10 actions have been undertaken, but it is a terrible result when 400 actions have been performed, mainly because of the tax applied by the brokers, as explained previously, which will end up in a loss of money. It can be pro-posed as a solution to this disadvantage to divide the dierence between the initial $100 and the nal amount by the number of actions undertaken, but the problem would then be that the more money you are moving, the more impact this action receives, which would not be fair either. 8
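The measurements discussed in this section can be summarised in a few lines of Python. The sketch below is only illustrative, not the project's post-processing script: given the predicted action for each day and the real percentage change of that day, it computes the simulated final money starting from $100, the number of actions, the average benefit per action and the success rate. The example values at the end are hypothetical.

    def evaluate(actions, changes):
        """actions: 'buy', 'sell' or 'keep' per day; changes: real % change per day."""
        money = 100.0
        moves, hits, rate_sum = 0, 0, 0.0
        for action, change in zip(actions, changes):
            if action == "keep":
                continue
            moves += 1
            # A buy gains the day's change, a (short) sell gains the opposite.
            gained = change if action == "buy" else -change
            money *= 1.0 + gained / 100.0
            rate_sum += gained
            hits += gained > 0
        return {
            "final_money": round(money, 2),
            "actions": moves,
            "benefit_per_action": round(rate_sum / moves, 3) if moves else 0.0,
            "success_rate": round(100.0 * hits / moves, 1) if moves else 0.0,
        }

    # Hypothetical example: four trading days.
    print(evaluate(["buy", "keep", "sell", "buy"], [0.8, -0.3, -1.2, -0.5]))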
  • 13. As a last practical error measurement we can use the average rate obtained in the simulation. Each time an action is performed the original dierence would be added to the rate if the action is right and subtracted if it is wrong, dividing this value by the total number of actions performed. This way an average percentage of the gain per transaction will be obtained with the main disadvantage of not knowing the number of actions. For instance, a rate of 0.7% in 50 actions over a period of 3 months is better than a rate of 1.2% in the same period when just one action has been performed. Something to consider here is that a positive rate does not always result in benets at the end. An example of an extreme case would be, starting with $100, rst getting a +60% ($160) and then losing the 40% ($64) would mean a sublime ratio of +20% but a loss of $4 at the end. To sum up, there is no perfect error measurement for this problem, but there are several ways that combined can give a very good idea of how the system performs. All of them will be used in order to contrast the results obtained in each of them, mainly focusing on the last one of ratio of benets but always keeping an eye on the number of movements carried out. Lastly, the results will be compared with few simple strategies as a Random Walk or the evolution of the market, which would be buying the stocks the rst day of the determined period and keeping them until the last. 3.4 Data preprocessing Before starting with the experimentation itself, the data must be shaped in a way that can be easily read by APRIL-ANN. The rst thing to do is down-load the historical nancial series, as we mentioned above, of Abengoa Abertis (ABE.MC in Yahoo Finance), due to it being a regular share in the Span-ish Stock Exchange, and the data interval will be from 1stJanuary 2000 to 1stApril 2014. The period of time used only to predict will be starting on the 1stSeptember 2013 onwards, or a total of 7 months or 151 days of activity in the Madrid Stock Exchange, and will be the same for all series, such that the results can be compared afterwards. The rst thing to do is to represent each single day of this more than 14 years series as its date plus a single number representing the relative dierence in respect to the day before. With this, we are losing the rst and last elements of the series, because we do not know the dierence between the 1stJanuary 2000 and the last day of 1999, and the same applies to April 2014, leaving us now 150 days of activity. Nevertheless, it is still far better than using absolute prices of the shares, which can vary in terms of magnitude in a matter of days. Once each single day's price dierence of the series is calculated, the input and output of the network have to be generated from them. In the rst and most basic experiment, a regular multilayer perceptron will be used, where the input of the network consists in a sliding window along the series of length N. With this method, the input of the network will be the values from the time t-N to t for predicting the value of t+1 as the Figure 4 shows: 9
  • 14. Figure 4: Time line showing the sliding window used in order to predict the values of t+1 and t+2 respectively. When the value t is available, the window from t-N to t is used to calculate t+1, and when t+1 is available, the window slides one position, from t-N+1 to t+1, in order to calculate t+2. The output used in this first model consists of a binary vector of three elements for each sample, representing the ideal action to perform on that day, according to the tendency of the series: down {1,0,0}, remains {0,1,0} or up {0,0,1}. As the aim is to maximize the benefits, the market going down will be understood as a sign of selling, the market remaining constant as a sign to not perform any action, and the market going up as a sign of buying. The threshold used to decide when to remain is Abs(value) < 0.65, meaning that the ideal action would be buying when the share increases its price by more than 0.65%, selling when the share's price change is -0.65% or lower, and remaining inactive otherwise. With this distribution there will be approximately one third of each of the actions along the series. Another matter to consider is the initial length of the series to analyze, as well as other parameters like the size of the sliding window N. Different values for both these parameters will be tested and analyzed in further sections, but the concern for the moment is the repercussion of these parameters on the final length of our series. A starting date for the series will be needed, meaning that no data prior to that date will be available for the experiments at all, and the window size will need some initial data before the first sample is available. For instance, if the starting date is the 1st October 2012 and the window size is 4, the first sample available will be on the 4th October, because the first 4 days will be used to generate this sample. The second sample will contemplate the values from the 2nd to the 5th of October and so on. To sum up, it should be kept in mind that the window size has to be subtracted from the initial length of the series to obtain its final length, something that has the potential to cause problems if it is not considered, mainly when using big window sizes and/or recent starting dates. The last important part of the preprocessing of the data is its splitting. As previously mentioned, this first experiment will be a simple multilayer perceptron, such that data must be split in three datasets: one for the training of the network, another for a first validation of this trained network, and a third dataset for a second validation of the system, which comprises a fixed length from 1st September 2013 to 31st March 2014 in all the experiments, regardless of the size of the other datasets used. The remaining data, includ- 10
  • 15. ing all the samples older than the second validation period will be split into training-validation 1 with the proportion 0.75 for the rst, and the remaining 0.25 for validating the trained system. The data from April onwards will be used afterwards to test the network selected in base to its performance in the validation 2 dataset, as can be seen in the Figure 5. Figure 5: Time frame where the splitting of the data is shown for an undeter-mined starting date. The problem mentioned previously can appear with this way of splitting data, as depending on the starting date and the window size the number of samples might not be enough to cover the whole set of needed dates for the second validation. Or the remaining data for training and rst validation might not be large enough after removing the validation 2 samples. In these cases, the experiments will simply not be considered. For example, it would not make sense to set an experiment with data from the 1stJuly 2013 and a window size of 30, basically because the dataset for training and validation 1 will be of just 14 samples (10 training and 4 rst validation), and the samples for the second validation will still be 150. As a nal comment, it should be remarked that the data classied as vali-dation 2 is the typical testing dataset in the train-validation-test splitting, but a further test dataset will be used, and the best parameters will be chosen in order to maximize the results obtained within this validation 2 period. The real potential of the experiments will be shown in the test dataset in a more recent period from the 1stApril 2014 onwards. 3.5 Post-process of the data Another highly important point of the experimentation is to process the data after the networks have been trained, matter covered along this subsection. First, immediately after the execution of the training, the second validation dataset will be processed by the best network according the rst validation dataset and its error outputted in a summary le that will be created for each single conguration of parameters, where information as the evolution of the training is kept, such as the epoch where the best net had been trained or both validation datasets' errors. The dierent performance measurements are calculated on the validation 2 dataset as well. First, for each sample of the dataset in a sorted order, its pre-dicted action is calculated, and simulated in an amount of $100 from September 2013 to March 2014. It is taken into account if the action was a success (buy 11
  • 16. when the series goes up and sell when it goes down) or not, and the condence of each action is stored as well, calculated as the ratio between the greatest neuron's output and the second greatest in a natural scale after the activation function is applied. Assuming o1 as the greatest output, and o2 as the second greatest, the ratio would be the exponential of their dierence, and the con- dence would be 1 - ratio. After every single sample of this dataset has been analyzed, summarized information as the number of actions, the ratio per ac-tion or the success percentage is calculated, in order to have an outline of every dierent trained network. After the execution of each network, a trace of its behavior is saved in a corresponding le together with some interesting information as the condence of each action performed. After the execution of the networks, one of the faced problems is that the number of actions might be too high, driving the results to a low performance. The rst idea coming up to solve this problem is to use a xed threshold, so that all the actions with a condence lower than this threshold will be ignored. The problem that appears here is that in some executions there are no actions performed at all because this threshold is too restrictive, but in some other executions the threshold does not bound any actions of the set. The solution, to use a variable threshold depending on the set of condences of the series. A parameter indicating the percentage of actions to consider will be needed, and the action plan will be to sort the list of condences and with the parameter's help choose the one that will act as a threshold. This way dierent series with dierent condences can be compared, because the task is done with relative numbers instead of absolute ones. Further restrictions can leave some experiments out of consideration. One of them is a minimum number of movements required after the threshold is applied. A set of 150 samples is considered for the second validation, so for instance an experiment that ends up performing only one action out of these 150 possible ones cannot have very good odds, so it will be discarded. Another constraint will be the number of the best epoch, as networks that are classifying the data randomly are not desired. The starting weights of the network's connections are set randomly, and if after 200 epochs of training, the rst epoch is still the best one, something is not going well, as the training of the network has not been able to improve a random one, so it would be discarded. As an example, a results set where the predicted actions are: 50 samples buy, 50 samples keep and 50 samples sell; which means a total of 100 proper actions. Assuming that the best epoch was high enough, a hypothetic top 5% of the result would be quite poor, because only one out of thirty actions would be performed, but a higher threshold as 25% would probably be better, as now one out of 6 actions will be carried out. When it comes to the practice, this parameter will have to be examined as well in order to nd the optimum percentage of samples to take into account. As a minimum number of movements needed the threshold will be set to 8, same value as the best epoch's minimum number of the network's training. 12
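A minimal sketch of the confidence filtering described above, assuming, as stated in the text, that the output layer uses a logarithmic softmax, so that the probability ratio between the two most likely classes is the exponential of the difference of their outputs. The variable threshold is taken from the sorted list of confidences so that only the requested top percentage of actions survives, keeping ties with the cut-off value. This is an illustration, not the project's actual post-processing code.

    import math

    def confidence(log_softmax_outputs):
        # o1 = greatest output, o2 = second greatest (log-probabilities).
        o1, o2 = sorted(log_softmax_outputs, reverse=True)[:2]
        ratio = math.exp(o2 - o1)      # probability ratio between the two classes
        return 1.0 - ratio             # 0 = undecided, close to 1 = very confident

    def top_percent_filter(samples, top=0.35):
        """samples: list of (confidence, action); keep only the most confident ones."""
        confidences = sorted((c for c, _ in samples), reverse=True)
        cut = confidences[max(0, int(len(confidences) * top) - 1)]
        # Actions tied with the cut-off confidence are kept as well.
        return [(c, a) for c, a in samples if c >= cut]

    # Hypothetical example: three log-softmax outputs for one day.
    print(confidence([-0.3, -1.8, -2.4]))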
  • 17. 4 Models used 4.1 The basic model As it was mentioned before, the initial model will be a regular multilayer per-ceptron with backpropagation as its training method. The rst thing to do is normalizing all the data to standard deviation equal to one and mean zero in order to equally distribute the data and facilitate the learning. When creating the neural network, one hidden layer with logistic as the activation function in its neurons will be used, and in the three neurons of the output layer, the function chosen will be logarithmic softmax. The loss function used to train will be cross-entropy, which computes it between the given input/target patterns, interpreting the ANN component output as a multinomial distribution. A batch size equal to the number of training samples will be used, meaning that all the samples are read before the actual network is updated, which means more computation time to process each step, but more accurate steps. The initial weights will be randomized at the beginning, between the values -0.1 and 0.1 with the purpose of having a neutral network before the training. A pocket algorithm will be used, meaning that the network with the best results will always be available, even if afterwards new training iterations worsen it. The network will keep training until the current epoch's error is twice as big as the error of the best epoch, with a minimum of 200 iterations and a maximum of 3000. The parameters needed to tune the training of the network will be given as the arguments of the APRIL-ANN program, in order to use bash scripts afterwards that will wrap the execution of the network together with more dependent and depended scripts, which are the size of the hidden layer, the learning rate and the momentum. Finally, the system needs to be tested, and a scan of parameters will be done for this purpose. The rst parameter will be the starting date of the series. The series are available from January 2000 to August 2013, as September 2013 and after is part of the validation 2 dataset. Starting dates from January of all the odd years from 2000 to 2012 and from 2011 and 2013 will initially be used together with dates starting in July of the years following 2010. The size of the sliding window is another important parameter to scan, which also sets the size of the input layer. The initial set of values to check here goes from 5 to 200, in order to have an initial idea and proceed with more concrete values afterwards. The next interesting parameter will be the size of the hidden layer, aecting the topology of the network. The values used here are the same as for the sliding window and again, further experiments will be performed for the values that are close to the best results. Another variable for the performance of the network is the learning rate, which initially will be analyzed from 0.001 to 0.5. Given that this parameter strongly depends on the size of the network, which in this case is determined by the sliding window and the hidden layer sizes, new scans will have to be done once the range of these two parameters is smaller. The last parameter to analyze is the momentum of the network. Here not so many options are needed, so the starting values will be 0.0, 0.05, 0.2 and 0.4. 13
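To make the input and output of the basic model concrete, the following Python sketch builds the training patterns: a sliding window of the last N normalized daily changes as input, and the three-class target ({1,0,0} sell, {0,1,0} keep, {0,0,1} buy) derived from the next day's change with the ±0.65% threshold of section 3.4. It is an illustrative reconstruction rather than the scripts actually used in the project; the window size and the example series are arbitrary.

    import numpy as np

    def make_patterns(changes, window=5, threshold=0.65):
        """changes: list of daily % changes; returns (inputs, targets)."""
        changes = np.asarray(changes, dtype=float)
        # Normalization to zero mean and unit standard deviation, as in the basic model.
        norm = (changes - changes.mean()) / changes.std()
        inputs, targets = [], []
        for t in range(window, len(changes)):
            inputs.append(norm[t - window:t])          # values from t-N to t-1
            nxt = changes[t]                           # value to be predicted
            if nxt <= -threshold:
                targets.append([1, 0, 0])              # down -> sell
            elif nxt >= threshold:
                targets.append([0, 0, 1])              # up -> buy
            else:
                targets.append([0, 1, 0])              # flat -> keep
        return np.array(inputs), np.array(targets)

    # Hypothetical series of daily % changes.
    X, Y = make_patterns([0.4, -1.2, 0.9, 0.1, -0.7, 2.1, -0.3, 0.8], window=4)
    print(X.shape, Y.shape)   # (4, 4) and (4, 3)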
  • 18. 4.2 Variants of the model Now that the basic architecture of the network is understood, some dierent modications will be exposed before going on with the pertinent results. First, a change in the input of the network will be presented, then an alternative for the output, and nally a modication of the training process will be explained, changing the order the data is given to the model. Finally, in a new section, a hy-brid model will be presented, as an attempt to put together the main advantages of both learning ways, with two slightly dierent alternatives. 4.2.1 Sliding window vs Averages as inputs The easiest of the proposed modications aects the input information that is being passed to the network. In the basic model, a sliding window taken directly from the original series was the input. The main problem that this method presents is that in order to recognize a new pattern with a high condence, an almost identical one should have been used for training, which is very dicult taking into account the noise allocated in the nancial series. Another way of seeing it is that, this way, the network is learning the data by heart, which makes it dicult to generalize afterwards. The proposed alternative is to use averages instead of the raw values from the series, with the objective of learning the tendency of the series more than the number themselves. Instead of having the window size as one of the variables of the system, this variable will be a vector where each element represents the amount of values used to calculate each of the averages that will be used as an input, always starting from value right before the one that wants to be predicted. For instance, the vector {9,6,3,1} would mean that the rst element of the input layer would be an average of the 9 last elements, the second would be the average of the last 6, the third would be the average of the last 3, and the last one would be the average of the last one element, in other words, the last element itself, as can be appreciated in the Figure 6. Figure 6: Gathering of information to generate four inputs in an {9,6,3,1} aver-ages model. 14
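Expressed in code, the gathering of values shown in Figure 6 amounts to computing one mean per element of the vector, each one counted back from the value right before the one to be predicted. The helper below is only an illustrative sketch with hypothetical data, assuming the {9,6,3,1} vector used in the figure.

    def average_inputs(changes, lengths=(9, 6, 3, 1)):
        """Return one input vector: the mean of the last k values for each k in lengths."""
        return [sum(changes[-k:]) / k for k in lengths]

    # Hypothetical last nine daily % changes (oldest first).
    history = [0.2, -0.5, 1.1, 0.3, -0.9, 0.7, 0.4, -0.2, 0.6]
    print(average_inputs(history))   # four inputs: mean of the last 9, 6, 3 and 1 values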
  • 19. In the previous picture it can be seen that the size of the input layer in this case would be of four neurons, each of them denoted as i#, where the hashtag represents its number, containing the averages of the values encompassed in the gure. To get a better understanding of the dierence of both methods, a series is represented with both methods in the Figure 7, where the averages model uses a vector {20,15,10,5}. Figure 7: Comparison of the uctuations of the same data represented as Raw and as Averages of the last values from t-20 to t. In the gure can be seen the values of a series' uctuations from the last twenty days in both raw values and averages of 20,15,10 and 5. The one using averages is more general, but contains less information, hence is easier to learn. 4.2.2 Three-class classier vs Forecasting model The next interesting change of the basic model aects the output of the network, where in the basic model a binary vector was used representing if the day after the market went up, down or kept its value. The alternative approach consists of replacing this output layer of three neurons by a layer with a single neuron, which contains the real value provided by the nancial series. The main benet of this resides in the fact that with only one output, there is no reduction of information in the model. In other words, the model with three outputs considers a raise of 0.7% and a raise of 5% as the same, when the real repercussion that the second causes is much higher than the one caused by the rst. Or a slight dierence between two similar values, such as 0.64% and 0.65%, which are pretty much the same but are considered as completely dierent outputs. An example of the dierent types of outputs can be seen in Table 1. 15
  • 20.
    Date         Current value   Trend class   Forecast
    2013-01-31   -2.445          {1,0,0}       -1.585
    2013-02-01   -1.585          {1,0,0}       -3.768
    2013-02-04   -3.768          {0,0,1}        2.197
    2013-02-05    2.197          {0,1,0}       -0.462
    2013-02-06   -0.462          {0,1,0}       -0.516
    2013-02-07   -0.516          {0,0,1}        2.000
    2013-02-08    2.000          {1,0,0}       -1.177
    2013-02-11   -1.177          {0,0,1}        1.932
    2013-02-12    1.932          {0,0,1}        0.868
    2013-02-13    0.868          {1,0,0}       -0.707
    2013-02-14   -0.707          {1,0,0}       -1.178
    2013-02-15   -1.178          {0,1,0}       -0.506
    2013-02-18   -0.506          ?              ?
  Table 1: Example of the different outputs for the IBEX35 series over the period from 2013-01-31 to 2013-02-18.
  This modification entails two main changes in the neural network apart from the topology. One of them is the activation function of the output layer, which until now was a softmax; as the values no longer need to tend towards a discretization but are continuous instead, the activation function will now be linear. The other change regards the loss function, which for the classifier model was cross-entropy, but as there is only one value now, this function would no longer make much sense, so it will be changed to the mean squared error (MSE). With only one output the problem becomes a forecasting model instead of a classification in three classes as it was before. As mentioned previously, the principal advantage is the different importance of each value for training, in order to distinguish strong and weak tendencies, but there is also a negative side, predominantly concerning two problems. The first one is merely technical; basically, a forecasting model is not as stable to train as a classification problem. A forecasting model is more likely to diverge, mostly when high learning rates are used, but it is also not guaranteed to converge when smaller rates are used. As lots of different experiments are run, sometimes it can be very difficult to know whether the network has converged enough, as with the highly noisy nature of the data a random network can easily provide decent results that can lead to confusion. The second problem to face with this method is more practical, and resides in the fact that the highest short-term peaks of the stock market are normally caused by important news, which is in fact unpredictable. This means that the samples that will have a greater impact on the system are the ones that probably should not be learned by it, although they are not abundant. 16
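The change from the classifier to the forecasting model only affects how the target is built and how the error is measured. A minimal sketch, assuming the same sliding-window input as before: the target is now the raw next-day change itself, and the quality of a prediction is measured with the mean squared error instead of cross-entropy. The helpers below are hypothetical and only illustrate the idea; the "always flat" predictor is just a baseline for the example.

    import numpy as np

    def make_forecast_patterns(changes, window=5):
        """Single-output targets: the real % change of the following day."""
        changes = np.asarray(changes, dtype=float)
        inputs = [changes[t - window:t] for t in range(window, len(changes))]
        targets = [changes[t] for t in range(window, len(changes))]
        return np.array(inputs), np.array(targets)

    def mse(predicted, real):
        predicted, real = np.asarray(predicted), np.asarray(real)
        return float(np.mean((predicted - real) ** 2))

    # Example using the first values of Table 1; the predicted direction is the sign
    # of the single output neuron.
    X, y = make_forecast_patterns([-2.445, -1.585, -3.768, 2.197, -0.462, -0.516, 2.000], window=3)
    predictions = np.zeros_like(y)              # a trivial "always flat" predictor
    print(mse(predictions, y), np.sign(y))      # baseline error and real directions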
  • 21. 4.2.3 Traditional MLP vs Model with data overlapping This last modication regards the organization and order the data is given to the system to train, validate and test. It arises from the idea of the dierent contexts that might have an eect throughout a series. Social, economic and historical features are very dierent nowadays than before 2005 for instance, mostly with an economic crisis in-between, which make markets behave dierently. For this reason, the objective of this modication is to train with data chronologically closer to the data that is going to be predicted, in order to reduce the dierence of contexts. In the regular MLP model explained up to this point, the second validation data was comprised from 1stSeptember 2013 to 31stMarch 2014, the rst valida-tion data was the more recent 25% of the remaining samples, and the training data was the remaining oldest 75%. When the series are long, and they can be of as long as 14 years, a big gap exists between the data used to train the network and the data used to do a second validation and or a test. Concretely, starting in January 2000, the last data used for training is from the beginning of 2010, which leaves more than three years used for a rst validation that ba-sically means predicting data of 2014 using a network trained with data older than 2010. This is an extreme example where the easy solution would be to sim-ply reduce the size of the series, as probably so much data will not be needed, but even if the data were reduced, the same problem would appear in a smaller scale. The proposed solution for this is to avoid the rst validation dataset in order to put training and testing datasets closer. To do so, a model where only one dataset exists is proposed and the network iterates over it in chronological order. Given a concrete sample, it will be used to test the network and in the next iteration will be used to train it, whilst it will be tested with the following one. With this method, each single sample of the network will be used rst to test it and afterwards to train in order to not test samples that have been used to train. For instance, assuming a new iteration, the rst thing to do would be to use the current sample t-1 to train the network and immediately after the sample t will be used to test it. In the next iteration, t will be used to train the network and t+1 to test it. This sequence will continue until the last value of the series has been used to test it. The errors are calculated the exact same way as they were before, with the only dierence being that in the overlapping of data model they are calculated whilst the training is being done. Also, the overall splitting of the data will be kept, meaning that until 31stMarch 2014 the samples will be used to train and validate, and from April 2014 onwards the samples will be used to train and test. As one longer continuous dataset will be used for the second validation and testing of the data, the only dierence will be that the best results until March 2014 will be picked, and there will be no choice from April onwards. A simple outline of this process can be seen in Figure 8, replacing the regular method shown in Figure 5. 17
  • 22. Figure 8: Scheme showing the proposed method with overlapping of data in one single dataset. The main advantage this model has is the absolute utilization of the data to train the network, and the factor that by learning all the samples in a chronolog-ical order, the more recent the sample is, the more impact it has on the system, so that it will forget old samples by learning new ones and modifying the system according to these. On the other hand, there is a big disadvantage: the undertting, which will be explained in detail in section 7.2 and can occur due to the network using each sample to train only once during all the process. This can be patched up by increasing the learning rate or by using an adequate num-ber of iterations determined by the length of the network, so that the network will iterate the correct number of times. Both learning rate and series' length will be parameters to scan and analyze afterwards, as will be seen in posterior sections. 4.3 Hybrid model with overlapping of data Up to this point, dierent modications have been explained with slight dif-ferences from the initial model; a traditional multilayer perceptron. The most uncommon model is probably the one with the overlapping of data, which does not use a typical splitting of data that neural networks normally use, adding the 18
  • 23. small advantage of seizing the data better than regular models at the expense of the huge disadvantage consistent in data undertting. In order to abate this, two new models will be presented with the purpose of avoiding the undertting problem, but without losing the advantageous data seizing. 4.3.1 Explicit validation dataset The rst of the alternatives is also the simplest one, based on the overlapping model; a full training of a network is done for each new available sample. Another way of understanding it would be starting from the basic model and using only one sample as the second validation dataset, instead of the 150 samples used before, and iterate over the old whole set. Once this prediction is performed, all the datasets advance one sample in time, so that the predicted sample is now used as the last one of the validation dataset, while the following sample will be predicted now. With this model, a completely new articial neural network is created for each sample that is needed to be predicted, having a dierent num-ber of the best training epochs, as they depend on the samples of the datasets, which are being modied each time. The number of samples used in total for the prediction of each tendency in time would remain constant for each of the predictions, as gure 9 shows. The splitting of the training and the rst valida-tion dataset will be kept as 75% and 25%, same as it was in previous models, and the starting date will be considered as a network's input parameter as well. Figure 9: Training methodology of the model with validation dataset. 19
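In pseudocode-like Python, the hybrid model with an explicit validation dataset is a walk-forward loop: for every sample to be predicted, a complete network is trained from scratch on the data available up to that point, split 75%/25% into training and validation 1, the new sample is predicted, and then all datasets advance one position. The train_network and predict arguments below are hypothetical placeholders standing in for the real training and prediction routines; the sketch only shows the data handling.

    def walk_forward(samples, first_target, train_network, predict):
        """samples: chronologically ordered patterns; first_target: index of the
        first sample to predict. One new network is trained per predicted sample."""
        predictions = []
        for t in range(first_target, len(samples)):
            history = samples[:t]                      # everything strictly before t
            split = int(len(history) * 0.75)
            train, validation = history[:split], history[split:]
            net = train_network(train, validation)     # full training with early stopping
            predictions.append(predict(net, samples[t]))
        return predictions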
  • 24. There is an obvious disadvantage in comparison with previous models, which is that the time spent by the model to predict the samples increases considerably, as now one network is trained and used to predict each sample. In this case, where 150 samples are available along the validation dataset, the time spent in previous models gets multiplied by 150. Due to the nature of the problem, where only one more sample is available per weekday, this is not a big issue, as in between the closing time and the opening time of the following day there is plenty of time for training the new models and predicting the new tendencies. However, the process of looking for the correct parameters is very expensive in computational terms, taking around 150 times longer than previous models where the whole second validation dataset was predicted with the same trained network. A positive side of this is problem is that because this model is consider-ably similar to the anterior ones; the scanning of parameters should not be very wide, as the ideal parameters for the other models are already known. Hence, the scanning of parameters should shorten, with its corresponding reduction of time. 4.3.2 Implicit validation dataset The last of the models to analyze is an evolution of the previous one; the hybrid model with validation dataset with certain characteristics of the overlapping model. The main idea of the model is to use a training dataset chronologically as close as possible to the sample that is to be predicted each time by moving the validation 1 dataset used in previous models. If it were just removed, the problem faced would be that the stopping criterion would be undened, as it is set according to this validation dataset. The proposed solution to this problem is to set a x number of training iterations for each sample's prediction, determined by the best epoch obtained in previous full trainings with a validation 1 dataset. The prediction of a determined sample is performed as follows: rst, a net-work is trained using both training and validation 1 datasets in order to predict the sample that is immediately after the validation dataset, as was done with the previous model; then the number of the best epoch is kept and the trained network completely discarded; later a training dataset of the same length as the one used before is taken from immediately before the sample to be predicted, and the training is performed with the same parameters during the stored num-ber of epochs; nally the resultant trained network is used to predict the sample. In Figure 10 a detailed process of the training method is shown, where the rst part of each iteration is used to get the number of the best epoch and the second to train the actual network with a xed number of training epochs. 20
  • 25. Figure 10: Training methodology of the hybrid model with implicit validation dataset. A problem that might appear with this method is that sometimes when a network is trained the number of the best epoch could be one, meaning that the training has not improved the initial random model. If the parameters used are correct, this will not be a common problem, but can still happen. The proposed solution to this problem is to use more than one previous training to determine the xed number of epochs, by calculating their average. The number of old best epochs used to calculate the average will be seven, the last consecutive ones. 21
  • 26. When the best epoch of a training equals one, the average drops down and sometimes it can have a great impact on the average number of epochs. To solve this, the lowest of the seven epochs is removed from the average, removing as well the greatest one so that the average does not become unbalanced. At the start of the series, the average of the first networks will be used as the number of training epochs for that same number of networks, because no previous data is available. Table 2 shows an example of a series where the number used to calculate the average is five, or three after the removal of the lowest and greatest best epochs.
    Sample number   1   2   3   4   5   6   7   8   9  10  11  12  13
    Training BE    18  25  20  19  17  24  29  14  25   1  21  19  20
    Iterations     19  19  19  19  19  21  21  20  22  21  20  18  21
  Table 2: Example of the number of iterations calculated out of the best epochs from the previous 5 samples.
  The table shows the resulting number of iterations obtained from the last 5 samples. For instance, for sample number 10 the values 24, 29, 14, 25 and 1 are available. Removing the greatest and lowest, which are 29 and 1 respectively, the values 14, 24 and 25 remain. Calculating their average, the obtained number of iterations is 21, which will be set for the training of the network used to predict sample number 10. It is important to mention that this model takes on average almost twice as much time as the hybrid with a validation dataset, which was already taking 150 times more than the initial models. It first needs to train the networks the exact same way as the previous model was doing, and afterwards train a new network with different samples up to the best epoch of the previous training. When searching for the best epoch, the training keeps iterating even when the errors obtained are worse than the best one, for at least 50% more epochs than the current best epoch's number. This means, for instance, that in a training where the best epoch is reached at iteration number 600, the network will iterate another 300 epochs, until epoch 900, and if the best epoch is still number 600, then it will stop. During the training of the actual network, only 600 iterations would need to be performed, a considerable saving of time depending on the case. In general terms, it can be said that this second hybrid method is approximately 70% more expensive than the first one. 22
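The rule used to fix the number of training iterations can be written as a small helper: take the best epochs of the most recent trainings (seven in general, five in the example of Table 2), drop the lowest and the greatest, and average the rest. The sketch below reproduces the worked example for sample number 10.

    def fixed_iterations(best_epochs):
        """best_epochs: best epoch of the most recent trainings (e.g. the last 7,
        or the last 5 in the example of Table 2)."""
        trimmed = sorted(best_epochs)[1:-1]        # drop the lowest and the greatest
        return round(sum(trimmed) / len(trimmed))  # average of the remaining epochs

    # Best epochs of the five trainings preceding sample number 10 (from Table 2).
    print(fixed_iterations([24, 29, 14, 25, 1]))   # -> 21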
  • 27. 5 Results In this section a comparison of the different alternative models will be shown, starting from the basic system's results. Note that, as mentioned in previous sections, the validation 2 dataset comprising from September 2013 until March 2014 will be used to measure each system and modification, and further analysis will be done in order to check the networks with unknown future values of the series. The first thing needed is a baseline for the results, so several random walks were generated for the series. For each single day of the validation 2 period, one action has been randomly picked with an equal distribution from buy, sell or keep. Table 3 shows some information on ten random walks executed, considering as the number of actions the sum of both buy and sell actions, excluding the keep ones.
    #        Final money   Actions   Benefit/action   Success rate
    1        $84.03        88        -0.192%          42%
    2        $97.94        102       -0.015%          47.1%
    3        $108.88       95        0.09%            50.5%
    4        $113.07       92        0.14%            51.1%
    5        $107.28       107       0.072%           56.1%
    6        $90.85        103       -0.087%          44.7%
    7        $108.97       100       0.091%           45%
    8        $107.76       91        0.088%           54.9%
    9        $93.84        109       -0.053%          46.8%
    10       $100.26       94        0.009%           45.7%
    Average  $101.288      98.1      0.000143%        48.39%
  Table 3: Execution of ten independent random walks showing the final amount of money, number of movements, ratio of benefit per action and success rate, together with the average of them all.
  In terms of the money obtained, the table shows that on average, using a random walk strategy, the benefit after 98 actions would be $1.288, which is not very good. The best performance of the random walks (number 4) has obtained a benefit of $13.07 with a rate of 0.14% per action. On the other hand, the worst has been number 1, with a total loss of $15.97, meaning an average loss of 0.192% per action. The median execution is also the closest to the average, number 10, remaining very close to the initial sum of money with $100.26. In Figure 11 a comparison can be seen of the best, worst and median random walk executions together with the fluctuation of the original series itself. 23
  • 28. Figure 11: Comparison of the best, worst and median random walks against the original series. In order to have a better idea, another 100 random walks have been executed, showing an average end money of $99.08 with a standard deviation of 10.66. This reinforces the idea that the series is biased neither to win nor to lose money, but to maintain its value. 5.1 Basic model As mentioned in previous sections, a scan of parameters is performed, generating a big amount of experiments run. An easy and quick solution would be to choose the experiment that has made the maximum amount of money without any kind of boundaries in the output, which in the present experimentation consists in using the series starting in January 2004, a window size of 140, a hidden layer of 10 neurons, a learning rate of 0.35 and the absence of momentum. This configuration has managed to obtain $136.8 out of the initial $100 in 150 movements, with a success factor of 55.3%. The problem is the lack of stability of the results with similar parameters. For instance, modifying the momentum, which is probably the parameter that affects the system the least, from 0 to 0.1, the benefit of $36.8 turns into a loss of $22.2 of the initial money, dropping the quantity to $77.8. This means that the reliability of the result is very poor, and that it has obtained the results quite randomly, without learning much from the series. Analogously to the best result in terms of absolute money, a 24
  • 29. maximum benefit rate of 1.1% per movement has been obtained, as well as a success factor of 100%, in other experiments, but none of these experiments are relevant for the same reason as the one explained before. The objective of the analysis is to find a cluster of experiments with similar parameters and decent results in order to give some reliability to the parameters used. But before that, a simple postprocess has to be applied to the results, consisting in considering just the top x percent most confident actions for each experiment, as explained in the post-process section. In Table 4 different top percentages are compared for the same experiment (window size 140, hidden layer size 35, learning rate 0.05, and momentum 0) with good results, together with their final effect on the initial amount of $100, the average ratio of benefits per movement and the cross-entropy error of the set:
    Top percentage   Final money   Actions   Benefit/Action   Error
    All actions      $93.9         88        -0.06%           1.10
    80%              $98.2         74        -0.02%           1.09
    70%              $100.3        65        0.01%            1.09
    60%              $109.4        57        0.16%            1.08
    50%              $114.7        48        0.29%            1.08
    40%              $114.5        45        0.31%            1.06
    30%              $120.0        37        0.5%             1.00
    20%              $117.1        29        0.55%            0.95
    15%              $116.5        21        0.73%            0.87
    10%              $114.5        15        0.91%            0.81
    5%               $110.4        7         1.43%            0.61
  Table 4: Final amount of money and average ratio of benefits per action for different filtered top percentages applied to the same experiment.
  Table 4 illustrates that a greater final amount of money is not always a better result, nor is a higher ratio of benefit per action. Performing 100% of the predicted actions in this example (all the buy or sell actions out of the 150 days), there would be a loss of $6.1. Using the top 10%, the final amount of money would be the same as using the top 40%, but the average gains are different: 0.91% against 0.31%. Even though the amounts of money are the same, the top 10 percent is clearly more convenient considering the commission charged by the brokers, explained at the beginning of this document. Also, as a lower number of actions is required, a higher benefit per action is reached, meaning that less risk is taken. The highest ratio per action has been reached by the top 5%, but not all the potential of the model would be seized, as using the top 10% or 15% there is a lower ratio but more actions are taken into account, generating a higher amount of money, which is probably worth, at least, consideration. If the decision were to invest uniquely in this series, a higher percentage of actions would be more recommendable. For instance the top 30%, which has a good ratio of benefits per action using a decent number of actions that would 25
  • 30. increase the money without investing too much or too little, and has managed to gain the highest amount of money among the tested top percentages, $120.0. If more series are taken into account, tighter top percentages would be better options, as at every moment one action per series would be available, meaning that just by using the top actions of each series, a high number of actions would be performed over time among all the considered series. Figure 12 shows the behavior of each top percentage's effect through time on an initial amount of $100: Figure 12: Timeline showing the behavior of applying different top percentages to the same results file through time. Finally, as the best percentage for one single series was between 20 and 40%, it is decided to use the top 35% of the actions, and the first consequence is an increase of the average benefit per action, as expected. The experiment that got the highest amount of money with the top 35% obtained a total of $125.8 in 52 movements. This is the best result so far, as the best execution using the total of actions managed to end up with $136.8 out of 150 movements; around one third more money with three times more movements. The parameters used for the current best experiment are: starting date, January 2011; window size, 80; hidden layer size, 35; learning rate, 0.35; and a momentum of 0.005. If instead of picking the best experiment, the set of results gets pruned little by little until just a few good results are left, the set of results would end up with the following constraints: starting date, only January 2012; window size, 26
  • 31. between 90 and 100; hidden layer size, between 15 and 45; learning rate, lower or equal to 0.05 and momentums lower than 0.1. This constraints oer a set of more than 70 experiments, where the poorest performance ended up with a gain, $104. The nal parameters used are not the ones that performed the best, but the ones that are in the middle of a set of best performance, which are the following: window size, 100; hidden layer size, 15; learning rate, 0.05; momentum, 0.005. These parameters obtained a total of $109.1 in 11 movements, meaning a rate of 0.8% and a success rate of 81.8%; very good results. Now the results need to be tested in order to know the real obtained poten-tial, and for this purpose a test dataset is available, comprising from 1stApril 2014 to 31stJuly 2014. Also, the length of the series used for training will be con-stant, meaning in this case that instead of starting the series on the 1stJanuary 2012, it is starting on the 1stAugust 2012, as there is a 7 months lag between both datasets. Another thing to take into account is that in the previous val-idation dataset 150 days were available in total, and in this test dataset the amount of days has shrank to 87 days, as the dataset has shorten from 7 to 4 months. When the parameters are applied to a network generated for the test dataset, only one action out of the 87 possible ones is carried out, on the 29thApril, where the predicted action was to buy and the market went up a 0.44%, meaning a nal amount of money of $100.44, and obviously a success rate of the 100%. This result is rather disappointing, as more actions were expected, although the benet per action is still fairly good. After this, no further experiments with the basic model will be done, as the hybrid systems are expected to be more powerful than the present model, so the eort would be put on them. 5.2 Averages as inputs The rst of the proposed modications was the substitution of a sliding window by the use of averages of recent values, as explained in section 4.2.1. Coming to the execution of these networks, the rst striking part of these experiments is the speed improvement when training. While with the sliding window the input layer could have sizes of up to 140 neurons, with this model they will rarely have more than ten neurons in the input layer. This is a huge reduction of the computational time, as every hidden layer's neuron is connected to all the input layer's ones. On the other hand the pre-process of the data takes a little bit longer, as the averages have to be calculated, but this time is far less than the time saved during training. Furthermore, this does not have to be done for each single training of the networks, one pre-process of data is needed for each input pattern, comprising a lot of networks available to train with dierent parameters. The scan of parameters is pretty much the same as in the previous model, but instead of using a number for the window size, now a list of number is used, with patterns as 50-30-20-15-10-5-4-3-2-1, 20-15-10-5, 20-15-10-5-3-1, 100- 80-70-60-50-40-30-25-20-15-10-5 or 25-20-15-10-5. Other patterns checked were 50-49-48-...-3-2-1 with dierent lengths instead of 50. However, the best results 27
were obtained with the simplest patterns, like the multiples of 5 up to 20 or 25. Multiples of other numbers were tried, but the results were not better than with 5, so most of the experiments stuck to this kind of pattern. First of all, it is important to note that the best experiment using the total of the actions used the following parameters: starting date, January 2011; pattern, 100-80-70-60-50-40-30-25-20-15-10-5; hidden layer size, 20; learning rate, 0.5; momentum, 0.005. It managed to get a total of $134.4 in 87 movements. Again, it is not desired to perform the total of the actions, so only the top 35% most confident actions are taken into account. Using the top 35% of the actions, the parameters that produced the best result are: starting date, January 2008; pattern, 30-25-20-15-10-5; hidden layer size, 100; learning rate, 0.5; momentum, 0.02. The final amount of money was $138.8 in 56 movements. Note that, even though 35% of 150 actions are considered, which should mean a maximum of 53, 56 actions are performed. This is due to the fact that different actions might have the exact same confidence, and when this happens with the cut-off action that separates the ones considered from the ones discarded, all the equally confident actions are considered as inside the threshold. Similarly to the basic model, the very best experiment will probably not be the most convenient one, so another pruning process is carried out over the present set of results. Finally, a good set of results is obtained using the following boundaries: starting date, January 2008; patterns, 25-20-15-10-5 and 40-39-38-37-...-4-3-2-1; hidden layer sizes, 100 or lower; learning rates of 0.01 and 0.05; momentums of 0.0, 0.005 and 0.02. Note how two extremely different input patterns provide the best results, while patterns very similar to both of them were discarded because the obtained results were not as good. The remaining set is formed by more than 50 experiments, and excluding two isolated results that ended up with a loss of almost $10, the experiments' gains oscillate between $119.9 and $102.1, with an average of $109.27 including them all. When it comes to picking the best result, it is generated by the following parameters: starting date, January 2008; pattern, 40-39-38-37-...-4-3-2-1; hidden layer size, 100; a learning rate of 0.05 and a momentum of 0.005. The experiment ended up with a total of $115.1 out of the initial $100, obtained in 45 movements, and with a cross-entropy error of 1.05. If the experiment is replicated with the test dataset, the results obtained are: a final amount of money of $106.82 in 24 movements, 0.28% benefit per action, a success rate of 66.7% and a cross-entropy error of 1.072, quite consistent with the results obtained during the validation period.
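The selection of the most confident actions described above, including the ties at the cut-off that explain why 56 actions can be kept when only 53 were expected, can be sketched as follows. This is only an illustration: the function name is hypothetical and the rounding of the cut-off position (ceiling is assumed here) is not specified in the original work.

```python
import math

def select_top_actions(actions, top_fraction=0.35):
    """Keep only the most confident fraction of the predicted actions.

    `actions` is a list of (confidence, action) pairs, e.g. (0.82, "buy").
    All the actions tied with the cut-off confidence are kept, which is why
    slightly more than `top_fraction` of the actions can end up performed.
    """
    if not actions:
        return []
    ranked = sorted(actions, key=lambda a: a[0], reverse=True)
    limit = max(1, math.ceil(len(ranked) * top_fraction))  # e.g. 53 out of 150
    cutoff = ranked[limit - 1][0]                 # confidence of the last action inside the top
    return [a for a in ranked if a[0] >= cutoff]  # ties at the cut-off are included


# Example: with 150 predictions, roughly the top 35% (plus any ties) is kept.
# performed = select_top_actions(predictions, top_fraction=0.35)
```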
5.3 Forecasting model

In this section the results obtained with the second modification will be presented, consisting of replacing the three output neurons with a single one, changing the model from a classifier to a forecasting model, as explained in section 4.2.2. The input of the model is a sliding window, as the modifications are made to the basic model. The scan of parameters is done the same way as for the initial model, as the only difference is the output and it cannot be changed during the experiment. As this model is not as easy to train as the previous ones, because it is more prone to diverge, smaller learning rates will be used. Instead of using a minimum learning rate of 0.001 as before, for these experiments the minimum value of this parameter is ten times lower, so the learning rates go from 0.0001 to 0.1. After all the experiments have been run, the first remarkable fact is that, to the naked eye, the set of results is more prone to gain money, unlike the results of previous models, which tended to remain around the initial amount of money without any clear tendency. In this set of experiments, the parameters that performed the best in terms of final amount of money are the following: starting date, January 2013; window size, 35; hidden layer size, 25 neurons; learning rate of 0.1 and no momentum. They obtained a total of $144.4 in the 150 movements, an MSE of 0.5584 and a success rate of 58%. In the experiments of this model, all the initial results perform the total of 150 actions, as there is no keep action available, just the numerical prediction of the market going up or down, which increases the number of actions performed in comparison with classifier models. Taking into account only the top 35% of the actions, the best result is obtained by: starting date, January 2013; window size, 35; hidden layer size, 25; learning rate, 0.1; and no momentum. With a total of $139.6 in 143 movements, it can be said that this is a very poor result; firstly because the highest learning rate is used as well as the shortest period of data, making it not very likely to perform the best, and secondly because it has performed 143 actions out of 150 when taking just the top 35%, meaning that more than 100 predictions along the series have the exact same confidence, which is not a good symptom at all. When pruning the set of results, 88 good ones are obtained with the following boundaries: starting date, January 2012; window sizes, from 20 to 35; hidden layer sizes, from 20 to 80; learning rate, 0.0002; and momentums up to 0.1. The set of results is excellent except for one execution that, regardless of its momentum, ended up with $98.4; the amount of money in the set goes from $103.5 in the worst case to $127.5 in the best, slightly better than the results obtained with the classifiers. On the down side of this selection of results, it can be noted that the average number of movements in the set is higher than in previous experiments, mostly 52 movements, and a little more for a few experiments. An exception is the case of a window size of 35 and a hidden layer of 50, where, depending on the momentum, the number of movements is 150, 147, 67 or 77, meaning that no good distinction has been made by using the top 35%. When it comes to the selection of the best parameters, a good option is as follows: starting
date, January 2012; window size, 20; hidden layer size, 20; learning rate, 0.0002; and momentum, 0.005. These parameters managed to obtain a total of $118 in 52 movements, meaning an average of 0.32% per action, a success rate of 57.7%, and an MSE of 0.5. In comparison with previous models, the number of movements is very high. In order to mitigate this and try to increase the average benefit per action, the top 35% will be reduced to the top 20%. Analyzing the same bounded set of results, the average benefit rates have increased in general; the experiments are still all gaining money apart from the same one as before, which is now losing slightly less, ending up with $98.5. The typical number of actions has gone down from 52 to 30, and the result marked as best now shows $114 in 30 movements, an MSE of 0.58, a benefit ratio of 0.44% per action and a success rate of 60%; good results for such a high number of movements. Applying the best parameters to the test dataset from 1st April to 31st July 2014, the results are rather disappointing, as with the top 20% the final amount of money is $98.25 in 17 movements, meaning a loss of 0.1% per action. When the rest of the parameters included in the set of good results are run on the test dataset, the results do not seem to improve, as there are now more experiments with losses than with gains. Apart from the problem of extrapolating results, which will be explained in section 7.3, this is because during the training of both models the series' tendency was bullish, strongly biasing the forecasting model's training towards buy actions, whereas during the test period it was not. The series' high noise made learning very difficult, ending up in a very narrow range of confidences close to the average, which in these cases was positive (bullish series), hence the abundance of buy predictions.

5.4 Overlapping of data

In the present section the results obtained from the model explained in section 4.2.3, where the regular MLP was replaced with a model with data overlapping, will be shown and explained. One thing to mention is that the learning rates used will be bigger than the ones used in previous models, as the maximum number of training iterations is limited here, while it was not for previous models. Also, note that this modification is applied to the basic model, a classifier with three outputs and a sliding window as the input of the network. When analyzing the results, the parameters performing the best in terms of final amount of money are the following: starting date, January 2013; window size of 30 values; hidden layer size of 50 neurons; learning rate, 0.25; and momentum of 0.02. They ended up with a total of $135.3 in 80 actions, which is a 0.44% benefit rate with a success ratio of 58.8%. With an overall picture of all the results, and taking the top 35% of the actions, it is remarkable how little the amount of money varies depending on the starting date of the data. One of the problems mentioned a priori was the underfitting of the data, which could have been overcome by using longer series, although in that case old data would be forgotten as newer samples are learned. The results have confirmed this, as Figure 13 shows:
[Plot: final money obtained against the starting date of the series, from 2000 to 2014.]

Figure 13: Fluctuation of the experiments' money obtained from September 2013 to March 2014 considering series starting at different points in time.

As can be seen in the above figure, no big differences can be appreciated with the naked eye across the different tested starting dates. Networks with series starting in January 2000 are trained for more than 3500 epochs before the validation dataset is taken into account, whilst networks trained from January 2013 are trained for barely 150 epochs, and their results are not that different. This is because for older series the network keeps updating its weights with the newer samples, forgetting the older ones. Also, it has been shown that networks learn better when starting with random weights in their connections [14], and due to the high noise the training on this data leaves the weights not far from a random initialization, minimizing the effect of the training carried out on older samples. Note again that for all the experiments explained in this document, the networks' connection weights have been initialized with random values between -0.1 and 0.1. Using the top 35% most confident movements, the best parameters change to: starting date, January 2008; window size, 25; hidden layer size of 60 neurons; learning rate of 0.45 and a momentum of 0.02. These parameters have obtained a total of $121.2 in 27 movements, meaning a very good benefit rate of 0.79% per movement. As in previous sections, when trying to prune the set of results in order to minimize the loss, the remaining set is quite big, as 144 results remain with the following constraints: starting dates of January 2006, 2008 and 2010; window sizes of 60, 80 and 100 values; hidden layer sizes of different values between 20 and 50; learning rate between 0.08 and 0.12; and the absence of a momentum. In this set the results are not so great, as they oscillate from a maximum loss of $0.9 to a maximum benefit of $4.5.
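As a side note on the random initialization mentioned above, the interval [-0.1, 0.1] can be reproduced with a few lines of NumPy; the layer sizes in the example are only illustrative and the helper name is hypothetical.

```python
import numpy as np

def init_mlp_weights(layer_sizes, low=-0.1, high=0.1, seed=None):
    """Uniformly initialize the weight matrices of a fully connected MLP
    in the interval [low, high], one matrix per pair of consecutive layers."""
    rng = np.random.default_rng(seed)
    return [rng.uniform(low, high, size=(n_in, n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Example: a window of 60 inputs, a hidden layer of 20 neurons and 3 outputs.
weights = init_mlp_weights([60, 20, 3], seed=0)
```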
Finally, the chosen best result is obtained with the following parameters: starting date, January 2010; window size, 60 values; hidden layer size, 20 neurons; learning rate of 0.1 and a momentum of 0. The final amount of money obtained is $103.1 in 6 movements, a very good positive benefit ratio of 0.52% and a success rate of 83.3%, but with very few actions performed, only 4% of the 150 possible movements. With this group of parameters, the network trained for the test set obtained $100.8 in two movements, a benefit ratio of 0.4% and a success rate of 50%. As with the basic model, the results are a little poor in terms of the number of actions carried out, although the benefit ratio is still good. Again, no more experiments will be performed, as the following hybrid systems are a priori more powerful models.

5.5 Summary

After analyzing the basic model, the two simple modifications and the model with overlapping of data, some early conclusions can be drawn. First of all, in Figure 14 a comparison of the best results obtained with the chosen top percentage of each model gives an idea of the maximum potential of each experiment. The top percentage in terms of actions' confidence is 35% for all the models but the forecasting one, which uses 20%.

[Plot: money evolution of the Basic, Averages, Forecasting and Overlapping models from 01-09-2013 to 31-03-2014.]

Figure 14: Comparison of each of the experiments' best result through time in the second validation dataset.
Regarding the previous figure, it must be mentioned that the forecasting model has the advantage of performing more than 140 actions, as explained in section 5.3, while the basic and averages models perform around 55 movements and the overlapping one only 6. This means that even though the forecasting model has obtained the greatest amount of money, the other models are probably better, as their average benefits per action are greater. When it comes to the chosen experiment of each model, a comparison of the four models can be seen in Figure 15.

[Plot: money evolution of the chosen Basic, Averages, Forecasting and Overlapping configurations from 01-09-2013 to 31-03-2014.]

Figure 15: Comparison of the well-generalized results of each of the four basic models in the second validation dataset.

Using the information shown in the present subsection together with the previous results of the different models, the forecasting model can be discarded in favor of the other basic models. Also, the overlapping model's results have not been bad, a fact that motivated the evolution of that model towards the hybrid models explained later. Both the basic and the averages-as-inputs models behaved quite well and seemed to learn some useful patterns. As a last important point, the comparison of the different models' results on the test dataset is shown in Figure 16.
[Plot: money evolution of the chosen configurations of the four models on the test dataset, from 1-4-2014 to 31-7-2014.]

Figure 16: Comparison of the chosen parameter configurations of the different models applied to the test dataset.

5.6 Hybrid models

In this subsection the basic results obtained with both hybrid models will be explained. These models are obviously more powerful than the model using overlapping of data, since that model is an extreme simplification of them, doing only one training step per sample with a batch size of one. They are also more powerful than the basic model, as they should be able to perform at least as well as it, due to the use of more recent data to predict each sample. After the results obtained and explained in previous sections, it is decided to discard the forecasting model and to analyze these hybrid models with both a sliding window and averages as inputs. When running these networks, one of the first things to note is that, in general, the number of actions performed by both models when short series are used is very low, generally not more than 5 actions out of the possible 150. To mitigate this, the threshold used to consider an action bullish or bearish has been decreased from ±0.65% to ±0.60%, and some more experiments have been run. This way, more actions are performed, as the number of keep actions decreases; for instance, an increase of 0.62% that would have been considered a keep action is now considered a buy action. Thresholds lower than 0.6 were tested as well, but they mostly led to more unstable
networks where all the actions were either buy or sell, but not a combination of both with a proper learning of patterns. In Figure 17 the number of performed movements is shown for different starting dates, demonstrating how the number of performed movements increases with older starting dates, tending to a maximum average of around 100, which is exactly two thirds of all the possible actions.

[Plot: number of performed actions against the starting date of the series, from 2010 to mid-2012.]

Figure 17: Number of performed actions against different starting dates of the series.

A new issue that comes up with these models is that the confidence threshold used in previous models is not as reliable now, as a different network is used to predict each of the samples. Comparing networks that have been trained with different models and a different number of epochs is not that simple. In general, all the actions will be performed without bounding the confidence, as not many are normally carried out, and in section 6 a method for choosing when to perform an action or stay away from the market will be shown, although not from a technical point of view.
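For reference, the class threshold discussed above maps a daily percentage change into one of the three target classes as sketched below. The function name is hypothetical, and the treatment of values exactly at the threshold is an assumption, since the original text only states that 0.62% becomes a buy action with the ±0.60% threshold.

```python
def label_change(change_pct, threshold=0.60):
    """Map a daily percentage change into a target class.

    With threshold=0.60, an increase of 0.62% becomes "buy",
    whereas it would have been "keep" with the previous 0.65% threshold.
    """
    if change_pct > threshold:
        return "buy"
    if change_pct < -threshold:
        return "sell"
    return "keep"
```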
5.6.1 With explicit validation

The first of the hybrid models is explained in section 4.3.1; it is very similar to the basic model, but a new network is generated for each new sample, using data as recent as possible. As mentioned earlier, the main disadvantage is the time taken to search for the ideal parameters, since one network must be trained per sample of the second validation period. The experimentation was reduced a little by using parameters not too different from the good ones obtained with the basic model, in order to finish the experiments in a reasonable amount of time. The range of parameters used was the following: the thresholds used for separating the classes were ±0.60% and ±0.65%; the starting date oscillated between July 2009 and July 2012; window sizes between 40 and 140 values, plus a few of the best patterns from the averages model; hidden layer sizes between 10 and 100 neurons; learning rates between 0.003 and 0.011; and momentums lower than or equal to 0.05. The parameters producing the best results in terms of final amount of money were the following: starting date, July 2010; window size of 80 values and a hidden layer of 25 neurons; learning rate of 0.003 and a momentum of 0.005. The average best epoch of this model's executions was 791 epochs, making a total of $149 in 106 movements, a rate of 0.38% per movement and a success rate of 63.2%. In total, 794 experiments were run, with pretty good results, as 577 configurations managed to earn money while 217 ended up with a balance lower than $100, meaning that 72.77% of the experiments were positive, whereas in previous experiments this percentage was closer to 50%. One of the most influential parameters is the starting date, and according to it the result set was bounded to keep just the experiments with starting dates of November 2011, January 2012, and February 2012. Setting the class threshold to ±0.60% as well, the number of experiments gets reduced to 288, with only 10 of them ending up with a negative economic balance. When choosing the best result, a group of experiments got outstanding results, formed by a window size of 140, a hidden layer of 20, a momentum of 0.05 and different starting dates and learning rates. Six experiments are in this set, with the same results: $104.5 in 5 movements, a very good ratio of 0.88% per action. The ratio is outstanding, but taking a configuration with such a small number of movements might be risky, mainly when a priori better options show up in the rest of the set. Finally, the configuration tagged as best was the one where each parameter performed its best in the total set. These parameters were as follows: starting date, January 2012; window size, 80 values; hidden layer size, 60 neurons; learning rate of 0.07 and a momentum of 0.05. This execution managed to earn a total of $13.6 on top of the initial $100 (ending with $113.6) in 32 movements, meaning a benefit rate of 0.4% and a success rate of 65.6%. The average best epoch of the 150 networks trained for the prediction was 391.6 epochs.
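The per-action benefit rates quoted throughout these results appear consistent with a compound (geometric-mean) rate, i.e., the constant per-movement rate that turns the initial money into the final amount. This is an interpretation of how the reported figures were obtained, shown here only as a sketch; the helper name is not taken from the project's code.

```python
def benefit_rate_per_action(initial, final, n_actions):
    """Compound per-action benefit rate: the constant rate that turns
    `initial` into `final` after `n_actions` movements."""
    return (final / initial) ** (1.0 / n_actions) - 1.0

# Example with the figures above: $100 -> $113.6 in 32 movements
# gives roughly 0.004, i.e., about a 0.4% benefit rate per action.
rate = benefit_rate_per_action(100.0, 113.6, 32)
```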
When applying these ideal parameters to the test dataset, the starting date is moved to August 2012 in order to keep the series length constant, and the results obtained are as follows: average best epoch, 152.24; a total of $102.52, meaning a benefit of $2.52 in 5 movements; a benefit rate of 0.505% and a success rate of 80% (4 out of 5 actions were right).

5.6.2 With implicit validation

The last of the results corresponds to the second hybrid model, where the use of the first validation dataset is completely skipped when training the networks used to predict new values, as explained in section 4.3.2. Apart from the already explained advantage of training with data which is chronologically closer to the values being predicted, another positive aspect of the present model is that the results obtained in the second validation dataset should be more reliable. This is because the network that minimizes the cross-entropy error is not used directly; instead, its features are applied to a different set of data, which increases the importance of both the parameters and the number of training epochs. Theoretically, this is a good first step to mitigate the problem of the extrapolation of results present in other models, which will be explained in section 7.3, as something like a pre-testing phase is carried out before the actual test of the results. With the current model, the ranges of parameters used for scanning are pretty much the same as in the hybrid model with explicit validation, as theoretically the results should not be too different. Similarly to the previous model's results, very few actions are performed when ±0.65% is the threshold used to separate the three classes, so again most of the experiments will be performed using ±0.60% as the divider threshold for deciding the actions. Coming to the actual results, this model shows fairly good general results, as 182 out of 204 experiments obtained a positive rate, meaning that 89.2% of the experiments managed to gain some money, while 10.8% ended up with less than the initial amount. Again, the results show a certain reliability of this method, as the differences between similar parameter configurations are very small. Also, the average cross-entropy error is low, with a maximum of no more than 1.08, while in other models it was not uncommon to have samples with extremely high errors, meaning that no convergence was reached. The results improve when only experiments using 1st November 2011 as their starting date are taken into account. In this new set of 50 results, the worst case managed to get $103.9 in 32 movements, meaning a rate of 0.12% per movement, and its cross-entropy error was 1.0602. In this set, obviously, 100% of the experiments ended up with a money gain, as the minimum gain was $3.9, while the maximum amount of money was obtained with the following parameters: starting date, November 2011; window size, 90 values; hidden layer size, 40 neurons; learning rate of 0.09 and a momentum of 0.05. With these parameters the cross-entropy error went down to 1.057, and the amount of money obtained was $112.6 in 22 movements, meaning a benefit ratio of 0.55%. The success rate was 63.6%, and the average best epoch while
training was 294 iterations. As in previous sections, these parameters need to be applied to the test dataset, with the only difference being the starting date, which moves from November 2011 to June 2012 in order to keep the length of the series constant. When testing the series, there is a big drop in the results, as the money obtained was $100.3, but using only one movement out of the 90 possible actions, which means that the ratio of benefits per action is not too bad: 0.3% per movement. In this case the cross-entropy error went up to 1.063, which is still good, while the average best epoch was 78 iterations per trained network. The results are not as good as expected after the optimism generated in the validation period, mainly because the number of actions is too low, as in some previous cases. Finally, after all the models have been considered and the results shown, and as a continuation of the summary in section 5.5, it can be said that in general terms the hybrid models have been more reliable than the basic ones. Simpler models, such as the basic one or the one modified to use averages as inputs, have managed to get more money in both the validation and test datasets, but the hybrid models have managed to keep the cross-entropy error lower, avoiding irregularities. Also, more networks were taken into account for each series execution, meaning that isolated good experiments might be obtained by luck, but this is not likely to happen when considering the average of 150 networks. Lastly, and because of the problem's nature, reliability is extremely critical, meaning that the models with a more practical application would be the hybrid ones; concretely, the one using an implicit validation dataset, as its results might be more easily trusted due to the use of an extra series of values for training, which gives something similar to an extra test phase.
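Before moving on to the combination of series, the per-sample retraining loop shared by both hybrid models can be summarized as the following sketch. The `train_network` and `predict` callables are placeholders for the actual toolkit calls, and the way `n_epochs` is chosen is the only point where the explicit and implicit variants differ; this is a simplified illustration, not the project's implementation.

```python
def hybrid_walk_forward(series, window_size, train_length, n_epochs,
                        train_network, predict):
    """Walk-forward scheme: for every day to be predicted, a fresh network is
    trained on the most recent `train_length` samples and used only for that
    day's prediction.

    `n_epochs` would come either from an explicit validation split or from the
    average best epoch found during pre-training (implicit validation).
    """
    actions = []
    for t in range(train_length, len(series)):
        recent = series[t - train_length:t]                # most recent data only
        net = train_network(recent, window_size, n_epochs)  # placeholder call
        actions.append(predict(net, series[t - window_size:t]))
    return actions
```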
6 Combination of models

Previous sections have shown how one single financial series can be predicted using different methods, with their corresponding results. It has been demonstrated that normally, when not many actions are performed in a series, the results tend to be better than when using a lot of them, either by choosing them by confidence or by reducing the decision boundaries of the buy/keep/sell classes. Table 4 clearly illustrates this. But what if more actions are to be performed without losing performance? One of the solutions applicable in a real case would be the use of more than one series, and this is what is explained in this section. The series used to evaluate all the methods of the present document was Abertis (ABE.MC in Yahoo Finance), as mentioned at the beginning of it. In order to expand the experiments, the rest of the series included in the IBEX35 will be used, as they have similar behaviors due to their strong dependence on the Spanish economy. The idea of this section is to show a simple demonstration of how to combine different series in a practical way, so the training will not be as deep as in previous sections, and the parameters used for training the networks of the different series will be the same for all of them. The list of series considered, represented by their Yahoo Finance codes, is the following:

ABE.MC   BME.MC   GAS.MC   MAP.MC   SAN.MC
ACS.MC   CABK.MC  GRF.MC   MTS.MC   SCYR.MC
AMS.MC   DIA.MC   IAG.MC   OHL.MC   TEF.MC
ANA.MC   ENG.MC   IBE.MC   POP.MC   TL5.MC
BBVA.MC  FCC.MC   IDR.MC   REE.MC   TRE.MC
BKIA.MC  FER.MC   ITX.MC   REP.MC   VIS.MC
BKT.MC   GAM.MC   JAZ.MC   SAB.MC

Table 5: List of the stock market series' codes considered in this model.

The IBEX35 index is composed of 35 different stocks, but as ABG-P.MC stocks have been trading in the Spanish exchange market for less than two years, the easiest solution was to stop considering them, as the corpus of 34 series is big enough. When starting the technical part, the first issue comes from the splitting of the data. For the initial series, different thresholds such as ±0.60 or ±0.65 were tested manually, but now some problems appear when using the same threshold for all the series, as a good threshold for one series might mean something completely different for another. The solution applied to this issue consists in automatically adjusting the threshold for each starting date of each series in order to minimize the standard deviation of the sizes of the three classes (buy, keep, sell). This is done by a small algorithm that takes the greatest of the series' differences in absolute value as the starting threshold and moves it down repeatedly until the standard deviation is minimal. For instance, assume the simple series shown in Table 6:
Day  Difference
1    1.5%
2    -0.6%
3    0.0%
4    -1.1%
5    0.7%
6    1.9%
7    -3.6%
8    -0.9%
9    4.0%
10   1.2%

Table 6: Example series with daily variations through 10 days.

In the example above a series of ten values is considered, so ten iterations are needed. The algorithm proceeds as Table 7 shows:

Iteration  Threshold  Buy  Keep  Sell  StD   Best StD  Best Threshold
0          ±∞         0    10    0     5.77  5.77      ±∞
1          ±4.0       1    9     0     4.93  4.93      ±4.0
2          ±3.6       1    8     1     4.04  4.04      ±3.6
3          ±1.9       2    7     1     3.21  3.21      ±1.9
4          ±1.5       3    6     1     2.52  2.52      ±1.5
5          ±1.2       4    5     1     2.08  2.08      ±1.2
6          ±1.1       4    4     2     1.15  1.15      ±1.1
7          ±0.9       4    3     3     0.58  0.58      ±0.9
8          ±0.7       5    2     3     1.53  0.58      ±0.9
9          ±0.6       5    1     4     2.08  0.58      ±0.9
10         ±0.0       6    0     4     3.06  0.58      ±0.9

Table 7: Example execution of the algorithm for choosing the classification threshold of a series according to the standard deviation of the classified samples.

In the example it can be seen that the threshold producing the best distribution of the data is 0.9, which brings the standard deviation down to 0.58, and it would be the one used for this ten-value series. Once the splitting of classes is understood, the methodology for choosing the action to perform can be considered. A few different formulas will be tested, from simple ones, such as just taking the action with the maximum confidence, to more complex ones, such as using the last 50 actions and computing the final money by simulating the series, or using the average benefit ratio. These different approaches will be explained in more depth in the following subsection. As a last concern of this model's preparation, the training method comes into play. The same method used for previous models will be applied here: given a set of parameters, all the samples forming the validation dataset will be predicted and their results summarized. The average of these results' summaries will be used as a measurement for the given set of parameters, in order to choose the most suitable set of them.
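As a recap of the class-splitting procedure illustrated in Tables 6 and 7, the following sketch reproduces the threshold search. It assumes a buy when the difference is greater than or equal to the threshold and a sell when it is less than or equal to the negative threshold, and uses the sample standard deviation of the three class sizes; these assumptions reproduce the counts in Table 7, but the original implementation may differ in detail.

```python
import statistics

def choose_class_threshold(differences):
    """Pick the buy/keep/sell threshold that minimizes the sample standard
    deviation of the three class sizes, as in Tables 6 and 7."""
    best_threshold, best_std = None, float("inf")
    # Candidate thresholds: the absolute daily differences, largest first.
    for threshold in sorted({abs(d) for d in differences}, reverse=True):
        buy = keep = sell = 0
        for d in differences:
            if d >= threshold:
                buy += 1
            elif d <= -threshold:
                sell += 1
            else:
                keep += 1
        std = statistics.stdev([buy, keep, sell])
        if std < best_std:
            best_threshold, best_std = threshold, std
    return best_threshold, best_std


# Example series from Table 6 (daily variations in %):
diffs = [1.5, -0.6, 0.0, -1.1, 0.7, 1.9, -3.6, -0.9, 4.0, 1.2]
print(choose_class_threshold(diffs))   # -> (0.9, 0.577...), matching Table 7
```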