Time Series Prediction with Reservoir
Computers using a Delay Coupled Non-Linear
System
Henning Lange
January 26, 2012
Contents
1 Introduction 4
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Stock price as a time series . . . . . . . . . . . . . . . . . . . . 4
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Stationarity in time series 7
2.1 What is stationarity? . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Different types of non-stationarity . . . . . . . . . . . . . . . . 7
2.2.1 Non-stationarity in mean - NARMA10 . . . . . . . . . 8
2.2.2 Non-stationarity in variance - NARMA10 . . . . . . . . 9
2.2.3 Non-stationarity in mean and variance - NARMA10 . . 9
2.2.4 Natural economic time series - Google Stock price . . . 10
3 Mackey Glass reservoir 11
3.1 What is reservoir computing? . . . . . . . . . . . . . . . . . . 11
3.1.1 Internal states of the reservoir . . . . . . . . . . . . . . 12
3.1.2 Output generation and training . . . . . . . . . . . . . 13
3.1.3 The reservoir computing paradigm . . . . . . . . . . . 13
3.1.4 The advantages of reservoir computing . . . . . . . . . 13
3.2 Mackey Glass . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 How are inputs fed into the system? . . . . . . . . . . . 15
3.2.3 How is the reservoir input non-linearly transformed? . 16
3.2.4 How are virtual nodes interconnected? . . . . . . . . . 19
3.2.5 How is the output generated and how are the output weights trained? . . . . . . . . . . . . . . . . . . 21
3.2.6 Why is weak stationarity a necessary condition for learnability with LSMs? . . . . . . . . . . . . . . . 23
4 Detrending techniques 23
4.1 What happens when no detrending is employed? . . . . . . . . 23
4.1.1 Results - no detrending - non-stationarity in mean . . . 24
4.1.2 Results - no detrending - non-stationarity in variance . 25
4.1.3 Results - no detrending - non-stationarity in mean and variance . . . . . . . . . . . . . . . . . . . . . 26
4.1.4 Results - no detrending - natural economic time series . 26
4.1.5 Summary results - no detrending . . . . . . . . . . . . 28
4.2 On the expressiveness of the results . . . . . . . . . . . . . . . 28
4.3 Bipolarized target . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Results - bipolarized target . . . . . . . . . . . . . . . 30
4.4 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.1 Results - differencing . . . . . . . . . . . . . . . . . . . 32
4.4.2 Implicit assumptions when differencing . . . . . . . . . 33
4.5 Log differencing . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.1 Results - log-differencing . . . . . . . . . . . . . . . . . 34
4.6 High-pass filter . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6.1 What kind of non-stationarities are induced by low-frequency oscillations? . . . . . . . . . . . . . . . 35
4.6.2 Characteristics of the high-pass filter at hand . . . . . 37
4.6.3 Reconstructing the time series . . . . . . . . . . . . . . 38
4.6.4 Results - high-pass filtering . . . . . . . . . . . . . . . 38
5 Conclusion 40
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 On the natural time series . . . . . . . . . . . . . . . . . . . . 41
A Normalized root mean square error 43
B On NRMSE and correlation 43
C Signum function 43
1 Introduction
1.1 Motivation
Predicting time series has always been of great interest. There seems to be
an abundance of scenarios in which knowledge of future values of a time
series would be very desirable. In addition to finance, where the incentive to predict time series is obvious, the social sciences, ecology and meteorology, to name a few, provide other examples in which tools for predicting future values of time series are desired. Statistical models, such as autoregressive or moving average models, are typically employed for this task.
Reservoir computing, a novel recurrent neural computation framework, is a promising alternative to statistical models for a number of reasons. Classical statistical models (i.e. autoregressive or moving average models) can only capture linear relations between past and future values, whereas in reservoir computing non-linear relations can be learned by linear training methods. To be more precise, when predicting out of the auto-structure, computations with reservoirs allow the discovery of linear relations of future values with a pool of memory-dependent non-linear transformations of past values. But, as we will see in more detail later, a necessary condition for a time series to be predicted by means of reservoir computing is stationarity.
This thesis will empirically investigate different techniques for transforming non-stationary time series into stationary time series and their impact on the learnability with Mackey-Glass reservoirs.
1.2 Stock price as a time series
There is a natural incentive to predict stock prices and, more importantly for this thesis, they often exhibit non-stationarity. Because of this, they will serve, in addition to three artificial tasks defined later, as a basis of this empirical investigation. Furthermore, in order for the prediction of a future value of a time series to be useful, it is often sufficient to make short-term predictions. For example, the knowledge of the
stock price one minute ahead would be enough to make significant profits. Mathematically speaking, when predicting stock prices, all information at time t and prior to t can be used to predict the stock price at time point t + 1; moreover, what counts is often not the perfect prediction of the value but the general direction (whether the price will increase or decrease). As we cannot simply predict the stock value because of the trend, a detrended transformation of the stock value must be predicted, but the real stock price
must be inferable from the detrended prediction. This characteristic, which will also be imposed on the artificial tasks, poses a direct requirement for the detrending techniques, namely that the prediction of the real value at time t + 1 (or at least a prediction of the direction of the real value) must be reconstructable solely from information available at time t. In other words, there must exist a function f^{-1}, which only uses information available at time t, such that f^{-1}(f(y_t)) = y_t, where f is our detrending function and y is the time series to predict.
On top of that, since predicting the exact value is not necessary, typical performance measures for evaluating the prediction, such as the normalized root mean square error (NRMSE; see appendix A) or correlation, do not always make much sense. It is easy to construct scenarios in which a time series with a lower correlation or higher NRMSE would generate more profit than another time series with greater correlation and lower NRMSE, if one placed bets on them in a stock exchange setting (see appendix B for a more thorough explanation). In order to capture this property, a performance measure is introduced which is closely related to the averaged gain (or loss) that betting on the prediction would have generated in a stock exchange scenario where in each time step 1 stock is exchanged. Note that all transaction and similar fees are neglected. Let sgn(x) be the signum function (see appendix C) and y_t and ŷ_t be the time series and its prediction respectively; we define the average potential gain apg as

apg(y, ŷ) = 1/(T − 1) · Σ_{t=1}^{T−1} (y_{t+1} − y_t) · sgn(ŷ_{t+1} − y_t), where apg values below 0
denote a potential loss and values above 0 a potential gain. The term sgn(ŷ_{t+1} − y_t) can in principle be replaced by anything that conveys whether the model predicts that the time series will increase or decrease. apg(y, ŷ) is possibly unbounded, and more volatile time series can produce higher apg values, whereas the perfect prediction of a constant time series will only yield an apg of 0. In order to overcome this problem, we define

rapg(y, ŷ) = apg(y, ŷ) / apgmax(y), with apgmax(y) = 1/(T − 1) · Σ_{t=1}^{T−1} (y_{t+1} − y_t) · sgn(y_{t+1} − y_t) = 1/(T − 1) · Σ_{t=1}^{T−1} |y_{t+1} − y_t|.

rapg(y, ŷ) is bounded by −1 and 1 and can in principle be used to compare model performance across different time series. A more fine-grained model of the stock exchange scenario would allow buying in principle any amount of stock; the potential gain in every time step would then be the quotient instead of the difference of the time series at successive time steps. The problem with that approach is that it becomes nonsensical for negative values, and the artificial tasks may contain negative values.
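As a sketch, the apg and rapg measures can be computed as follows. The function and variable names are my own, and the predicted direction is taken as sgn(ŷ_{t+1} − y_t), one natural reading of the definition above (the true value y_t is known at prediction time).

```python
import numpy as np

def apg(y, y_pred):
    """Average potential gain: the realized price change times the predicted
    direction, averaged over T - 1 steps (transaction fees neglected)."""
    dy = np.diff(y)                           # y_{t+1} - y_t
    direction = np.sign(y_pred[1:] - y[:-1])  # predicted up/down move
    return np.mean(dy * direction)

def rapg(y, y_pred):
    """apg normalized by apgmax, the gain of a perfect direction predictor,
    so the result lies in [-1, 1] and is comparable across time series."""
    return apg(y, y_pred) / np.mean(np.abs(np.diff(y)))
```

A perfect prediction (y_pred equal to y) yields rapg = 1; always betting against the true direction yields rapg = −1.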
1.3 Overview
The goal of this thesis is to investigate different detrending techniques and their impact on the learnability with a special type of reservoir computer.
As this investigation is empirical, the performance resulting from detrending
and predicting has to be measured on specific tasks. As stated above, four different tasks will be employed, namely three artificial tasks where the respective type of non-stationarity is known and one "natural" task: predicting future values of a stock. The next chapter deals with the question of what stationarity means. Furthermore, different types of non-stationarity are analyzed, and example processes exhibiting these types of non-stationarity are given, which will later be used to evaluate the different detrending techniques. In chapter 3, the questions of what reservoir computing is and how it works are addressed. A special type of reservoir was employed in this thesis, in which spatial multiplexing is substituted by temporal multiplexing; this allows the computations to be carried out by a single node, which in turn can be implemented by a laser. This technique has certain implications which are also addressed in chapter 3. The fourth chapter is concerned with different detrending mechanisms; their impact on the performance of the model will be evaluated there. The chapter can roughly be divided into three parts, namely bipolarization, differencing and high-pass filtering. The performances of the different detrending techniques are also discussed in chapter 4. The fifth and last chapter summarizes the findings of this thesis and gives a conclusion.
2 Stationarity in time series
2.1 What is stationarity?
Based on [2], in order to understand the concept of stationarity, time series, which are typically understood merely as sequences of data points at a fixed temporal interval, need to be viewed from a different perspective. For the notion of stationarity, it is useful to view a time series as a realization of a stochastic process. From this perspective, a time series is a sequence of (dependent) random variables; a time series of length T can thus be seen as a T-dimensional probability distribution, denoted by {X_t}_{t=1}^{T}. Hence, every data point x_t can be seen as a sample of X_t, and every random variable X_t is associated with a mean µ_t and a variance σ²_t, which are typically unknown because they cannot be inferred from a single realization.
In this sense, a time series is said to be strictly stationary iff the joint probability distribution of any set of time points t_1, t_2, ..., t_m is invariant under time shifts, thus ∀k: P(X_{t_1} = x_1, ..., X_{t_m} = x_m) = P(X_{t_1+k} = x_1, ..., X_{t_m+k} = x_m).
Weak (or wide-sense) stationarity is a weaker notion implied by strict stationarity. A time series is said to be weakly stationary iff the joint probability distribution of any 2-element set of time points t_m, t_n is invariant under time shifts, thus ∀k: P(X_n = x_n, X_m = x_m) = P(X_{n+k} = x_n, X_{m+k} = x_m). Self-evidently, strong stationarity implies weak stationarity, and weak stationarity implies in turn that the mean and variance of the time series are constant over time and that the covariance depends only on the shift in time k, thus Cov(X_t, X_{t+k}) = γ_k. Note that by assuming weak stationarity, the mean, variance and covariance function may be estimated from the realization, since we assume that they are constant over time and thereby stem from the same probability distribution.
As we will see later, a necessary condition for a time series to be modelled by means of reservoir computing is weak stationarity.
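A rough empirical check of weak stationarity (a diagnostic sketch, not a formal test; names are my own) is to compare means and variances over disjoint bins of a realization, as done in the figures of the next section:

```python
import numpy as np

def binned_moments(x, bin_size=500):
    """Split a realization into disjoint bins and return the per-bin mean and
    variance; a clear drift across bins hints at non-stationarity."""
    x = np.asarray(x, dtype=float)
    n_bins = len(x) // bin_size
    bins = x[:n_bins * bin_size].reshape(n_bins, bin_size)
    return bins.mean(axis=1), bins.var(axis=1)
```

For a weakly stationary realization, the per-bin moments should fluctuate around a constant level, with the fluctuation explained by the stochastic nature of the process.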
2.2 Different types of non-stationarity
The two most common types of non-stationarity are, based on [8], non-stationary mean and non-stationary variance. Economic time series which are non-stationary in variance are also often non-stationary in mean. In the next part, three different tasks are defined which specifically exhibit non-stationarity in mean, non-stationarity in variance, and non-stationarity in mean and variance. All tasks are derivations of the NARMA10 (Non-linear Autoregressive Moving Average of order 10) task, which was introduced in [10] and has become a benchmark for reservoir computing tasks. Diverging from
predicting time series out of the auto-structure, which basically means that the input of the system and the target are the same time series shifted by 1 time step, in the NARMA10 task the input u is a series of random numbers drawn from a uniform distribution over the interval [0, 0.5] and the target is defined by the recursive function

y_{t+1} = 0.3·y_t + 0.05·y_t·(Σ_{i=0}^{9} y_{t−i}) + 1.5·u_t·u_{t−9},

where u_t is the input at time t.
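A minimal sketch of generating a NARMA10 input/target pair following the recursion above. The first 10 target values are simply initialized to zero here; the function name and seed are my own choices.

```python
import numpy as np

def narma10(T, low=0.0, high=0.5, seed=0):
    """Generate input u ~ Uniform(low, high) and the NARMA10 target
    y_{t+1} = 0.3 y_t + 0.05 y_t (sum_{i=0}^{9} y_{t-i}) + 1.5 u_t u_{t-9}."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(low, high, size=T)
    y = np.zeros(T)
    for t in range(9, T - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * y[t - 9:t + 1].sum()  # y_{t-9} .. y_t
                    + 1.5 * u[t] * u[t - 9])
    return u, y
```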
2.2.1 Non-stationarity in mean - NARMA10
Figure 1: Instance of NARMA10 with non-stationary mean. The blue graph depicts the NARMA10 target after a linear trend is added. The red graph depicts the variance of bins of 500 timesteps each. The green line denotes the mean of the entire target. One can easily see that the target values in the first part are significantly smaller than those of the second part, which leads to the conclusion that the mean does not remain constant over time.
In order to induce non-stationarity in mean, a linear trend is added to
the existing task. The inputs of the system remain random numbers from a uniform distribution over the interval [0, 0.5], but the target values are altered: y¹_t = y_t + 0.0001·t, where t ∈ {1, ..., 7000} denotes the position in the target. Note that the linear trend is added to an existing NARMA10 target and is not propagated further by recursion. See Figure 1 for a plot of a realization of such a new target with non-stationary mean. Visual inspection makes it apparent that the mean of the target increases over time and is therefore not constant. The target values of the first part are often below the overall mean, whereas target values of the second part are often above the mean. The variance is still stationary and is plotted in bins of 500 time steps each. The fact that the variances of different bins fluctuate is explained by the stochastic nature of the process; the variance is still deemed constant.
2.2.2 Non-stationarity in variance - NARMA10
To induce non-stationarity in variance, two alterations to the existing NARMA10 task have to be carried out. First, the mean of the target has to be shifted to 0 to make it invulnerable to the later alterations, which can be done by altering the interval of the uniform random inputs u to [−0.5, 0.5]. Second, in order to induce a time dependence of the variance, the target values are multiplied by a time-dependent term: y²_t = y_t·(0.2 + t/7000), where t ∈ {1, ..., 7000} again denotes the position in the target. Again, the alterations are carried out on the existing NARMA10 target, and the alterations to a single target value are not propagated to other values by recursion! Figure 2 shows the plot of a realization of such a new target. The red graph denotes the variance of bins of 500 time steps each. One can easily see that the variance increases over time, thus it is not constant. The mean still seems to be constant over time.
2.2.3 Non-stationarity in mean and variance - NARMA10
The last of the artificial benchmark tasks exhibits non-stationarity in mean and variance; it is a combination of the two defined above. The target is made non-stationary in variance by the same steps as above, namely shifting the mean to 0 by altering the interval from which the inputs are drawn to [−0.5, 0.5] and multiplying the resulting target by a time-dependent term. After that, in order to induce non-stationarity in mean, a linear trend is added, thus y³_t = y_t·(0.2 + t/7000) + 0.0001·t, with t ∈ {1, ..., 7000} again denoting the position in the target. Figure 3 depicts such a new target. The variance as well as the mean increase over time, hence they are not constant.
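The three artificial targets can be sketched as post-hoc transformations of an existing NARMA10 target y (for the variance tasks, y is assumed to be generated with inputs from [−0.5, 0.5] so its mean is near 0; the transformations are applied after the recursion and are not propagated). The function name is my own.

```python
import numpy as np

def make_nonstationary_targets(y):
    """Given a NARMA10 target y of length 7000, build the three
    artificial tasks: non-stationary mean, variance, and both."""
    t = np.arange(1, len(y) + 1)
    y1 = y + 0.0001 * t                       # non-stationary mean
    y2 = y * (0.2 + t / 7000.0)               # non-stationary variance
    y3 = y * (0.2 + t / 7000.0) + 0.0001 * t  # non-stationary mean and variance
    return y1, y2, y3
```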
Figure 2: NARMA10 with non-stationary variance. The target (blue graph) is altered in a way that it is no longer stationary in variance. The target was divided into 14 bins of 500 timesteps each. The red graph denotes the variance of the respective bin, whereas the green line depicts the overall mean.
2.2.4 Natural economic time series - Google Stock price
As already stated above, in order to evaluate the performance of the different detrending techniques, one natural economic time series will be modelled. The minutely closing values of the Google stock from 1st June 2011 19:10 until 29th June 2011 19:59, in total comprising 7848 data points, will be used for this task. Figure 4 depicts the first 7000 data points of the time series. It is apparent that the mean is not constant, since the first half of the time series is clearly bigger than the mean, whereas the second half is smaller. The variance also seems to vary over time, fluctuating heavily.
Figure 3: NARMA10 target (blue graph) which exhibits non-stationarity in mean and variance. The overall mean is visualized by the green line. The red graph again denotes the variance of the respective time bin. One can easily see that the mean as well as the variance are not constant.
3 Mackey Glass reservoir
3.1 What is reservoir computing?
Reservoir computing is a novel type of recurrent neural network that tries to model spatiotemporal processing in cortical networks. It emphasizes the importance of temporal structure in information. Reservoir computing itself is not an algorithm but a framework, and it subsumes different instances of reservoir computing algorithms. Echo State Networks [16] and Liquid State Machines [17] seem to be the most prominent members of the reservoir computing family. All instances of reservoir computers share that they consist of a random but fixed recurrent neural network, also called the reservoir. The reservoir consists of non-linear interconnected nodes that are driven or
Figure 4: The first 7000 data points of the Google stock price in June 2011. The variance (red graph) as well as the mean do not seem to be constant.
excited by the input. The weights between those reservoir nodes are selected
randomly and remain fixed. The reservoir is connected to a read-out neuron
whose weights are adapted during the training phase. The read-out neuron
is typically a linear node.
3.1.1 Internal states of the reservoir
Because of the recurrent nature of the reservoir, when it is excited by an input, its future state depends on the internal state prior to the input and the input itself. To give a graphic explanation, one can make an analogy with a liquid. Imagine the surface of a liquid which was excited by dropping pebbles of different shapes and weights into it. It is covered in ripples. The ripples, their direction and speed, comprise the internal state of the reservoir. Imagine now that a new pebble (input) hits the surface of the liquid. The new internal state depends on the old state and the characteristics of the
pebble. The ripples, their direction and speed, contain information not only
about the last pebble thrown into the liquid, but also fading information
about past pebbles (inputs). This characteristic is called fading memory.
3.1.2 Output generation and training
The output is generated by a read-out neuron which is connected to a subset
of the reservoir nodes. The read-out neuron is typically a linear node and its
weights are the only ones adapted during training. During training, the weights of the read-out unit are adjusted such that the error between the actual output, generated from the internal state of the reservoir and the read-out weights, and the desired output (also called teacher output or target) is minimized. Different techniques can be employed, for example Perceptron Learning, Generalized Linear Models, or maximum-margin techniques such as Support Vector Machines.
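As a sketch, the linear read-out can be trained by ridge regression, one common linear technique; the thesis itself lists several alternatives, so this choice and all names here are my own illustration.

```python
import numpy as np

def train_readout(states, target, ridge=1e-6):
    """Fit linear read-out weights w minimizing ||S w - y||^2 + ridge ||w||^2,
    where each row of S is the reservoir state for one time step."""
    S = np.asarray(states)
    y = np.asarray(target)
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ y)

def readout(states, w):
    """The read-out is a linear node: a weighted sum of node activations."""
    return np.asarray(states) @ w
```

The small ridge term keeps the solve well-conditioned when reservoir states are strongly correlated.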
3.1.3 The reservoir computing paradigm
Generally speaking, a reservoir computer takes an input stream u_t and maps it onto an output stream y_t, which can also be referred to as the target stream or simply the target. Such a mapping is called a filter in engineering. To be more precise, a reservoir computer can be seen as a cascade of two filters: the first, non-linear filter being the reservoir, the second, most of the time linear, filter being the read-out mapping. Filters can exhibit certain properties, and it has been shown that reservoir computers can in principle approximate every time-invariant filter with fading memory [11]. Classical statistical models for time series assume a relationship between a future value and past values of the time series. Relationships are mostly found in close vicinity, and values in the distant past do not influence future values most of the time [8]. This characteristic is in accordance with the fading memory property of reservoir computers. Classical neural network approaches are not able to naturally incorporate such a memory dependence, although it would be possible to spatialize the temporal dimension, i.e. additionally feed the inputs of times t − 1, ..., t − n at time t into the network. So all in all, reservoir computing approaches seem to be highly suitable for forecasting time series.
3.1.4 The advantages of reservoir computing
Additionally, reservoir computing offers great advantages from a computational machine learning point of view. Most of the classical artificial neural network approaches work on a discrete time scale [4], whereas reservoir computing, as we will see in more detail later, in principle allows for computations on continuous input streams. Furthermore, consider the error function of multi-layer backpropagation networks, which often exhibits multiple local minima. Also consider simple single-node classifiers (e.g. the perceptron or Support Vector Machines), which are often unable to separate inputs in a low dimension [4]. According to Cover's theorem [6], the probability that inputs are separable increases with the number of dimensions the input is non-linearly projected into. The reservoir weights are fixed, and the reservoir serves as a memory-dependent non-linear transformer. As the read-out is connected to (a subset of) the reservoir nodes, and the number of nodes it is connected to should exceed the number of dimensions of the reservoir input, the reservoir can be seen as a non-linear memory-dependent kernel which helps the read-out classifier to separate inputs by non-linearly projecting the input into a higher-dimensional space. To sum up, in theory reservoir computing overcomes the problems of multi-layer backpropagation networks (local minima in the error function) and simple single-node classifiers (linear separability).
3.2 Mackey Glass
3.2.1 Introduction
A special kind of reservoir was employed in this thesis: an instance of a Liquid State Machine (LSM) originating from the PHOCUS project, whose name is an acronym for "towards a PHOtonic liquid state machine based on delay-CoUpled Systems". The computational nodes of the system are delay-coupled dynamical systems which execute the non-linear transformations for the reservoir. The following chapter is based on [3].
In ordinary reservoir computing, the topology of the reservoir is often ran-
dom but fixed. Computational units are often sparsely connected and there
are basically no constraints to the spatial topology of the network. The acti-
vation of all reservoir nodes at a certain time in point t make up the state of
the reservoir at t. The number of reservoir nodes determine the number of
dimensions the input is projected into. In the PHOCUS LSM however, spa-
tial multiplexing is substituted by temporal multiplexing which in principle
means that all computations are carried out by a single node (at the same
point in space) at different points in time whereas in classical LSMs multiple
nodes carry out the computations at the same time at different positions in
space. This technique allows the computations to be carried out by single
laser but also imposes some limitations on the topology of the resulting net-
work. All PHOCUS reservoirs show a special ring structure of virtual nodes.
One might now have the impression that the maximum number of dimensions the input can be projected into is 1, since the number of computational units is 1. If that were the case, the computational power of the system would be disastrous since, according to [6], the higher the number of dimensions an input is projected into, the higher the chance of linear separability. This problem is overcome by considering another temporal dimension of the system and by introducing virtual nodes. As stated above, in classical reservoir computing, every point in time t is associated with a reservoir state x(t) containing the activations of all reservoir nodes at time t. In the PHOCUS LSM, due to temporal multiplexing, the state of the reservoir x(t) is induced by the continuous state of the computational node in the interval (t−1)·τ ≤ s ≤ t·τ, such that the state of the single computational unit during one τ makes up one reservoir state. The PHOCUS LSM consists of virtual nodes. They are called virtual because their state is simply the time-delayed state of the single computational node. During one τ, the single computational node carries out the computation of every virtual node. Thus, if the system consists of N virtual nodes, one τ is divided into N time frames of length θ = τ/N. θ is also called the virtual node distance.
Figure 5 shows an exemplary state of the computational node during one τ. One τ induces one reservoir state. The activation of the nth virtual node is dependent on the state of the computational node during the nth θ.
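Reading one reservoir state off the node's continuous trace can be sketched as follows; the trace is assumed to be sampled finely and to span exactly one τ, and the names are my own.

```python
import numpy as np

def reservoir_state(trace, n_virtual):
    """Extract one reservoir state from the computational node's trace over
    one tau: the activation of the nth virtual node is the node's value at
    the end of the nth bin of length theta = tau / N."""
    trace = np.asarray(trace)
    theta = len(trace) // n_virtual       # samples per virtual node
    return trace[theta - 1::theta][:n_virtual]
```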
3.2.2 How are inputs fed into the system?
In classical LSMs, the input u(t) at time t is fed in parallel to all nodes. Due to temporal multiplexing in the PHOCUS LSM, the input is fed in a serial manner to the different virtual nodes, that is, to the computational node. The discrete input stream u(k) is stretched using a sample-and-hold technique to the length of τ, such that I(t) = u(k) for k·τ ≤ t < (k + 1)·τ.
The input weights of classical LSMs are encoded by a mask function in the PHOCUS LSM. The mask function M(t) is piece-wise constant over the length of one virtual node θ and periodic over τ, such that M(t + τ) = M(t). The values of M(t) over one θ are taken randomly from some probability distribution: M(t) = W^res_{in,i} for θ·i < t < θ·(i + 1). This ensures that the input weight for a single virtual node is constant over different inputs.
If the input is one-dimensional, the input to the computational node is given by J(t) = I(t)·M(t). If the input is multi-dimensional, a mask is created for every input dimension and the input of the computational node is given by the sum of the products of the masks and the respective input dimensions: J(t) = Σ_j I^j(t)·M^j(t). The different input dimensions are collapsed onto a
Figure 5: Exemplary state of the computational node during one τ. If N is the number of virtual nodes the reservoir consists of, then one τ is divided into N (in this case 10) bins. The state of the computational node at the end points of those bins, depicted by the dashed lines, denotes the activation of the respective virtual node.
one-dimensional stream J(t), which may cause information loss. In order to avoid that, one should enforce a constraint on the input weights (the different values M^j(t) can take): if A is a matrix and A_{i,j} denotes the ith input weight of input dimension j (A_{i,j} = M^j(i·θ)), then rank(A) should be equal to the number of input dimensions. This constraint avoids information loss, since every input stream u^j(t) could then be reconstructed given A and J(t).
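The masking step, including the rank constraint, can be sketched as follows for a discrete-time simulation. One mask value is used per virtual node and input dimension; bipolar values are one common choice, and all names here are my own.

```python
import numpy as np

def make_masks(n_dims, n_virtual, seed=1):
    """Draw one bipolar mask value per virtual node and input dimension.
    Redraw until the matrix A has full column rank, so the collapsed
    stream J still determines every input dimension."""
    rng = np.random.default_rng(seed)
    while True:
        A = rng.choice([-1.0, 1.0], size=(n_virtual, n_dims))
        if np.linalg.matrix_rank(A) == n_dims:
            return A

def mask_input(u, A):
    """u has shape (T, n_dims): one sample-and-held input vector per tau.
    Returns J with shape (T, n_virtual): J[t, i] = sum_j u[t, j] * A[i, j]."""
    return np.asarray(u) @ A.T
```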
3.2.3 How is the reservoir input non-linearly transformed?
The virtual nodes of the system are, as stated above, delay coupled dynamical
systems. In order to understand how the computational node non-linearly
transforms the reservoir input, one has to understand what delay-coupled
dynamical systems are and how they function.
Based on [9], there are two types of dynamical systems: iterated maps and
differential equations. For this thesis, knowledge about the latter is sufficient.
A differential equation describes a function where the value of the function
Figure 6: Visualization of the masking process for an exemplary one-dimensional input [-1, 2, -2.5, 1]. First the input is stretched to the length of τ (80 in this example). Then it is multiplied by a, in this case, bipolar mask which is periodic over τ and piece-wise constant over θ. The dashed lines denote the beginning of a new τ.
is deterministically related to the derivative of the function; it, so to say, defines the rules of the evolution of a point in space over time. The space the point evolves in is called the phase space. A typical example of a differential equation, taken from the field of biology, models the population of a certain species. The growth rate of a population (the derivative of the total population) depends on the size of the total population. More cats create more kittens :). If, for example, the population increases by 10% of the total population in one time step, this could be expressed by f′(t) = 0.1·f(t). A natural question which might arise in the context of this simple scenario is, for example, what the population in s time steps is, given an initial population f(t0) = x0 (initial value problem). There are two different approaches to this problem. First, one could analytically solve the differential equation, which can often be done by integration, and compute the solution. This is not always possible.
The second approach is to numerically approximate the solution. There are
many methods to do that. One of them is called ”Heun’s method”.
The general idea of Heun's method is simple. If we want to approximate
the function at f(t0 + s) with f(t0) = x0, we could evaluate the derivative
at t0 and assume that it remains constant over s, which leads to
f̃(t0 + s) = x0 + s · f'(t0), where f̃(x) is the approximation of f(x). It is
obvious that the derivative is in most cases not constant in the interval
t0 ≤ t ≤ t0 + s, and we can refine the approximation by dividing s into n
parts and iteratively computing f̃(t_{i+1}) using f'(t_i), with
t_i = t_{i-1} + s/n, n times until t_i = t0 + s. By doing so, one only
assumes that the derivative remains constant over s/n instead of s. It
follows that if s were divided into infinitely many parts, the approximation
f̃(t0 + s) would equal the real f(t0 + s). Another refinement is to estimate
the average value of the derivative in the interval t0 ≤ t ≤ t0 + s. An
admissible estimate of this average, if f'(t0) and f'(t0 + s) were known,
would be (1/2)(f'(t0) + f'(t0 + s)). The problem is that in order to compute
f'(t0 + s), f(t0 + s) has to be known, but f(t0 + s) is exactly what we
wanted to compute in the first place. A remedy to that problem is to
estimate f(t0 + s) by f̃(t0 + s), then use f̃(t0 + s) to approximate
f'(t0 + s), and use this estimate to approximate the average value of the
derivative in the interval t0 ≤ t ≤ t0 + s. The two refinements can be
combined, and the resulting method is a two-stage iterative process: let
t_{i+1} = t_i + s/n and f(t0) = x0. Compute f'(t_i) using f(t_i), starting
with t_i = t0, and use this to estimate f̃(t_{i+1}) = f(t_i) + (s/n) · f'(t_i).
Then estimate f̃'(t_{i+1}) using f̃(t_{i+1}) and compute
f(t_{i+1}) = (1/2)(f'(t_i) + f̃'(t_{i+1})) · (s/n) + f(t_i).
Repeat these steps until t_i = t0 + s.
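The two-stage process above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation; function and parameter names are chosen here for readability, and it is applied to the growth example f'(t) = 0.1f(t), whose exact solution is x0 · e^(0.1s).

```python
import math

def heun(f_prime, x0, t0, s, n):
    """Approximate f(t0 + s) for f'(t) = f_prime(t, f) with f(t0) = x0,
    dividing s into n sub-steps of length h = s / n."""
    h = s / n
    t, x = t0, x0
    for _ in range(n):
        # Predictor: assume the derivative stays constant over h
        k1 = f_prime(t, x)
        x_pred = x + h * k1
        # Corrector: average the derivative at both ends of the sub-step
        k2 = f_prime(t + h, x_pred)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x

# Growth example from the text: f'(t) = 0.1 f(t), exact solution x0 * exp(0.1 s)
approx = heun(lambda t, x: 0.1 * x, x0=1.0, t0=0.0, s=10.0, n=100)
exact = math.exp(1.0)
```

With 100 sub-steps, the Heun approximation agrees with the exact exponential to well below 10⁻³, illustrating the second-order accuracy of the averaged-derivative correction.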
Up to now, we have investigated ordinary differential equations (ODEs) but
the differential equation employed for the non-linear transformation in the
reservoir is a delayed differential equation (DDE). The difference between
an ordinary differential equation and a delayed differential equation is that
the derivative at time t of a DDE is not only dependent on the value of the
function at time t but also on delayed value(s) of the function. DDEs can
be employed where the cause-and-effect relation is delayed, for example in a
better model of the population size of a species in which the sexual maturity
of the individuals is accounted for. Only an adult cat can create
new kittens :).
The delayed differential equation responsible for the non-linearity in the
reservoir depends on a single delayed function value, thus it is of the form
f'(t) = g(f(t), f(t − τ)). The phase space of such a DDE is infinite-dimensional
because it depends on the continuous initial history in the interval
t0 − τ ≤ t ≤ t0. Since past values of f(t) are known in the context of this
application, the initial value problem can also be solved by Heun's method.
The DDE responsible for the non-linear projections in the reservoir is the
Mackey-Glass equation, but with a few changes:

f'(t) = ( η(f(t − τ) + γJ(t)) / (1 + (f(t − τ) + γJ(t))^p) − f(t) ) / T,

where J(t) is the external reservoir input at time t from the definition
earlier, and η, γ, T and p are adjustable parameters. The non-linear
transformations of a single virtual node are induced by "following" the
Mackey-Glass equation for a time θ (the virtual node distance).
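Following the equation for a time θ can be sketched by combining the Mackey-Glass right-hand side with Heun's method. This is an illustrative sketch, not the thesis's code: the delayed value is held constant within each θ-step (a sample-and-hold simplification), the exponent p = 1 is an assumed placeholder since the text does not report it, and the other parameters follow the setup reported later (η = 0.4, γ = 0.05, T = 5θ, θ = 0.2, τ = 80).

```python
def mackey_glass_rhs(f_t, f_delayed, J, eta=0.4, gamma=0.05, p=1, T=5 * 0.2):
    """Modified Mackey-Glass right-hand side f'(t); p = 1 is an assumption."""
    drive = f_delayed + gamma * J
    return (eta * drive / (1.0 + drive ** p) - f_t) / T

def integrate(history, J_stream, theta=0.2, substeps=10):
    """history: past values covering one delay tau (oldest first).
    Advances the system by one virtual-node distance theta per input sample,
    using Heun's method with `substeps` sub-steps per theta."""
    h = theta / substeps
    states = []
    for J in J_stream:
        f = history[-1]
        delayed = history[0]          # value from tau in the past (held fixed)
        for _ in range(substeps):
            k1 = mackey_glass_rhs(f, delayed, J)
            f_pred = f + h * k1
            k2 = mackey_glass_rhs(f_pred, delayed, J)
            f = f + 0.5 * h * (k1 + k2)
        history.pop(0)
        history.append(f)
        states.append(f)
    return states

# With a constant initial history and zero input, the state relaxes smoothly.
history = [0.5] * 400                 # one tau of history: 400 * theta = 80
states = integrate(history, [0.0] * 400)
```

With zero input the trajectory simply relaxes toward a fixed point, which makes the role of T as the relaxation timescale easy to see.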
3.2.4 How are virtual nodes interconnected?
Figure 7: The characteristic topology of a PHOCUS LSM. Strong weights
form a ring structure, but essentially all neurons are interconnected.
Additionally, all neurons exhibit a self-loop.
In order to understand how the virtual nodes are interconnected, it should
first be clarified what it means for two neurons to be connected: neuron i
is connected to neuron j if the activation of neuron i somehow influences
the activation of neuron j. Applying Heun's method gives a very accessible
explanation of how the virtual nodes are interconnected by the inertia of
the dynamical system: recall that the single computational node executes the
computations of every virtual node during one τ. One τ is divided into N
intervals of length θ, and the state of the ith virtual node is equal to the
state of the computational node at the end of the ith interval. The
computational node follows the Mackey-Glass equation. Let us assume that the
computational node has followed the Mackey-Glass equation until the end of
the ith interval (thus t = nτ + iθ with n ∈ N), and suppose we now apply the
simplest variant of Heun's method for following the Mackey-Glass equation
for θ, namely f(t + θ) = f(t) + θf'(t). The fact that the (i + 1)th virtual
node is coupled to the ith virtual node becomes apparent when examining
f(t + θ) = f(t) + θf'(t), since f(t) and f(t + θ) are in this case the
states of the computational node at the end of the ith and (i + 1)th
intervals, i.e. the activations of the ith and (i + 1)th virtual node,
respectively. The external reservoir input J(t) is injected by influencing
f'(t) (see the definition of f'(t)). One might now get the impression that
the (i + 1)th virtual node is connected to the ith virtual node of the
previous time step, since f(t − τ) is also part of f'(t), but one should
consider that f(t − τ + ε) with 0 < ε ≤ θ denotes the activation of the
(i + 1)th virtual node in the previous time step, and a more fine-grained
computation of f(t + θ) would predominantly take those values into account.
Thus, the (i + 1)th virtual node actually exhibits a self-loop. Consider now
that the computational node computes the state of the (i + 2)th virtual
node. The (i + 2)th virtual node is connected to the (i + 1)th virtual node,
but because the (i + 1)th virtual node is connected to the ith virtual node,
the (i + 2)th virtual node is also connected to the ith virtual node, albeit
to a much smaller degree. To sum up, the topology of a PHOCUS LSM is very
restricted, namely to a very specific ring structure where every neuron is
connected to itself and essentially all neurons are interconnected, but
strong connections run from the ith to the (i + 1)th virtual node.
Analytically solving the differential equation and bringing the activation
into the form of classical LSMs confirms these considerations. Let xi(k) be
the activation of the ith neuron at time k; then the activation is given by

xi(k) = e^(−iθ) xN(k − 1) + Σ_{j=1}^{i} Δij f(xj(k − 1), u(k)),
with Δij = (1 − θ)e^(−(i−j)θ).
This equation shows that the weight between neurons decreases exponentially
with their distance. Furthermore, it shows that θ is of great importance for
the coupling. In addition to θ, T (the timescale of the MG equation) also
seems to matter greatly: T scales the differential equation. If T is large,
the step the dynamical system takes in each time step is small, and vice
versa. Let us again consider the simplest way to estimate the state of the
dynamical system after one θ: f(t + θ) = f(t) + θf'(t). With growing θ, the
influence of f'(t) on f(t + θ) increases, and thus the degree of coupling to
the previous node (the influence of f(t) on f(t + θ)) decreases. In this
case, the influence of the self-loop and the external input, which are
embedded in f'(t), overshadows the coupling with the previous node. If θ is
too small, the external input and the auto-coupling lose their influence and
the coupling to the previous node becomes too strong, which results in very
similar virtual node states since, colloquially speaking, the differential
equation did not have enough time to evolve. T can be used to counter these
effects. Empirical investigations back up these considerations and suggest
that T = 5θ, θ = 0.2, τ = 80, γ = 0.05, η = 0.4 and N = 400 is a good setup
which provides, in addition to reasonable coupling to the previous node and
a self-loop, good integration of the external input.
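The exponential decay of the coupling weights Δij can be checked numerically. The following sketch evaluates Δij = (1 − θ)e^(−(i−j)θ) for the reported θ = 0.2 (the node indices are arbitrary illustration values):

```python
import math

# Effective connection weight between virtual nodes j -> i (i >= j),
# Delta_ij = (1 - theta) * exp(-(i - j) * theta), with theta = 0.2 as reported.
theta = 0.2

def delta(i, j):
    return (1.0 - theta) * math.exp(-(i - j) * theta)

# Coupling from the immediately preceding node vs. a node 10 steps back:
near = delta(10, 9)
far = delta(10, 0)
ratio = far / near   # depends only on the distance: exp(-9 * theta)
```

The ratio depends only on the index distance, confirming that the weight falls off exponentially along the ring.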
3.2.5 How is the output generated and how are the output weights
trained?
The states during one τ make up one reservoir state, and the activation of
the ith virtual node is equal to the state of the computational node after
the ith interval of length θ. Thus, if xi(k) is again the activation of the
ith node at time k and f(t) is the reservoir state at time t, then
xi(k) = f((k − 1)τ + iθ − ε), where ε is very small compared to θ.
ε accounts for the fact that the computations at iθ are already responsible
for the activation of the (i + 1)th node. The activations of all reservoir
nodes at time k can be seen as a vector x(k), which has to be mapped to the
desired or teacher output y(k). In most applications a linear transformation
is employed. There are numerous linear models which can in principle be
used. The model used in this thesis is called the General Linear Model [1].
General Linear Models can be used under the assumption that the target
output y(k) is deterministically related to the reservoir state x(k) up to
some error ε, thus y(k) = l(x(k), w) + ε. ε is in this case assumed to be
Gaussian noise, i.e. it has zero mean and variance σ² (precision β = 1/σ²),
and l is a linear function of the form l(x(k), w) = w0 + Σ_{i=1}^{N} wi xi(k);
thus it is dependent on the model parameter vector w. Because the error
stems from a Gaussian distribution, we can write for the probability of a
target y(k) given model parameters w, reservoir state x(k) and precision β:

P(y(k)|w, x(k), β) = N(y(k)|l(x(k), w), β⁻¹),
with N(y|µ, σ²) = (1/√(2πσ²)) exp{−(1/(2σ²))(y − µ)²}.

It immediately follows that for the expectation of the teacher output given
a reservoir state it holds that E(y(k)|x(k)) = l(x(k), w), which will be
helpful when determining the output of the system once the parameters w are
learned.
The main question remains how w is extracted from the data. Imagine the
different teacher outputs y(k) are grouped into a column vector y with
yk = y(k), so the kth component of y equals y(k), and imagine that the
reservoir states are grouped into a matrix x whose kth row is x(k). We want
to maximize the probability of y given x, w and β. Given that the yk are
independent and identically distributed, the joint probability is given by
the product of the marginal probabilities, thus:

P(y|x, w, β) = Π_{i=1}^{N} P(yi|xi, w, β).

P(y|x, w, β) is called the likelihood function, and what we are trying to
find is w = argmax_w P(y|x, w, β). Since the logarithm is a monotonic
function, maximizing the logarithm of the likelihood equals maximizing the
likelihood:

argmax_w P(y|x, w, β) = argmax_w ln P(y|x, w, β)
  = argmax_w ln Π_{i=1}^{N} P(yi|xi, w, β)
  = argmax_w Σ_{i=1}^{N} ln P(yi|xi, w, β).

Plugging in the definition of P(yi|xi, w, β) yields

w = argmax_w Σ_{i=1}^{N} ln( (1/√(2πβ⁻¹)) exp{−(β/2)(yi − l(w, xi))²} ).

Simplification results in

w = argmax_w [ (N/2) ln β − (N/2) ln(2π) − Σ_{i=1}^{N} (β/2)(yi − l(w, xi))² ].

It is obvious now that we can maximize this expression with respect to w by
minimizing

E(w) = Σ_{i=1}^{N} (yi − l(w, xi))² = Σ_{i=1}^{N} (yi − wᵀxi)²

(we omit w0 in this case for simplicity, but one could imagine adding a
leading column of 1s to x, in such a way that x0 · w0 = w0). This expression
is also called the sum-of-squares error function.
In order to compute the minimum of E(w), we apply well-known techniques to
find the minimum analytically, namely we calculate the gradient and set it
to 0. We begin by expanding E(w):

E(w) = Σ_{i=1}^{N} (yi − wᵀxi)²
     = Σ_{i=1}^{N} (yi − Σ_j wj xi,j)²
     = Σ_{i=1}^{N} [ yi² − 2yi Σ_j wj xi,j + (Σ_j wj xi,j)² ].

For the gradient in the kth direction it then holds:

∂E(w)/∂wk = Σ_{i=1}^{N} ( −2yi xi,k + 2xi,k Σ_j wj xi,j ).

Setting this to 0 and dividing by −2 yields:

0 = Σ_{i=1}^{N} ( yi xi,k − xi,k Σ_j wj xi,j )
  = Σ_{i=1}^{N} yi xi,k − Σ_{i=1}^{N} xi,k Σ_j wj xi,j
  = Σ_{i=1}^{N} yi xi,k − Σ_j x∗,jᵀ x∗,k wj.

Since the formula for each component of w is the same, we can rewrite the
equation above as 0 = xᵀy − xᵀxw. This step can easily be verified by
deriving what holds for the kth row of 0 = xᵀy − xᵀxw. Solving for w now
yields w = (xᵀx)⁻¹xᵀy. (ΦᵀΦ)⁻¹Φᵀ is also called the Moore-Penrose pseudo-
inverse of the matrix Φ and can be seen as a generalization of the notion
of matrix inverse to non-square matrices.
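The closed-form solution w = (xᵀx)⁻¹xᵀy can be sketched directly with NumPy's pseudo-inverse. The data here is synthetic (random design matrix standing in for reservoir states, a made-up weight vector, and Gaussian noise), purely to illustrate the recovery of w:

```python
import numpy as np

# Sketch of the normal-equation solution derived above, on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 "reservoir states", 5 nodes
X = np.hstack([np.ones((200, 1)), X])    # leading column of 1s absorbs w0
w_true = np.array([0.3, 1.0, -2.0, 0.5, 0.0, 1.5])
y = X @ w_true + 0.01 * rng.normal(size=200)   # targets with Gaussian noise

# Moore-Penrose pseudo-inverse applied to the design matrix: w = pinv(X) y
w_hat = np.linalg.pinv(X) @ y
```

With low noise and more samples than weights, the estimated w_hat recovers w_true to a few decimal places; np.linalg.pinv computes exactly the (ΦᵀΦ)⁻¹Φᵀ operator named in the text (generalized via the SVD).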
3.2.6 Why is weak stationarity a necessary condition for learnability
with LSMs?
The question why detrending is essential to be able to model time series
with LSMs has not yet been addressed. It can be answered by looking at the
assumption made when deriving the General Linear Model. In order to state
the likelihood of the conditional probability of y as the product of the
marginal probabilities, the assumption that the yk are independent and
identically distributed was made. Recalling the definition of weak
stationarity, namely that the mean, variance and covariance functions are
constant over time, it becomes obvious why this is a necessary condition:
if the time series is not stationary, the mean, variance or covariance
function is not constant, so the yk cannot be identically distributed, which
is a necessity for the General Linear Model to be applied.
4 Detrending techniques
4.1 What happens when no detrending is employed?
In order to establish a baseline for the performance of the detrending
techniques, the performance of the model without detrending has to be
established. The performance will be evaluated by four criteria: the
normalized root mean square error (NRMSE), the correlation, the averaged
potential gain (apg) and the rapg, which were defined in chapter 1. For the
stock price prediction, the desired output (target) is equal to the input
stream with a time shift such that yt = ut+1, where yt is the target and ut
is the input stream. Thus, the General Linear Model is to find a relation
between the reservoir state after injecting ut and ut+1. For the artificial
tasks, the mapping between ut and yt is approximated; see chapter 2 for the
definitions of yt. The meta-parameters of the system are set to N = 400,
τ = 80, θ = 0.2, γ = 0.05, η = 0.4, T = 1, and these settings are kept the
same when evaluating the different detrending techniques. The time series is
separated into a training set (first 80%) and an evaluation set (last 20%).
The model parameters w are estimated during the training phase and are based
solely on the training set. The above-mentioned performance criteria are
computed solely on the basis of the evaluation set.
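The evaluation protocol can be sketched as follows. This is an illustration with placeholder data, assuming the common definition of NRMSE as the RMSE divided by the target's standard deviation (the exact definitions used in this thesis are given in chapter 1):

```python
import numpy as np

# Chronological 80/20 split, then error measures on the held-out tail only.
def evaluate(target, prediction):
    nrmse = np.sqrt(np.mean((target - prediction) ** 2)) / np.std(target)
    corr = np.corrcoef(target, prediction)[0, 1]
    return nrmse, corr

rng = np.random.default_rng(1)
series = np.sin(np.linspace(0, 20, 1000)) + 0.1 * rng.normal(size=1000)
split = int(0.8 * len(series))            # first 80% would train the model
train, evaluation = series[:split], series[split:]

# Placeholder "prediction": the true tail plus small noise, for illustration.
prediction = evaluation + 0.05 * rng.normal(size=len(evaluation))
nrmse, corr = evaluate(evaluation, prediction)
```

The split is deliberately chronological rather than random; as discussed in section 4.2, shuffling would leak future information into training.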
4.1.1 Results - no detrending - non-stationarity in mean
Figure 8: The evaluation set (blue graph) and its prediction without detrend-
ing (red graph) of a target with non-stationarity in mean.
Figure 8 shows the evaluation set and its prediction of the target with
non-stationarity in mean. One can easily see that the two graphs look
similar, but the predicted values appear to be shifted downwards. These
visual cues are supported by the performance measures: although the
prediction and the actual values operate on two different levels, the
correlation between them is quite high at 0.77. This supports the impression
that the two graphs oscillate around their means in tune. The huge NRMSE of
3.73 as well as the small rapg of 3.7 · 10⁻⁴ reveal that the two graphs
nevertheless operate on two different levels. Apparently, their means are
not equal. Intuitively, the results can be interpreted in the following way:
the mean of the target increases steadily (see Figure 1), and the model
parameters were estimated on the first 80% of the target, where the mean was
still small. When predicting the last 20% of the target, the level of the
values is underestimated. This theory is backed up by comparing the mean of
the training set (0.5326) with the mean of the prediction (0.5314): the mean
of the desired output (0.888) is dramatically underestimated.
4.1.2 Results - no detrending - non-stationarity in variance
Figure 9: Section of evaluation set (blue graph) and its prediction (red graph)
of a target with non-stationarity in variance and no detrending.
Similar results are found when modelling a target process with non-
stationary variance. Figure 9 shows a section of the desired and actual
output of the system. The actual values seem to vary more strongly than
their prediction, but the directions of the variations seem to be in tune.
Again, these considerations are backed up by the performance measures: the
correlation between the prediction and the actual values is fairly high at
0.67, whereas the NRMSE of 0.78 indicates shortcomings of the model. The
high rapg value of 0.77 is very surprising but can probably be explained in
the following way: since the time series has a stationary mean, the
prediction (mean of ∼0) seems to share the mean of the actual target (mean
of ∼0), and because the oscillations are generally in tune (reinforced by
the fairly high correlation), the predictions of the directions (the sign of
yt+1 − yt) are pretty accurate. The rapg basically measures the accuracy of
the prediction of the direction of yt+1 − yt, weighted by the actual
distance between yt+1 and yt. The difference in variances can be interpreted
intuitively, similarly to the difference in means of the time series with
non-stationary mean: during the model estimation phase, the variance was a
lot smaller than in the evaluation phase. The system has no means to learn
the time-dependency of the mean or variance.
4.1.3 Results - no detrending - non-stationarity in mean and variance
Figure 10: Section of the evaluation set (blue graph) and its prediction
(red graph) of a target with non-stationarity in mean and variance, without
detrending.
The results of modelling a target with non-stationarity in mean and variance
are not surprising. Both the mean and the variance seem to be underestimated,
while the general direction of the variations of the predicted time series
seems to be in accordance with the desired target. A fairly good correlation
between the predicted and actual values of 0.56, a large NRMSE of 2.2 and an
rapg of 0.117 reinforce these considerations. Figure 10 shows a plot of a
section of the predicted target (red graph) and the desired target (blue
graph).
4.1.4 Results - no detrending - natural economic time series
The results when modelling the natural economic time series are more
surprising. Figure 11 shows a section of the prediction and the actual
values of the Google stock. One can see that the prediction seems to lag
behind the actual values.

Figure 11: Prediction without detrending of the Google stock (red graph)
and the actual Google stock (blue graph).

This intuition is reinforced when looking at the cross-correlation: the
highest cross-correlation is found when shifting the prediction one time
step into the future. But what does that mean? If one recalls that the input
and desired output of the system are the same time series shifted by one
time step, such that yt = ut+1, and that the prediction ŷ lags one time step
behind the actual values, such that ŷt+1 ≈ yt, one can easily see that the
system is basically just recreating the input series, since ŷt ≈ ut.
Despite the fact that the system is basically just recreating the input
signal, the classical performance measures do not indicate shortcomings of
the model: the correlation between the actual and predicted values, at
0.9978, is very close to 1, whereas the NRMSE of 0.067 is close to 0. This
extraordinary characteristic is explained by the huge autocorrelation the
time series exhibits. Note that if ŷt+1 ≈ yt, the correlation between ŷt and
yt basically measures the autocorrelation of yt with a time lag of 1. A plot
of the autocorrelation function reveals that the autocorrelation decays very
slowly. A slowly decaying autocorrelation function is a well-known trend
indicator [2], and detrending seems to be an admissible effort to overcome
this extraordinary behavior of the system. The rapg of 0.37 also seems quite
high in comparison to the other models, but still only 37% of the possible
profits are made, and the apg of 0.176 shows that the model does not seem to
be profitable in a stock exchange environment. On average, a profit of €0.17
is made in every time step per stock. Considering that the transaction fee
of most depots is approximately 0.23% [12] of the volume traded, that the
average price per stock is €506, and that a stock can on average be held for
1.63 time steps before it must be sold, one loses €0.87 per stock exchanged.
Detrending has to show whether the model performance can be significantly
improved and whether betting on the prediction can be made profitable. A
vague intuitive explanation of these results requires more knowledge about
the results of the different detrending techniques and is therefore deferred
to chapter 5.
4.1.5 Summary results - no detrending
Table 1 shows a summary of the performance of the system without de-
trending. One can easily see that the model performances for the different
tasks are unsatisfactory and that the investigation of detrending techniques
is justified.
            ns mean        ns variance   ns mean & var   google
corr        0.77           0.67          0.56            0.9978
NRMSE       3.73           0.78          2.2             0.067
apg         −3.7 · 10⁻⁵    0.115         0.0191          0.176
rapg        −4.7 · 10⁻⁴    0.77          0.1172          0.37

Table 1: Comparison of the performance of the system when no detrending was
employed.
4.2 On the expressiveness of the results
In the remainder of this chapter, the performance of the different
detrending techniques will be evaluated and compared to the case where no
detrending was employed. The results can vary from trial to trial, and there
are basically two characteristics that cause variability in the performance
results. The first source of variability is the random mask, which
corresponds to the input weights in classical LSMs. These input weights are
drawn from a probability distribution, thus they vary from trial to trial.
In order to account for this variability, the experiments would have to be
redone with different input masks. The problem with this approach is that
the input weights influence the reservoir dynamics, and recalculating the
reservoir states is very time-consuming: computing the reservoir states of
7000 data points takes approximately 20 minutes on an Intel Core i3-2310M
machine with 4 GB of RAM. Experience has shown that different instantiations
of bipolar input masks have only a very small influence on the performance
of the system anyway. For the sake of computational tractability, the
influence of the random input mask on the performance is assumed to be 0.

The second source of variability stems from the partition of the time series
into the training and evaluation sets. Typically, cross-validation is
employed when evaluating the performance of machine learning systems, i.e.
numerous partitions are made and the different results are averaged in order
to cancel out the effects of arbitrarily dividing the time series. This
would not require recomputing the reservoir states. Still, in the context of
these experiments this cannot be done, for the following reason: if we
assume a stock exchange scenario where the trend may have an influence and
we divide the time series in a way that the evaluation set is not the last
part of the time series, we are basically incorporating future information
which in a real-world scenario would not be accessible. Note that the
characteristics of the time series change over time, and the gist of this
thesis is to somehow get rid of the time dependence of the time series! We
are not incorporating future information in the sense that actual future
values somehow influence past values (as filtering the time series with an
acausal filter would do), but in a more subtle way. This concept is probably
best understood with an example: imagine a time series with a linear trend
is divided in such a way that the first and last 40% of the time series
belong to the training set and the remaining 20% are used to evaluate the
model. In the previous section of this thesis, we got the impression that
the system seems to assume that the variability and mean it has
"experienced" in the training set generalize to the evaluation set. In order
to minimize the overall quadratic error on the training set, the system will
have to treat the first part of the training set (the first 40% of the time
series) equally to the second part (the last 40% of the time series), and
thus assume the mean of the time series to lie somewhere between the means
of the first and second parts of the training set. When the performance is
now evaluated, although the time series still exhibits a trend and the
system has no means to learn the time dependence of the time series, the
performance will be quite good, because due to the linearity of the trend
the mean of the evaluation set actually lies between the means of the first
and second parts of the training set.

All in all, there is no tractable way to account for the variability of the
results, and thus we have no statistically grounded approach to see whether
the performance of a detrending model actually lies outside the variability
of the performance of the non-detrending model. What we can do is
quantitatively compare the different models and assume declined or improved
model performance if the model performances are drastically different.

In general, the methodology of this thesis allows only for existentially
quantified statements. Propositions of the kind "this detrending technique
can be used to improve the model performance for all time series with non-
stationarity in mean" are, by the nature of an empirical instead of an
analytical investigation, not possible. But the complexity of such systems
makes analytical investigations very hard, and considering that all natural
sciences are based on empirical studies and induction, this approach is
justified in the opinion of the author. On top of that, the interpretation
of the empirical results can often be backed up by analytical considerations
that are generalizable to other scenarios.
4.3 Bipolarized target
In the previous chapter, we have learned that in order to enable an LSM to
model time series, the yk (the target) have to be independent and
identically distributed. If we assume a stock exchange scenario, the
information whether the stock value will rise or fall is sufficient to place
a bet; thus a very straightforward approach is to bipolarize the target. A
target value of 1 denotes that the time series will rise in the next time
step, 0 denotes no change of the time series, and a target value of −1
denotes that the target will fall. Mathematically speaking, the detrended
target stream y becomes yt = sgn(yt+1 − yt). If we assume that the statistic
of whether the time series will increase or decrease is constant over time,
the mean and variance will be constant and thus we have successfully
detrended the time series. We cannot expect the predictions to be exactly
−1, 0 or 1, thus they have to be mapped to those values in order to evaluate
the model: an arbitrary threshold ε is introduced, and if a predicted value
is below −ε it is said to be −1, if it is greater than ε it is set to 1, and
all other values are mapped to 0. In this case ε = 0.25 was chosen. In a
real-world stock exchange application, the threshold for the mapping could
be obtained by reinforcement learning techniques: a value function for
different values of ε could be learned by regarding which apg values
different ε produce. So, in a way, this detrending technique could still be
considered a parameter-free technique.
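The bipolarizing detrend and the threshold mapping can be sketched as follows. The input series and raw predictions below are placeholder values for illustration; ε = 0.25 follows the text:

```python
import numpy as np

def bipolarize(y):
    """Detrended target: the sign of the next step's change, sgn(y[t+1] - y[t])."""
    return np.sign(np.diff(y))

def map_prediction(pred, eps=0.25):
    """Map raw model outputs to {-1, 0, 1} using the certainty threshold eps."""
    out = np.zeros_like(pred)
    out[pred > eps] = 1
    out[pred < -eps] = -1
    return out

# Placeholder series and raw predictions:
y = np.array([1.0, 1.5, 1.5, 1.2, 2.0])
target = bipolarize(y)                              # rise, flat, fall, rise
mapped = map_prediction(np.array([0.8, 0.1, -0.6, 0.3]))
```

Predictions inside (−ε, ε) are mapped to 0, which is what produces the "uncertainty" ratios reported in Table 2.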
4.3.1 Results - bipolarized target
Table 2 shows the resulting performances with a bipolarized target for the
different tasks. The last row depicts the percentage of 0s in the prediction
after it was mapped to −1, 0 or 1 with ε = 0.25. This value can in some way
be interpreted as a measure of uncertainty: the actual target streams very
rarely contain 0s (around 0.1%), and the more 0s a prediction contains, the
more uncertain the system is about the prediction.

         ns mean         ns variance      ns mean & var    google
apg      0.05 (+0.05)    0.043 (−0.072)   0.047 (+0.028)   −6.37 · 10⁻⁶ (−0.176)
rapg     0.64 (+0.64)    0.29 (−0.45)     0.29 (+0.17)     −1.34 · 10⁻⁵ (−0.37)
0's      30.7%           54%              51%              99.4%

Table 2: Comparison of the performance of the system when the target was
bipolarized. The numbers in brackets denote the difference in performance
relative to no detrending. The last row denotes the ratio of 0s after the
prediction was rounded.
In the case of a target which is non-stationary in mean, the performance
increases dramatically: 64% of all possible profits are made, while in 30%
of the cases the prediction could not exceed the certainty threshold. The
problem that the mean was drastically underestimated seems to be overcome.
The percentage of 0s, and thus the level of uncertainty, increases even
further with a target with non-stationary variance. The overall performance
drops severely in comparison to no detrending, which may be due to the high
uncertainty of 54%: it is hard to make profits when you do not make bets.
The results of modelling a target with non-stationary mean and variance are
comparable. The apg increases, probably because the mean is no longer
underestimated, but the high level of uncertainty most likely encumbers high
apg values. Bipolarizing the target when predicting the economic time series
yields disastrous results: the huge uncertainty of 99.4% averts any profits.
4.4 Differencing
Differencing is a well-known detrending technique which is often employed
when modelling time series with classical statistical models [8]. The
concept behind differencing is very easy to understand and can be thought of
as an extension of bipolarizing. Imagine two pairs of consecutive time
points. Imagine that in both cases the time series rises, but in one case it
only increases by a small amount, while in the other case the time series
makes a huge jump. After bipolarizing, both pairs would be represented by a
1. The reservoir states, which are of course also dependent on past inputs,
might be very diverse; nevertheless the linear readout neuron tries to map
both of them to 1. In order to minimize the overall error, one of the two
actual outputs may have to be pushed below the certainty threshold, and in
an unfortunate scenario this happens to the one which would generate a huge
profit. When bipolarizing, the linear readout tries to maximize the
probability of identifying whether the time series will rise or fall (see
3.2.5). But why not weigh this probability by the potential profit which
could be generated if it were predicted correctly? This amounts to
multiplying the bipolarized target by the potential profit:
y^d_t = sgn(yt+1 − yt)|yt+1 − yt| = yt+1 − yt.
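First differencing, and the check worked through in section 4.4.1 that it removes a linear additive trend, can be sketched as follows. The base series here is a random stand-in for a stationary task output; the trend coefficient 0.0001 follows the definition in 2.2.1:

```python
import numpy as np

def difference(y):
    """First difference y[t+1] - y[t], i.e. the detrended target stream y^d."""
    return np.diff(y)

rng = np.random.default_rng(2)
base = rng.normal(size=500)               # stand-in for a stationary task output
trended = base + 0.0001 * np.arange(500)  # additive linear trend as in 2.2.1

d = difference(trended)
# diff(trended) equals diff(base) plus the constant 0.0001, so the linear
# time dependency is gone after differencing:
residual_trend = d - difference(base)
```

The residual is the constant 0.0001 everywhere, mirroring the algebraic identity y¹_{t+1} − y¹_t = yt+1 − yt + 0.0001 derived in the text.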
4.4.1 Results - differencing
         ns mean          ns variance     ns mean & var    google
corr     0.72 (−0.05)     0.3 (−0.37)     0.293 (−0.26)    0.997 (−0.008)
NRMSE    0.70 (−3.03)     1.09 (+0.31)    1.09 (−1.11)     0.074 (+0.007)
apg      0.058 (+0.058)   0.058 (−0.05)   0.06 (+0.04)     0.0245 (−0.15)
rapg     0.73 (+0.73)     0.39 (−0.38)    0.367 (+0.25)    0.052 (−0.32)

Table 3: Comparison of the performance of the system when the target was
differenced. The numbers in brackets denote the difference in performance
relative to no detrending.
In order to make the correlation and the NRMSE comparable, the differenced
values were used to reconstruct the resulting time series in a way that
previous errors do not accumulate, i.e. for the reconstructed time series
y^r it holds that y^r_t = y_t + y^d_t (and not y^r_t = y_0 + Σ_{i=0}^{t} y^d_i).
Differencing yields fairly good results when trying to predict a target with
non-stationary mean: 73% of all possible profits could have been made in a
stock exchange scenario. The NRMSE dropped by a factor of approximately 5 in
comparison to no detrending, but is still fairly high at 0.7. The fairly
good results can be explained by recalling how the non-stationary mean was
induced (see 2.2.1 for a definition): a linear time dependency was added to
the existing NARMA10 task. The time dependency responsible for the linear
trend vanishes through differencing, since

y^1_{t+1} − y^1_t = yt+1 + 0.0001(t + 1) − (yt + 0.0001t) = yt+1 − yt + 0.0001.

What should be pointed out here is that the first difference of the target
with non-stationary mean is equal to the first difference of an ordinary
NARMA10 task with a small shift in mean. The NRMSE for the ordinary NARMA10
task is reported to be 0.15 [3]. Thus, the representation of the target as
the difference seems to be harder to learn, since the NRMSE more than
quadruples.
When the time dependency is multiplicative, i.e. when the corresponding time
series exhibits (at least) non-stationarity in variance, it cannot be
overcome by differencing: y^2_{t+1} − y^2_t = yt+1(t + 1) − yt·t (scaling
coefficient omitted for simplicity). It is easy to see that the dependence
on t cannot be abolished in this case. This explains the bad performance for
the target with non-stationary variance seen in Table 3. The impression that
differencing makes learning the NARMA10 task harder is reinforced by the
fact that the performance actually drops instead of remaining the same.
The overall performance of learning a target with non-stationary mean and
variance improves in terms of NRMSE and rapg, but is impaired in terms of
correlation. These findings are not very surprising, considering that the
additive time dependency is abolished by differencing:
y^3_{t+1} − y^3_t = yt+1(t + 1) − yt·t + 0.0001 (scaling coefficient again
omitted for simplicity). The resulting time series is stationary in mean,
thus the problem of underestimating the mean is overcome, which improves the
NRMSE drastically, as we have seen earlier. The fact that the correlation
decreases is again in accordance with the impression that learning the
differenced NARMA10 task is harder for the system than the non-differenced
NARMA10.
Differencing the natural economic time series yields very unsatisfactory
results. The performance drops with respect to all performance measures.
The correlation and NRMSE still suggest an admissible model, but the rapg
reveals that betting on the model would generate almost no profit. Note
that these results emphasize the importance of introducing the additional
performance measure rapg, since the classical performance measures fail to
reveal the shortcomings of the model in a stock exchange scenario.
Interpreting these results is very hard since little is known about the
characteristics of the time series. But one should recognize that the system
was unable to find a relationship between past values and the difference of
future values (y_{t+1} − y_t). This fact will be useful in a later argument.
4.4.2 Implicit assumptions when differencing
Classical statistical approaches suggest using the second difference, i.e. the
difference of the differenced series y^{2d}_{t+1} = (y_{t+1} − y_t) − (y_t − y_{t−1}), if a time
series cannot be made stationary by taking the first difference [8]. We have
seen above that one implicitly assumes a linear additive trend when taking
the first difference. But what do we implicitly assume when using a second
difference? It is easy to see that the second difference successfully abolishes
a quadratic additive trend: y^{2d}_{t+1} = (y_{t+1} + (t+1)^2) − 2(y_t + t^2) + (y_{t−1} + (t−1)^2) = y_{t+1} − 2y_t + y_{t−1} + 2, so only a constant shift remains.
To sum up, differencing is not assumption-free, and it is only guaranteed to
work if there is either a linear (first difference) or a quadratic (second
difference) trend. In order to successfully detrend a time series by differencing,
the time series at hand has to be analyzed and the type of non-stationarity
has to be identified.
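These implicit assumptions can be checked numerically. The sketch below is a minimal illustration, not the thesis's setup: Gaussian noise stands in for the stationary NARMA10-like component, and the trend coefficient 0.0001 mirrors the definition in 2.2.1. It shows that the first difference removes a linear additive trend, the second difference removes a quadratic one, while a multiplicative trend survives differencing.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1000, dtype=float)
stationary = rng.normal(size=1000)      # stand-in for a stationary target

# Linear additive trend: the first difference leaves only the constant 0.0001.
linear = stationary + 0.0001 * t
d1 = np.diff(linear)                    # y_{t+1} - y_t

# Quadratic additive trend: the second difference leaves only the constant 0.0002.
quadratic = stationary + 0.0001 * t**2
d2 = np.diff(quadratic, n=2)            # (y_{t+1} - y_t) - (y_t - y_{t-1})

# Multiplicative trend: the dependence on t survives the first difference,
# visible as a variance that keeps growing over time.
multiplicative = stationary * t
d1m = np.diff(multiplicative)
```

The differenced multiplicative series still has a variance that grows with t, which is exactly why differencing cannot handle non-stationarity in variance.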
4.5 Log differencing
Classical statistical approaches often transform the time series by taking the
logarithm first; then differencing is applied [8]. For the resulting target
stream it then holds that y^{ld}_{t+1} = log(y_{t+1}) − log(y_t). This approach bears a
fundamental problem: the logarithm of a negative value is not defined in the
real numbers, and since the PHOCUS LSM operates in the reals, the log-difference
is only defined for positive time series. The resulting performance of
log-differencing can therefore only be evaluated on the natural economic
time series, since it is the only strictly positive time series.
When applying log-differencing, identifying the implicit assumption is not
such an easy endeavor. What we can show is that, in comparison to normal
differencing, the time dependency has to vanish by dividing instead of
subtracting two successive data points. If we assume a linear multiplicative
trend (→ (at least) non-stationarity in variance): y^{ld}_{t+1} = log(y_{t+1}(t+1)) − log(y_t t) = log(y_{t+1}/y_t + y_{t+1}/(y_t t)), the time dependency is not fully
abolished, but the influence of t declines with growing t.
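A small numerical sketch makes this declining influence visible. The positive noise series and the scaling coefficient a = 1 are assumptions chosen purely for illustration: under a linear multiplicative trend, the trend's contribution to the log-difference is log((t+1)/t) = log(1 + 1/t), which shrinks towards 0 but never fully vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1, 1001, dtype=float)
stationary = rng.normal(loc=10.0, scale=0.5, size=1000)  # strictly positive stand-in

# Linear multiplicative trend: y_t = y^stat_t * t  (scaling coefficient a = 1)
y = stationary * t

# Log-difference: log(y_{t+1}) - log(y_t) = log(y^stat_{t+1}/y^stat_t) + log((t+1)/t)
ld = np.diff(np.log(y))

# The trend contributes log((t+1)/t): not zero, but decaying with growing t.
trend_term = np.log(t[1:] / t[:-1])
```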
4.5.1 Results - log-differencing
The results for predicting the log-differenced target of the natural economic
time series are disillusioning: the NRMSE and correlation still misleadingly
indicate a very good model, whereas the rapg of −0.33 reveals the disastrous
effects betting on that model would have had. Note that in order to make
the performance measures comparable, the actual time series was reconstructed
by reversing the detrending operations. To sum up, log-differencing
requires the time series to be positive, and it is not easily comprehensible
when log-differencing is guaranteed to abolish a time dependency.
google
cor 0.997 (-0.008)
NRMSE 0.074 (+0.007)
apg −0.15 (-0.32)
rapg −0.33 (−0.70)
Table 4: Comparison of performance of the system when the target was log-
differenced. The numbers in brackets denote the difference in performance
to no detrending.
4.6 High-pass filter
4.6.1 What kind of non-stationarities are induced by low-frequency
oscillations?
The next detrending technique is inspired by signal processing. If we assume
that non-stationarities are induced by low-frequency oscillations, high-pass
filtering the time series could be an admissible detrending technique. A
high-pass filter cuts off frequencies below a certain threshold; frequencies
above that threshold pass the filter almost undamped. But how sound is the
assumption that low-frequency oscillations induce non-stationarities?
In order to investigate the effect of high-pass filtering, the following steps
are taken: 1. Transformation of the time series into frequency domain. 2.
Cut-off of low frequencies. 3. Investigation of the effect of cutting off low
frequencies. Fourier transform and Inverse Fourier transform are employed
to transform the series into frequency and time domain respectively.
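The three steps above can be sketched with NumPy's FFT. This is a minimal illustration under assumed settings (a cutoff at the lowest 5% of the band and a linear additive trend of slope 0.01); the thesis's actual filter is described in 4.6.2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1024
t = np.arange(n, dtype=float)
y = rng.normal(size=n) + 0.01 * t      # stationary part plus additive linear trend

# Step 1: transform the series into the frequency domain.
spectrum = np.fft.rfft(y)
# Step 2: cut off the low frequencies (here the lowest 5% of the band, incl. DC).
cutoff = int(0.05 * len(spectrum))
spectrum[:cutoff] = 0.0
# Step 3: transform back and inspect the effect of the cut.
filtered = np.fft.irfft(spectrum, n=n)
```

Comparing the means of the two halves of `y` and of `filtered` shows that the upward trend in the mean is largely removed by the cut.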
Imagine a stationary time series superposed by an additive trend of the
form at^n, such that y^{nonstat}_t = y^{stat}_t + at^n. In the following part it will
be shown that an additive trend of the form at^n for n ∈ N and n < 4
predominantly affects the amplitude of the low-frequency spectrum and
that high-pass filtering is therefore a sound approach to abolish the time
dependency of the time series. Since the trend component is additive, the
two components can be Fourier transformed and high-pass filtered
independently. For the amplitude A(ω) of frequency ω it holds that
A(ω) = |F(ω)| [14].
Case 1: n = 1:
Since t ≥ 0, we can write for the Fourier transform of at: F_{n=1}(ω) =
∫_0^∞ at e^{−2πitω} dt = −a/(2πω)^2 [13]. With growing ω, |F(ω)| decreases;
thus after high-pass filtering with a suitable cut-off frequency,
F^{hp}_{n=1}(ω) ≈ 0 for all ω, because the amplitudes of the low-frequency
spectrum are set to 0 and for large values of ω it holds that F_{n=1}(ω) ≈ 0
anyway.
Let F^{hp}_{ystat}(ω) be the high-pass filtered Fourier image of the
stationary component; then it holds that F^{hp}_{ynonstat}(ω) =
F^{hp}_{ystat}(ω) + F^{hp}_{n=1}(ω) ≈ F^{hp}_{ystat}(ω). Hence, the
time-dependent component vanishes by high-pass filtering, and high-pass
filtering is therefore an admissible approach to diminish the effect of at
on y^{nonstat}_t.
The argumentation for 2 ≤ n < 4 is analogous and only the Fourier transform
of at^n will be given:
Case 2: n = 2:
F_{n=2}(ω) = ∫_0^∞ at^2 e^{−2πitω} dt = −ia/(4π^3 ω^3) [13]. Thus
A_{n=2}(ω) = a/(4π^3 ω^3) decreases strongly with growing ω.
Case 3: n = 3:
F_{n=3}(ω) = ∫_0^∞ at^3 e^{−2πitω} dt = −3a/(8π^4 ω^4) [13]. Again, the
greater ω, the smaller A_{n=3}(ω). Thus the additive trend at^3 predominantly
affects the low-frequency spectrum of y^{nonstat}_t, which is cut off by
high-pass filtering. These considerations can probably be extended to all n,
which is left as an exercise to the reader (the proof involves solving
∫_0^∞ at^n e^{−2πitω} dt).
Figure 12: High-pass filtered target with non-stationary mean.
These theoretical considerations are backed up by investigating the plot
of a high-pass filtered target with non-stationary mean. Figure 12 shows
the plot of the high-pass filtered target with non-stationary mean defined in
2.2.1. The linear upward trend seems to have vanished.
Let us now consider what happens if a target with a multiplicative time
dependency of the form at^n is high-pass filtered. Multiplication in the
time domain is equal to convolution in the frequency domain [14], and this
poses a direct problem: although the filtered time-dependent component is
approximately 0 in the frequency domain (F^{hp}_n(ω) ≈ 0 for all ω), the
convolution takes place before the filter is applied, so the trend's
low-frequency energy is smeared across the whole spectrum:
F^{hp}_{ynonstat}(τ) = (F_{ystat} ∗ F_n)^{hp}(τ) = (∫_{−∞}^{∞} F_{ystat}(ω) F_n(τ − ω) dω)^{hp} ≠ F^{hp}_{ystat}(τ). Thus the time-dependent component
still has an influence and the detrending fails. These considerations can
again be backed up by the plot of the high-pass filtered target with
non-stationary variance defined in 2.2.2.
Figure 13: High-pass filtered target with non-stationary variance.
It is clearly visible that the variance of the high-pass filtered target with
non-stationary variance still seems to increase over time and that detrending
has failed. These findings are in accordance with the earlier theoretical
considerations.
4.6.2 Characteristics of the high-pass filter at hand
The high-pass filter employed for detrending in this thesis was created using
the fdatool included in MATLAB. A causal equiripple finite impulse response
filter with cutoff and pass frequencies of 10% and 26% of the sampling
frequency respectively was chosen. fdatool is used to compute the filter
coefficients h_i. The filtered signal is equal to the discrete convolution of
the time series with the filter coefficients: y^{filtered}_t = Σ_{i=0}^{n} h_i y_{t−i}. Note that the high-pass filtering takes place in the time domain and
that the cutoff and pass frequencies are parameters, in this case chosen by
hand. Thus high-pass filtering is neither parameter- nor assumption-free: we
assume an additive trend of some form, and the cutoff and pass frequencies
have to be chosen in accordance with the steepness of the additive trend.
The cutoff frequency ω_cutoff should be chosen in such a way that
F_n(ω) ≈ 0 for all ω > ω_cutoff. A rule of thumb seems to be: the steeper
the trend, the lower the cutoff frequency. This relation can be verified by
inspecting the definitions of the frequency spectra of the trend components.
The order of the resulting filter (the number of filter coefficients) is 35.
Thus after filtering, the target to predict incorporates information of the
last 35 time steps. On top of that, since the artificial tasks are derivations
of the NARMA10 task, they incorporate information of the last 10 inputs;
thus, in order for the system to be able to predict the target correctly,
information on the last 45 inputs must be available in the reservoir.
Experimental results suggest that this exceeds the memory capacity of the
reservoir. In order to overcome this problem, the input as well as the target
are high-pass filtered. Doing so significantly improved the performance
because, by high-pass filtering the input series, information of the last 35
inputs is made present in the current input.
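The time-domain filtering itself is a plain discrete convolution. The 35th-order equiripple coefficients from fdatool are not reproduced in this thesis; the short kernel h below is a hypothetical stand-in (its coefficients sum to 0, so constants and slow trends are suppressed) used only to illustrate y^{filtered}_t = Σ_i h_i y_{t−i}:

```python
import numpy as np

# Hypothetical high-pass kernel; NOT the fdatool filter used in the thesis.
h = np.array([-0.25, 0.5, -0.25])

rng = np.random.default_rng(3)
y = rng.normal(size=200) + 0.05 * np.arange(200)   # noise plus linear trend

# Causal FIR filtering in the time domain: y_filtered[t] = sum_i h[i] * y[t-i]
y_filtered = np.convolve(y, h, mode="full")[: len(y)]
```

Because the kernel coefficients sum to zero, the linear trend is annihilated and only filtered noise remains.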
4.6.3 Reconstructing the time series
After the high-pass filtered target is predicted, the corresponding time series
(or at least the information whether the time series will rise) has to be
reconstructed from that prediction. For this purpose, the filter operations
are inverted by solving y^{filtered}_t = Σ_{i=0}^{n} h_i y_{t−i} for y_t, which
yields y_t = (y^{filtered}_t − Σ_{i=1}^{n} h_i y_{t−i}) / h_0, with
y^{filtered}_t being the prediction of the filtered target. In order to avoid
accumulating errors, the actual past values y_{t−i} (which are available at
time t) are used instead of the reconstructed predictions.
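The inversion can be sketched as follows, again with a hypothetical 3-tap kernel in place of the thesis's 35th-order filter. Using the true past values y_{t−i}, the reconstruction is exact when the prediction of the filtered target is error-free:

```python
import numpy as np

h = np.array([-0.25, 0.5, -0.25])     # hypothetical kernel, see 4.6.2
rng = np.random.default_rng(4)
y = rng.normal(size=50)
y_filtered = np.convolve(y, h, mode="full")[: len(y)]

def reconstruct(y_filtered_t, past, h):
    """Solve y_filtered_t = sum_i h[i] * y[t-i] for y[t].

    past[0] = y[t-1], past[1] = y[t-2], ... are the true past values,
    so reconstruction errors do not accumulate over time."""
    return (y_filtered_t - np.dot(h[1:], past)) / h[0]

t = 10
y_t = reconstruct(y_filtered[t], y[t - 1 : t - len(h) : -1], h)
```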
4.6.4 Results - high-pass filtering
In order to make the correlation and NRMSE comparable, table 5 shows
the performance measures after reconstructing the corresponding time series
from the high-pass filtered prediction.
ns mean ns variance ns mean & var google
cor 0.05 (-0.72) -0.03 (-0.7) -0.027 (-0.58) 0.358 (-0.64)
NRMSE 31.97 (+28.24) 83.1 (+81.41) 84.76 (+82.5) 3.32 (+3.25)
apg 0.005 (+0.005) -0.004 (-0.119) -0.0034 (-0.02) 0.011 (-0.16)
rapg 0.062 (+0.062) -0.027 (-0.79) -0.02 (-0.14) 0.023(-0.35)
Table 5: Comparison of performance of the system when the target was high-
pass filtered and reconstructed afterwards. The numbers in brackets denote
the difference in performance to no detrending after the corresponding time
series was reconstructed.
Predicting the high-pass filtered target led to horrible performance on all
tasks. What is peculiar is that for all artificial tasks the NRMSE exploded
and the correlation dropped to approximately 0. The performance for predicting
the economic time series is better in comparison to the artificial tasks, but
still bad in comparison to the other detrending techniques. How can these
results be interpreted? Why did this detrending technique fail so badly? Let
us have a look at the performance of the system before the actual time series
has been reconstructed, displayed in table 6.
ns mean ns variance ns mean & var google
cor 0.925 0.438 0.45 0.658
NRMSE 0.38 0.908 0.93 0.77
Table 6: Performance when predicting the high-pass filtered target before
the actual time series is reconstructed by inverting the filter operations.
The performance when predicting the filtered target, before reconstructing
the actual time series, seems to be a lot better. Especially predicting the
filtered target with non-stationary mean seems to work well. This is in
accordance with the earlier theoretical considerations, namely that an
additive trend of the form at^n can be abolished by high-pass filtering.
The reason for the bad performance seems to lie not in predicting the
filtered target but in reconstructing the actual time series. When
reconstructing the time series, the error is scaled and thus amplified.
Consider that the prediction of the target equals the desired target plus
some error ε: ŷ^{filtered}_t = y^{filtered}_t + ε. If we now recall how the
actual time series was reconstructed and plug in the definition of
ŷ^{filtered}_t, it is easy to see that the error is amplified:
y_t = ((y^{filtered}_t + ε) − Σ_{i=1}^{n} h_i y_{t−i}) / h_0 =
y^{filtered}_t / h_0 + ε / h_0 − Σ_{i=1}^{n} h_i y_{t−i} / h_0. The only
source of error here is ε, and for the high-pass filter used it holds that
h_0 = −0.0084; thus the error is amplified by a factor of approximately 120.
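The amplification factor is just 1/|h_0|. A two-line check with the reported coefficient h_0 = −0.0084 confirms the magnitude (the value of ε is arbitrary; the factor does not depend on it):

```python
# Any error eps on the filtered prediction is divided by h0 during reconstruction.
h0 = -0.0084                     # first coefficient of the filter at hand
eps = 0.001                      # some small prediction error
amplified = eps / h0
factor = abs(amplified) / eps    # = 1/|h0|, roughly 120
```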
The question of how to reconstruct the actual time series, or how to extract
the information whether or not the time series will increase, is a fundamental
problem when predicting a high-pass filtered target. A straightforward idea
would be to also predict the residuals of the filtered time series:
y^{res}_t = y_t − y^{filtered}_t. This idea is flawed because the residuals
contain the time-dependent component of the time series and can therefore, of
course, not be predicted by the PHOCUS LSM. Another straightforward approach
would be to create a separate model of the residuals. If we, for example,
assume a linear additive trend, y^{res}_t = c·t + b + ε, it would be possible
to estimate the parameters c and b by linear regression in the evaluation
phase. In theory this seems to be a sound approach, but one cannot expect the
residuals to contain solely the time-dependent component. Figure 14 shows a
plot of the residuals of the high-pass filtered artificial task with
non-stationary mean. One can see that a linear model for the residuals is
very coarse and would not lead to satisfying results.
Figure 14: Residuals of a high-pass filtered target with a linear additive
trend.
5 Conclusion
5.1 Summary
This thesis investigated three kinds of detrending techniques, namely
bipolarizing, differencing and high-pass filtering. None of the investigated
detrending techniques is assumption- and parameter-free. Bipolarizing can be
considered assumption- and parameter-free if means to obtain the decision
threshold are available (e.g. reinforcement learning). Still, this thesis was
not able to find a detrending technique which, in conjunction with a PHOCUS
LSM, gives good results in every scenario. There seems to be no fire-and-forget
detrending technique that can handle non-stationary mean or variance or
both. The type of non-stationarity has to be identified in order to choose a
suitable detrending technique.
Because a detrending technique has to be chosen depending on the type of
non-stationarity at hand, the following paragraph summarizes the findings of
this thesis and gives a recommendation for each type of non-stationarity. The
following considerations are based on the assumption that the incentive in
modelling is ultimately betting on the target, i.e. that the rapg is the most
informative performance measure. If one encounters a target of which one
can safely assume that it exhibits a linear or quadratic additive trend, first
and second differencing respectively seem to be the most promising detrending
techniques (experimental rapg of 0.73 [linear additive trend]). If the time
dependence is multiplicative and the mean is constant, it is recommended not
to detrend at all: the performance without detrending is already good
(experimental rapg of 0.77) and detrending seems only to do harm. However,
if a time series exhibits a multiplicative as well as an additive
time-dependent component, differencing is recommended to at least get rid of
the additive sub-component (experimental rapg of 0.36).
5.2 On the natural time series
None of the detrending techniques was able to improve the profit that betting
on the prediction of the natural time series would have generated. Considering
that detrending improved the performance for the artificial tasks
(non-stationary mean and mean & variance), or that the performance was
already quite good (non-stationary variance), the assumption that detrending
caused the unsatisfying performance is questionable. Of course, the type of
time dependence could also be of some non-linear form which was not considered
in this thesis. But, on top of that, the extraordinary behavior when
predicting the natural time series, namely that the input series seems to be
recreated (i.e. the actual output is one time step behind the desired), is
still unexplained. One core statistical model for time series is the
autoregressive model (AR): x_t = c + Σ_{i=1}^{p} φ_i x_{t−i} + ε_t [2]. The
φ_i are called model parameters, p is the order of the AR model, the ε_t are
white noise (mean 0, standard deviation σ_ε) and c is a constant. Thus, an
autoregressive model can be seen as a filter.
In 3.1.3, we have learned that reservoir computers are in principle capable of
approximating every time-invariant filter with fading memory. Consider now
an AR(1) (order 1) process y_t = φ y_{t−1} + ε_t and imagine an LSM which
perfectly predicts this process; the LSM then predicts the expectation of y_t
with perfect knowledge of φ: E[y_t] = E[φ y_{t−1} + ε_t] = E[φ y_{t−1}] + E[ε_t] = φ y_{t−1}, because E[ε_t] = 0 (white noise). The perfect prediction
of y_t is a scaled version of y_{t−1}! If the natural time series follows an
AR(1) process, then its perfect prediction is a scaled version of the current
value, i.e. the perfect prediction seems to reconstruct the input series.
There is not enough structure in the time series for it to be properly
predicted, i.e. the ratio of ε_t to y_t is not favorable enough. This theory
explains why the prediction seems to lag behind the actual values and why
detrending did not bring about the desired effects.
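This explanation can be reproduced with a simulated AR(1) process (a minimal sketch; φ = 0.95 and unit-variance Gaussian noise are arbitrary choices). The optimal one-step prediction φ·y_{t−1} is perfectly correlated with the previous value, so the "best" prediction is literally the scaled input shifted by one step, and its residual is exactly the unpredictable white noise:

```python
import numpy as np

rng = np.random.default_rng(5)
phi, n = 0.95, 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()   # AR(1): y_t = phi*y_{t-1} + eps_t

# The best possible one-step prediction is E[y_t | y_{t-1}] = phi * y_{t-1}.
prediction = phi * y[:-1]                  # predicts y[1:]

# The prediction is just a scaled copy of the lagged series; its residual
# is the white noise eps_t, which no model can predict.
residuals = y[1:] - prediction
```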
A Normalized root mean square error
Let y and ŷ be vectors of length n and let σ^2_y be the variance of y; then
the normalized root mean square error (NRMSE) is defined as:
NRMSE(y, ŷ) = sqrt( Σ_{i=1}^{n} (y_i − ŷ_i)^2 / (n σ^2_y) )
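As a sketch in code (note that a prediction equal to the target's mean yields an NRMSE of exactly 1, which is why values well below 1 indicate a useful model):

```python
import numpy as np

def nrmse(y, y_hat):
    """Root mean square error, normalized by the variance of the target y."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2) / np.var(y)))

# Predicting the target perfectly gives 0; predicting its mean gives 1.
```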
B On NRMSE and correlation
The fact that the NRMSE and correlation are not always very informative in
a stock exchange scenario is best explained by an example. Imagine a time
series with a linear upward trend where the first 50% of the values are below
its mean and the last 50% above. Given a prediction, the correlation between
the actual time series and its prediction is a measure of their mutual
oscillation around their respective means. Imagine both time series, the
actual target and its prediction, share their mean, and the stock decreases
from time point t to t+1. If the model predicts that the time series will
increase by a very tiny amount, the correlation might still be high if the
value of the prediction for t+1 is still below the mean, although betting on
that prediction would generate a loss.
At the same time, the NRMSE might also not be very informative, since it
basically is a normalized measure of the distance between two time series.
Imagine two distinct predictions and a time series which slightly decreases
from time t to t+1 by ε. Prediction 1 might correctly predict a decrease of
the time series but dramatically overestimate it, whereas prediction 2 might
predict a slight increase. The NRMSE of prediction 2 might be smaller than
the NRMSE of prediction 1, although betting on the dramatically overestimated
decrease would still generate a profit whereas betting on the predicted
slight increase would generate a loss.
C Signum function
sgn(x) := +1 if x > 0; 0 if x = 0; −1 if x < 0. [15]
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] T. Mills, N. Markellos, The Econometric Modelling of Financial Time Series, Cambridge University Press, 2008.
[3] Towards a PHOtonic liquid state machine based on delay-CoUpled Systems, Deliverable D4, 2010.
[4] T. M. Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Math, 1997.
[5] M. D. Mauk, D. V. Buonomano, The neural basis of temporal processing, Annu. Rev. Neurosci. 27, 2004.
[6] T. M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Transactions on Electronic Computers EC-14, 1965.
[7] D. Buonomano, W. Maass, State-dependent computations: Spatiotemporal processing in cortical networks, Nature Reviews Neuroscience, Volume 10, 2009.
[8] A. Pankratz, Forecasting With Univariate Box-Jenkins Models, John Wiley & Sons, 1983.
[9] S. H. Strogatz, Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering, Westview Press, 1994.
[10] A. F. Atiya, A. G. Parlos, New results on recurrent network training: Unifying the algorithms and accelerating convergence, IEEE Trans. Neural Networks, vol. 11, 2000.
[11] W. Maass, H. Markram, On the Computational Power of Circuits of Spiking Neurons, 2004.
[12] http://www.online-broker-vergleich.de/vergleich.php, 14.12.2011, 15:00.
[13] http://www.wolframalpha.com/, 15.12.2011, 15:00.
[14] http://en.wikipedia.org/wiki/Fourier_transform, 15.12.2011, 15:00.
[15] http://de.wikipedia.org/wiki/Signum_(Mathematik), 15.12.2011, 15:00.
[16] H. Jaeger, The "echo state" approach to analysing and training recurrent neural networks, GMD Report 148, German National Research Center for Information Technology, 2001.
[17] W. Maass, T. Natschläger, H. Markram, A model for real-time computation in generic neural microcircuits, Proc. of NIPS 2002, Advances in Neural Information Processing Systems, MIT Press, 2003.
Hereby I confirm that I wrote this thesis independently and that I have
not made use of any other resources or means than those indicated.
Hiermit bestätige ich, dass ich die vorliegende Arbeit selbständig verfasst
und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet
habe.
(Ort, Datum) (Unterschrift)

More Related Content

What's hot

Quantum Variables in Finance and Neuroscience Lecture Slides
Quantum Variables in Finance and Neuroscience Lecture SlidesQuantum Variables in Finance and Neuroscience Lecture Slides
Quantum Variables in Finance and Neuroscience Lecture SlidesLester Ingber
 
control adaptive and nonlinear
control adaptive and nonlinear control adaptive and nonlinear
control adaptive and nonlinear Moufid Bouhentala
 
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggRohit Bapat
 
1508.03256
1508.032561508.03256
1508.03256articol
 
Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew Hair
 
Vasicek Model Project
Vasicek Model ProjectVasicek Model Project
Vasicek Model ProjectCedric Melhy
 
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...SSA KPI
 

What's hot (12)

Quantum Variables in Finance and Neuroscience Lecture Slides
Quantum Variables in Finance and Neuroscience Lecture SlidesQuantum Variables in Finance and Neuroscience Lecture Slides
Quantum Variables in Finance and Neuroscience Lecture Slides
 
Thesis
ThesisThesis
Thesis
 
Neural networks and deep learning
Neural networks and deep learningNeural networks and deep learning
Neural networks and deep learning
 
control adaptive and nonlinear
control adaptive and nonlinear control adaptive and nonlinear
control adaptive and nonlinear
 
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zinggFundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
Fundamentals of computational_fluid_dynamics_-_h._lomax__t._pulliam__d._zingg
 
1508.03256
1508.032561508.03256
1508.03256
 
thesis
thesisthesis
thesis
 
t
tt
t
 
tamuthesis
tamuthesistamuthesis
tamuthesis
 
Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3Andrew_Hair_Assignment_3
Andrew_Hair_Assignment_3
 
Vasicek Model Project
Vasicek Model ProjectVasicek Model Project
Vasicek Model Project
 
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...
All Minimal and Maximal Open Single Machine Scheduling Problems Are Polynomia...
 

Viewers also liked

Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...
Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...
Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...Pratheeban Rajendran
 
Prediction of stock market index using genetic algorithm
Prediction of stock market index using genetic algorithmPrediction of stock market index using genetic algorithm
Prediction of stock market index using genetic algorithmAlexander Decker
 
Online algorithms and their applications
Online algorithms and their applicationsOnline algorithms and their applications
Online algorithms and their applicationsVikas Jindal
 
jurnal tetang contoh web restoran
jurnal tetang contoh web restoranjurnal tetang contoh web restoran
jurnal tetang contoh web restorannozaladijunior
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisOktay Bahceci
 
My Thesis about Internet of Things
My Thesis about Internet of ThingsMy Thesis about Internet of Things
My Thesis about Internet of ThingsNata Nael
 
handbook sistem informasi manajemen proses bisnis
handbook sistem informasi manajemen proses bisnishandbook sistem informasi manajemen proses bisnis
handbook sistem informasi manajemen proses bisnisAgung Apriyadi
 
1 pengantar-proses-bisnis
1 pengantar-proses-bisnis1 pengantar-proses-bisnis
1 pengantar-proses-bisnisWira Yasa
 
Sistem pendukung keputusan
Sistem pendukung keputusanSistem pendukung keputusan
Sistem pendukung keputusanWisnu Dewobroto
 
Online content popularity prediction
Online content popularity predictionOnline content popularity prediction
Online content popularity predictionData Science Warsaw
 
B2B price prediction through crowd sourcing
B2B price prediction through crowd sourcingB2B price prediction through crowd sourcing
B2B price prediction through crowd sourcingEdwin Vlems
 
Calonbukuansi 091201184024-phpapp01
Calonbukuansi 091201184024-phpapp01Calonbukuansi 091201184024-phpapp01
Calonbukuansi 091201184024-phpapp01Fajar Baskoro
 
2015 Holiday Shopping Prediction
2015 Holiday Shopping Prediction2015 Holiday Shopping Prediction
2015 Holiday Shopping PredictionAdobe
 
Analisis pada e-commerce dan website Tokopedia.com
Analisis pada e-commerce dan website Tokopedia.comAnalisis pada e-commerce dan website Tokopedia.com
Analisis pada e-commerce dan website Tokopedia.comCllszhr
 
Jurnal wahyu-nurjaya
Jurnal wahyu-nurjayaJurnal wahyu-nurjaya
Jurnal wahyu-nurjayaFajar Baskoro
 
Perencanaan sistem informasi
Perencanaan sistem informasiPerencanaan sistem informasi
Perencanaan sistem informasiKus Naeni
 
Makalah perancangan web (website 5 k lapak)
Makalah perancangan web (website 5 k lapak) Makalah perancangan web (website 5 k lapak)
Makalah perancangan web (website 5 k lapak) Roni Darmanto
 

Viewers also liked (18)

Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...
Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...
Predicting and Optimizing the End Price of an Online Auction using Genetic-Fu...
 
Prediction of stock market index using genetic algorithm
Prediction of stock market index using genetic algorithmPrediction of stock market index using genetic algorithm
Prediction of stock market index using genetic algorithm
 
Online algorithms and their applications
Online algorithms and their applicationsOnline algorithms and their applications
Online algorithms and their applications
 
jurnal tetang contoh web restoran
jurnal tetang contoh web restoranjurnal tetang contoh web restoran
jurnal tetang contoh web restoran
 
Stock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_AnalysisStock_Market_Prediction_using_Social_Media_Analysis
Stock_Market_Prediction_using_Social_Media_Analysis
 
Konsep Rekayasa Perangakat Lunak
Konsep Rekayasa Perangakat LunakKonsep Rekayasa Perangakat Lunak
Konsep Rekayasa Perangakat Lunak
 
4.1.3 Results - no detrending - non-stationarity in mean and variance . . 26
4.1.4 Results - no detrending - natural economic time series . . . . 26
4.1.5 Summary results - no detrending . . . . . . . . . . . . . . . 28
4.2 On the expressiveness of the results . . . . . . . . . . . . . . . 28
4.3 Bipolarized target . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.1 Results - bipolarized target . . . . . . . . . . . . . . . . . 30
4.4 Differencing . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4.1 Results - differencing . . . . . . . . . . . . . . . . . . . . 32
4.4.2 Implicit assumptions when differencing . . . . . . . . . . . . 33
4.5 Log differencing . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5.1 Results - log-differencing . . . . . . . . . . . . . . . . . . 34
4.6 High-pass filter . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.6.1 What kind of non-stationarities are induced by low-frequency oscillations? . . 35
4.6.2 Characteristics of the high-pass filter at hand . . . . . . . 37
4.6.3 Reconstructing the time series . . . . . . . . . . . . . . . . 38
4.6.4 Results - high-pass filtering . . . . . . . . . . . . . . . . 38

5 Conclusion 40
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.2 On the natural time series . . . . . . . . . . . . . . . . . . . . 41

A Normalized root mean square error 43
B On NRMSE and correlation 43
C Signum function 43
1 Introduction

1.1 Motivation

Predicting time series has always been of great interest. There seems to be an abundance of scenarios in which knowledge of future values of a time series would be very desirable. In addition to finance, where the incentive to predict time series is obvious, the social sciences, ecology and meteorology, to name a few, provide further examples in which tools for predicting future values of time series are desired. Statistical models, such as autoregressive or moving average models, are typically employed for this task.

Reservoir computing, a novel recurrent neural computation framework, is a promising alternative to statistical models for a number of reasons. Classical statistical models (i.e. autoregressive or moving average models) can only capture linear relations between past and future values, whereas in reservoir computing non-linear relations can be learned by linear training methods. To be more precise, when predicting out of the auto-structure, computations with reservoirs allow the discovery of linear relations between future values and a pool of memory-dependent non-linear transformations of past values. But, as we will see in more detail later, a necessary condition for a time series to be predicted by means of reservoir computing is stationarity. This thesis will empirically investigate different techniques for transforming non-stationary time series into stationary time series, and their impact on the learnability with Mackey-Glass reservoirs.

1.2 Stock price as a time series

There is a natural incentive in wanting to predict stock prices and, more importantly for this thesis, they often exhibit non-stationarity. Because of this, they will serve, in addition to three artificial tasks which will be defined later, as a basis of this empirical investigation. Furthermore, in order for the prediction of a future value of a time series to be useful, it is often sufficient to make short-term predictions.
For example, knowledge of the stock price one minute ahead would be enough to make significant profits. Mathematically speaking, when predicting stock prices, all information at time t and prior to t can be used to predict the stock price at time point t + 1, and what counts is often not the perfect prediction of the value; predicting the general direction (whether the price will increase or decrease) is often sufficient. As we cannot simply predict the stock value because of the trend, a detrended transformation of the stock value must be predicted, but the real stock price must be inferable from the detrended prediction. This characteristic, which
will also be imposed on the artificial tasks, poses a direct requirement for the detrending techniques, namely that the prediction of the real value at time t + 1 (or at least a prediction of the direction of the real value) must be reconstructable solely from information available at time t. In other words, there must exist a function f^{-1}, which only uses information available at time t, such that f^{-1}(f(y_t)) = y_t, if f is our detrending function and y is the time series to predict.

Because of the property that predicting the exact value is not necessary, typical performance measures for evaluating the prediction, such as the normalized root mean square error (NRMSE) (see appendix A) or correlation, do not always make very much sense. It is easy to construct scenarios in which a prediction with a lower correlation or higher NRMSE would generate more profit than another prediction with greater correlation and lower NRMSE if one were to place bets on them in a stock exchange setting (see appendix B for a more thorough explanation). In order to capture this property, a performance measure is introduced which is closely related to the averaged gain (or loss) that betting on the prediction would have generated in a stock exchange scenario in which one stock is exchanged in each time step. Note that all transaction and similar fees are neglected. Let sgn(x) be the signum function (see appendix C), and let y_t and \hat{y}_t be the time series and its prediction respectively. We define the average potential gain apg as

\mathrm{apg}(y_t, \hat{y}_t) = \frac{1}{T-1} \sum_{t=1}^{T-1} (y_{t+1} - y_t)\,\mathrm{sgn}(\hat{y}_{t+1} - \hat{y}_t),

where apg values below 0 denote a potential loss and values above 0 a potential gain. sgn(\hat{y}_{t+1} - \hat{y}_t) can in principle be replaced by anything that conveys the information whether the model predicts that the time series will increase or decrease.
apg(y_t, \hat{y}_t) is possibly unbounded, and more volatile time series can produce higher apg values, whereas the perfect prediction of a constant time series will only yield an apg of 0. In order to overcome this problem, we define

\mathrm{rapg}(y_t, \hat{y}_t) = \frac{\mathrm{apg}(y_t, \hat{y}_t)}{\mathrm{apg}_{\max}(y_t)} \quad \text{with} \quad \mathrm{apg}_{\max}(y_t) = \frac{1}{T-1} \sum_{t=1}^{T-1} (y_{t+1} - y_t)\,\mathrm{sgn}(y_{t+1} - y_t).

rapg(y_t, \hat{y}_t) is bounded by -1 and 1, and it can in principle be used to compare model performance across different time series. A more fine-grained model of the stock exchange scenario would allow buying in principle any amount of stocks; the potential gain in every time step would then be the quotient instead of the difference of the time series values at successive time steps. The problem with that approach is that it becomes nonsensical for negative values, and the artificial tasks may contain negative values.
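The two measures can be sketched in a few lines of code (a sketch in Python, not taken from the thesis; the array-based formulation and names are ours):

```python
import numpy as np

def apg(y, y_hat):
    """Average potential gain: mean profit per step when betting one
    stock on the predicted direction (transaction fees neglected)."""
    dy = np.diff(y)                 # y_{t+1} - y_t
    dy_hat = np.diff(y_hat)         # predicted change
    return np.mean(dy * np.sign(dy_hat))

def rapg(y, y_hat):
    """Relative apg, normalized by the gain of a perfect direction
    predictor; bounded by -1 and 1."""
    dy = np.diff(y)
    apg_max = np.mean(np.abs(dy))   # (y_{t+1}-y_t) sgn(y_{t+1}-y_t) = |y_{t+1}-y_t|
    return apg(y, y_hat) / apg_max
```

Note that apg_max simplifies to the mean absolute step size, so rapg = 1 corresponds to predicting every direction correctly and rapg = -1 to predicting every direction wrongly.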
1.3 Overview

The goal of this thesis is to investigate different detrending techniques and their impact on the learnability with a special type of reservoir computer. As this investigation is empirical, the performance resulting from detrending and predicting has to be measured on specific tasks. As stated above, four different tasks will be employed, namely three artificial tasks where the respective type of non-stationarity is known, and one "natural" task: predicting future values of a stock. The next chapter deals with the question of what stationarity means. Furthermore, different types of non-stationarity are analyzed, and example processes which exhibit these types of non-stationarity are given, which will later be used to evaluate the different detrending techniques. In chapter 3, the questions of what reservoir computing is and how it works are addressed. A special type of reservoir was employed in this thesis, in which spatial multiplexing is substituted by temporal multiplexing; this allows the computations to be carried out by a single node which in turn can be simulated by a laser. The implications of this technique are also addressed in chapter 3. The fourth chapter is concerned with different detrending mechanisms, and their impact on the performance of the model is evaluated there. The chapter can roughly be divided into three parts, namely bipolarization, differencing and high-pass filtering. The performances of the different detrending techniques are also discussed in chapter 4. The fifth and last chapter summarizes the findings of this thesis and gives a conclusion.
2 Stationarity in time series

2.1 What is stationarity?

Based on [2], in order to understand the concept of stationarity, time series, which are typically understood merely as sequences of data points at a fixed temporal interval, need to be viewed from a different perspective. For the notion of stationarity, it is useful to view a time series as a realization of a stochastic process. From this perspective, a time series is a sequence of (dependent) random variables; thus a time series of length T can be seen as a T-dimensional probability distribution, denoted by \{X_t\}_1^T. Hence, every data point x_t can be seen as a sample of X_t, and every random variable X_t is associated with a mean \mu_t and a variance \sigma_t^2, which are typically unknown because they cannot be inferred from a single realization.

In this sense, a time series is said to be strictly stationary iff the joint probability distribution of any set of time points t_1, t_2, ..., t_m is constant over time, thus

\forall k: P(X_{t_1} = x_1, \ldots, X_{t_m} = x_m) = P(X_{t_1+k} = x_1, \ldots, X_{t_m+k} = x_m).

Weak (or wide-sense) stationarity is a special case of strong stationarity. A time series is said to be weakly stationary iff the joint probability distribution of any 2-element set of time points t_m, t_n is constant over time, thus

\forall k: P(X_n = x_n, X_m = x_m) = P(X_{n+k} = x_n, X_{m+k} = x_m).

Self-evidently, strong stationarity implies weak stationarity, and weak stationarity implies in turn that the mean and variance of the time series are constant over time and that the covariance depends only on the shift in time k, thus Cov(X_t, X_{t+k}) = \gamma_k = constant. Note that by assuming weak stationarity, the mean, variance and covariance function may be estimated from the realization, since we assume that they are constant over time and thereby stem from the same probability distribution. As we will see later, a necessary condition for a time series to be modelled by means of reservoir computing is weak stationarity.
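A simple empirical check of these constancy conditions, in the spirit of the binned statistics plotted in the figures of this chapter (bins of 500 time steps), can be sketched as follows (Python; the bin size is the thesis's choice, everything else is our own illustration):

```python
import numpy as np

def binned_stats(y, bin_size=500):
    """Split the series into consecutive bins and return the
    per-bin means and variances; for a weakly stationary series
    both should fluctuate around constant values."""
    n_bins = len(y) // bin_size
    bins = np.asarray(y)[:n_bins * bin_size].reshape(n_bins, bin_size)
    return bins.mean(axis=1), bins.var(axis=1)

# Example: i.i.d. uniform noise on [0, 0.5] is weakly stationary,
# so per-bin means stay near 0.25 and per-bin variances near 0.5**2/12.
rng = np.random.default_rng(0)
means, variances = binned_stats(rng.uniform(0.0, 0.5, 7000))
```

This is only a diagnostic, not a test with statistical guarantees; it mirrors the visual inspection used in Figures 1-4.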
2.2 Different types of non-stationarity

Based on [8], the two most common types of non-stationarity are non-stationary mean and non-stationary variance. Economic time series which are non-stationary in variance are often also non-stationary in mean. In the next part, three different tasks are defined which specifically exhibit non-stationarity in mean, non-stationarity in variance, and non-stationarity in mean and variance. All tasks are derivations of the NARMA10 (Non-linear Autoregressive Moving Average of order 10) task, which was introduced in [10] and has become a benchmark for reservoir computing tasks. Diverging from
predicting time series out of the auto-structure, which basically means that the input of the system and the target are the same time series shifted by one time step, in the NARMA10 task the input u is a series of random numbers drawn from a uniform distribution on the interval [0, 0.5] and the target is defined by the recursive function

y_{t+1} = 0.3 y_t + 0.05 y_t \left( \sum_{i=0}^{9} y_{t-i} \right) + 1.5 u_t u_{t-9},

where u_t is the input at time t.

2.2.1 Non-stationarity in mean - NARMA10

Figure 1: Instance of NARMA10 with non-stationary mean. The blue graph depicts the NARMA10 target after a linear trend is added. The red graph depicts the variance of bins of 500 time steps. The green line denotes the mean of the entire target. One can easily see that the target values in the first part are significantly smaller than the values in the second part, which leads to the conclusion that the mean does not remain constant over time.

In order to induce non-stationarity in mean, a linear trend is added to
the existing task. The inputs of the system remain random numbers from a uniform distribution on the interval [0, 0.5], but the target values are altered:

y^1_t = y_t + 0.0001 t,

where t \in 1..7000 denotes the position in the target. Note that the linear trend is added to an existing NARMA10 target and is not propagated further by recursion. See Figure 1 for a plot of a realization of such a new target with non-stationary mean. Visual inspection makes apparent that the mean of the target increases over time and thus is not constant: the target values of the first part are often below the overall mean, whereas target values of the second part are often above it. The variance is still stationary and is plotted in bins of 500 time steps each. The fact that the variances of different bins fluctuate is explainable by the stochastic nature of the process, and the variance is still deemed constant.

2.2.2 Non-stationarity in variance - NARMA10

To induce non-stationarity in variance, two alterations to the existing NARMA10 task have to be carried out. First, the mean of the target has to be shifted to 0 to make it invulnerable to the later alterations, which can be done by altering the interval of the uniform random inputs u to [-0.5, 0.5]. Second, in order to induce a time dependence of the variance, the target values are multiplied by a time-dependent term:

y^2_t = y_t (0.2 + t/7000),

where t \in 1..7000 again denotes the position in the target. Again, the alterations are carried out on an existing NARMA10 target, and the alterations to a single target value are not propagated to other values by recursion! Figure 2 shows the plot of a realization of such a new target. The red graph denotes the variance of bins of 500 time steps. One can easily see that the variance increases over time, thus it is not constant. The mean still seems to be constant over time.
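The two altered targets can be generated along these lines (a sketch in Python; the thesis does not publish its generation code, so the initialization, random seed and function names are our own choices):

```python
import numpy as np

def narma10(u):
    """Generate a NARMA10 target from the input series u
    (the recursion from Section 2.2; first values initialized to 0)."""
    T = len(u)
    y = np.zeros(T)
    for t in range(9, T - 1):
        y[t + 1] = (0.3 * y[t]
                    + 0.05 * y[t] * np.sum(y[t - 9:t + 1])
                    + 1.5 * u[t] * u[t - 9])
    return y

rng = np.random.default_rng(1)
T = 7000
t = np.arange(1, T + 1)

# Non-stationarity in mean: linear trend added to an existing target
# (not propagated through the recursion).
y = narma10(rng.uniform(0.0, 0.5, T))
y1 = y + 0.0001 * t

# Non-stationarity in variance: zero-mean inputs, time-dependent scaling.
y0 = narma10(rng.uniform(-0.5, 0.5, T))
y2 = y0 * (0.2 + t / T)
```

As in the text, the trend and the scaling are applied to the finished target, so the NARMA10 recursion itself stays unchanged.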
2.2.3 Non-stationarity in mean and variance - NARMA10

The last of the artificial benchmark tasks is to exhibit non-stationarity in mean and variance, and it is a combination of the two defined above. The target is made non-stationary in variance by the same steps as above, namely shifting the mean to 0 by altering the interval from which the inputs are drawn to [-0.5, 0.5], and multiplying the resulting target by a time-dependent term. After that, in order to induce non-stationarity in mean, a linear trend is added, thus

y^3_t = y_t (0.2 + t/7000) + 0.0001 t,

with t \in 1..7000 again denoting the position in the target. Figure 3 depicts such a new target. The variance as well as the mean increase over time, hence they are not constant.
Figure 2: NARMA10 with non-stationary variance. The target (blue graph) is altered in a way that it is no longer stationary in variance. The target was divided into 14 bins of 500 time steps each. The red graph denotes the variance of the respective bin, whereas the green line depicts the overall mean.

2.2.4 Natural economic time series - Google stock price

As already stated above, in order to evaluate the performance of the different detrending techniques, one natural economic time series will be modelled. The task is to predict the minutely closing values of the Google stock from 1st June 2011 19:10 until 29th June 2011 19:59, in total comprising 7848 data points. Figure 4 depicts the first 7000 data points of the time series. The fact that the mean is not constant is apparent, since the first half of the time series is clearly above the mean, whereas the second half is below it. The variance also seems to vary over time, since it fluctuates heavily.
Figure 3: NARMA10 target (blue graph) which exhibits non-stationarity in mean and variance. The overall mean is visualized by the green line. The red graph again denotes the variance of the respective time bin. One can easily see that the mean as well as the variance are not constant.

3 Mackey Glass reservoir

3.1 What is reservoir computing?

Reservoir computing is a novel type of recurrent neural network which tries to model spatiotemporal processing in cortical networks. It emphasizes the importance of temporal structure in information. Reservoir computing itself is not an algorithm but a framework, and it subsumes different instances of reservoir computing algorithms. Echo State Networks [16] and Liquid State Machines [17] seem to be the most prominent members of the reservoir computing family. All instances of reservoir computers share that they consist of a random but fixed recurrent neural network, which is also called the reservoir. The reservoir consists of non-linear interconnected nodes that are driven or
excited by the input. The weights between those reservoir nodes are selected randomly and remain fixed. The reservoir is connected to a read-out neuron whose weights are adapted during the training phase. The read-out neuron is typically a linear node.

Figure 4: The first 7000 data points of the Google stock price in June. The variance (red graph) as well as the mean do not seem to be constant.

3.1.1 Internal states of the reservoir

Because of the recurrent nature of the reservoir, when it is excited by an input, its future state depends on the internal state prior to the input and on the input itself. To give a graphic explanation, one can make an analogy with a liquid. Imagine the surface of a liquid which was excited by dropping pebbles of different shapes and weights into it. It is covered in ripples. The ripples, their direction and speed, comprise the internal state of the reservoir. Imagine now that a new pebble (input) hits the surface of the liquid. The new internal state depends on the old state and the characteristics of the
pebble. The ripples, their direction and speed, contain information not only about the last pebble thrown into the liquid, but also fading information about past pebbles (inputs). This characteristic is called fading memory.

3.1.2 Output generation and training

The output is generated by a read-out neuron which is connected to a subset of the reservoir nodes. The read-out neuron is typically a linear node, and its weights are the only ones which are adapted during training. During training, the weights of the read-out unit are adjusted such that the error between the actual output, generated from the internal state of the reservoir and the read-out weights, and the desired output (also called teacher output or target) is minimized. Different techniques can be employed here, for example Perceptron Learning, Generalized Linear Models, or maximum margin techniques like Support Vector Machines.

3.1.3 The reservoir computing paradigm

Generally speaking, a reservoir computer takes an input stream u_t and maps it onto an output stream y_t, which can also be referred to as target stream or simply target. Such a mapping is called a filter in engineering. To be more precise, a reservoir computer can be seen as a cascade of two filters: the first, non-linear filter being the reservoir, and the second, usually linear filter being the read-out mapping. Filters can exhibit certain properties, and it has been shown that reservoir computers can in principle approximate every time-invariant filter with fading memory [11]. Classical statistical models for time series assume a relationship between a future value and past values of the time series. Relationships are mostly found in close vicinity, and values in the distant past do not influence future values most of the time [8]. This characteristic is in accordance with the fading memory property of reservoir computers.
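The scheme of a fixed random reservoir with a trained linear read-out can be made concrete with a minimal echo-state-style sketch (our own illustration, not the PHOCUS system of Section 3.2; the reservoir size, scaling constants, ridge parameter and the toy recall task are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100                                     # reservoir size
W = rng.normal(0, 1, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1: fading memory
w_in = rng.uniform(-0.1, 0.1, N)            # fixed random input weights

def run_reservoir(u):
    """Drive the reservoir and collect its internal states."""
    x = np.zeros(N)
    states = np.empty((len(u), N))
    for t, u_t in enumerate(u):
        x = np.tanh(W @ x + w_in * u_t)     # new state = f(old state, input)
        states[t] = x
    return states

# Only the linear read-out is trained, here by ridge regression.
u = rng.uniform(0.0, 0.5, 1000)
y = np.roll(u, 1)                           # toy target: recall the previous input
X = run_reservoir(u)
w_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(N), X.T @ y)
y_hat = X @ w_out
```

The reservoir weights W and w_in are never adapted; training reduces to a linear least-squares problem for w_out, which is the computational advantage emphasized above.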
Classical neural network approaches are not able to naturally incorporate such a memory dependence, although it would be possible to spatialize the temporal dimension, i.e. to additionally feed the inputs of times t - 1, ..., t - n into the network at time t. So all in all, reservoir computing approaches seem to be highly suitable for forecasting time series.

3.1.4 The advantages of reservoir computing

Additionally, reservoir computing offers great advantages from a computational machine learning point of view. Most of the classical artificial neural network approaches work on a discrete time scale [4], whereas reservoir computing, as we will see in more detail later, allows in principle for computations on continuous input streams. Furthermore, consider the error function of multi-layer backpropagation networks, which often exhibits multiple local minima. Also consider simple single-node classifiers (e.g. the perceptron or Support Vector Machines), which are often unable to separate inputs in a low dimension [4]. According to the Cover theorem [6], the probability of the separability of inputs increases with the number of dimensions the input is projected into in a non-linear fashion. The reservoir weights are fixed, and the reservoir serves as a memory-dependent non-linear transformation. As the read-out is connected to (a subset of) the reservoir nodes, and the number of nodes it is connected to should exceed the number of dimensions of the reservoir input, the reservoir can be seen as a non-linear memory-dependent kernel which helps the read-out classifier to separate inputs by non-linearly projecting them into a higher-dimensional space. To sum up, in theory reservoir computing overcomes the problems of multi-layer backpropagation networks (local minima in the error function) and of simple single-node classifiers (linear separability).

3.2 Mackey Glass

3.2.1 Introduction

A special kind of reservoir was employed in this thesis. The reservoir computing approach used here is an instance of a Liquid State Machine (LSM) and originates from the PHOCUS project. PHOCUS is an acronym which stands for "towards a PHOtonic liquid state machine based on delay-CoUpled Systems". The computational nodes of the system are delay-coupled dynamical systems which execute the non-linear transformations for the reservoir. The following chapter is based on [3].

In ordinary reservoir computing, the topology of the reservoir is often random but fixed. Computational units are often sparsely connected, and there are basically no constraints on the spatial topology of the network.
The activations of all reservoir nodes at a certain point in time t make up the state of the reservoir at t. The number of reservoir nodes determines the number of dimensions the input is projected into. In the PHOCUS LSM, however, spatial multiplexing is substituted by temporal multiplexing, which in principle means that all computations are carried out by a single node (at the same point in space) at different points in time, whereas in classical LSMs multiple nodes carry out the computations at the same time at different positions in space. This technique allows the computations to be carried out by a single laser, but it also imposes some limitations on the topology of the resulting network. All PHOCUS reservoirs show a special ring structure of virtual nodes.
One might now have the impression that the maximum number of dimensions the input can be projected into is 1, since the number of computational units is 1. If that were the case, the computational power of the system would be disastrous, since according to [6] the higher the number of dimensions an input is projected into, the higher the chance of linear separability. This problem is overcome by considering another temporal dimension of the system and by introducing virtual nodes. As stated above, in classical reservoir computing every point in time t is associated with a reservoir state x(t) containing the activations of all reservoir nodes at time t. In the PHOCUS LSM, due to temporal multiplexing, the state of the reservoir x(t) is induced by the continuous state of the computational node in the interval (t-1)\tau \le s \le t\tau, such that the state of the single computational unit during one \tau makes up one reservoir state. The PHOCUS LSM consists of virtual nodes. They are called virtual because their state is simply the time-delayed state of the single computational node. During one \tau, the single computational node carries out the computation of every virtual node. Thus, if the system consists of N virtual nodes, one \tau is divided into N time frames of length \theta = \tau / N. \theta is also called the virtual node distance. Figure 5 shows an exemplary state of the computational node during one \tau. One \tau induces one reservoir state. The activation of the nth virtual node is dependent on the state of the computational node during the nth \theta.

3.2.2 How are inputs fed into the system?

In classical LSMs, the input u(t) at time t is fed in parallel to all nodes. Due to temporal multiplexing in the PHOCUS LSM, the input is fed in a serial manner to the different virtual nodes, that is, to the computational node. The discrete input stream u is stretched using a sample-and-hold technique to the length of \tau, such that I(t) = u(k) for k\tau \le t < (k+1)\tau.
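The sample-and-hold stretching and the read-out of virtual node activations can be sketched as follows (our own illustration; here the "continuous" trace is just a discretely sampled array, and N, \theta and the input values are arbitrary):

```python
import numpy as np

N, theta = 10, 8            # virtual nodes and virtual node distance (samples)
tau = N * theta             # duration of one reservoir state

def sample_and_hold(u, tau):
    """Stretch the discrete input u so that I(t) = u[k] for k*tau <= t < (k+1)*tau."""
    return np.repeat(u, tau)

def reservoir_states(trace, N, theta):
    """Read one reservoir state per tau: the activation of the n-th virtual
    node is the trace value at the end of the n-th theta-interval."""
    tau = N * theta
    n_states = len(trace) // tau
    # index of the last sample of each theta bin, within each tau window
    idx = np.arange(n_states)[:, None] * tau + (np.arange(N) + 1) * theta - 1
    return trace[idx]       # shape: (n_states, N)

u = np.array([-1.0, 2.0, -2.5, 1.0])
I = sample_and_hold(u, tau)             # length 4 * tau
X = reservoir_states(I, N, theta)       # 4 reservoir states of dimension N
```

In the real system the trace between read-out points is the continuous response of the non-linear node, not the held input itself; the sketch only shows how one \tau is carved into N virtual node activations.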
The input weights of classical LSMs are encoded by a mask function in the PHOCUS LSM. The mask function M(k) is piece-wise constant over the length of one virtual node distance \theta and periodic over \tau, such that M(k + \tau) = M(k). The values of M(k) over one \theta are taken randomly from some probability distribution:

M(k) = W^{res}_{in,i} \quad \text{for} \quad i\theta \le k < (i+1)\theta.

This ensures that the input weight for a single virtual node is constant over different inputs. If the input is one-dimensional, the input to the computational node is given by J(t) = I(t) \cdot M(t). If the input is multi-dimensional, a mask is created for every input dimension and the input to the computational node is given by the sum of the products of the masks and the respective input dimensions:

J(t) = \sum_j I^j(t) \cdot M^j(t).

The different input dimensions are collapsed onto a
one-dimensional stream J(t), which may cause information loss. In order to avoid that, one should enforce a constraint on the input weights (the different values M^j(t) can take): if A is a matrix and A_{i,j} denotes the ith input weight of input dimension j (A_{i,j} = M^j(i\theta)), then rank(A) should be equal to the number of input dimensions. This constraint avoids information loss, since every input stream u^j(t) could then be reconstructed given A and J(t).

Figure 5: Exemplary state of the computational node during one \tau. If N is the number of virtual nodes the reservoir consists of, then one \tau is divided into N (in this case 10) bins. The states of the computational node at the end points of those bins, depicted by the dashed lines, denote the activations of the respective virtual nodes.

3.2.3 How is the reservoir input non-linearly transformed?

The virtual nodes of the system are, as stated above, delay-coupled dynamical systems. In order to understand how the computational node non-linearly transforms the reservoir input, one has to understand what delay-coupled dynamical systems are and how they function. Based on [9], there are two types of dynamical systems: iterated maps and differential equations. For this thesis, knowledge about the latter is sufficient. A differential equation describes a function where the value of the function
is deterministically related to the derivative of the function; it, so to say, defines the rules of the evolution of a point in space over time. The space the point evolves in is called phase space. A typical example of a differential equation, taken from the field of biology, models the population of a certain species. The growth rate of the population (the derivative of the total population) depends on the size of the total population. More cats create more kittens :). If, for example, the population increases by 10% of the total population in one time step, this could be expressed by f'(t) = 0.1 f(t). A natural question which might arise in the context of this simple scenario is what the population in s time steps is, given an initial population f(t_0) = x_0 (initial value problem). There are two different approaches to this problem. First, one could analytically solve the differential equation, which can often be done by integration, and compute the solution. This is not always possible.

Figure 6: Visualization of the masking process for an exemplary one-dimensional input [-1, 2, -2.5, 1]. First the input is stretched to the length of \tau (80 in this example). Then it is multiplied by a, in this case bipolar, mask which is periodic over \tau and piece-wise constant over \theta. The dashed lines denote the beginning of a new \tau.
The second approach is to numerically approximate the solution. There are many methods to do that. One of them is called "Heun's method". The general idea of Heun's method is simple. If we want to approximate the function at f(t0 + s) with f(t0) = x0, we could evaluate the derivative at t0 and assume that it remains constant over s, which leads to f̃(t0 + s) = x0 + s·f′(t0), where f̃(x) is the approximation of f(x). It is obvious that the derivative is in most cases not constant in the interval t0 ≤ t ≤ t0 + s, and we can refine the approximated solution by dividing s into n parts and iteratively computing f̃(t_{i+1}) using f′(t_i), with t_i = t_{i−1} + s/n, n times, until t_i = t0 + s. By doing so, one only assumes that the derivative remains constant over s/n instead of s. It follows that if s were divided into infinitely many parts, the approximation f̃(t0 + s) would be equal to the real f(t0 + s). Another refinement is to estimate the average value of the derivative in the interval t0 ≤ t ≤ t0 + s. An admissible estimate of the average value of the derivative, if f′(t0) and f′(t0 + s) were known, would be ½(f′(t0) + f′(t0 + s)). The problem is that in order to compute f′(t0 + s), f(t0 + s) has to be known, but f(t0 + s) is exactly what we wanted to compute in the first place. A remedy for that problem is to estimate f(t0 + s) by f̃(t0 + s), then use f̃(t0 + s) to approximate f′(t0 + s), and use this estimate to approximate the average value of the derivative in the interval t0 ≤ t ≤ t0 + s. The two refinements can be combined, and the resulting method is a 2-stage iterative process: let t_{i+1} = t_i + s/n and f(t0) = x0. Compute f′(t_i) using f(t_i), starting with t_i = t0, and use this to estimate f̃(t_{i+1}) = f(t_i) + (s/n)·f′(t_i). Then estimate f̃′(t_{i+1}) using f̃(t_{i+1}) and compute f(t_{i+1}) = ½(f′(t_i) + f̃′(t_{i+1}))·(s/n) + f(t_i). Repeat those steps until t_i = t0 + s.
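The two-stage predictor-corrector scheme just described can be sketched in a few lines of Python (function and variable names are my own choices for illustration, not taken from the thesis):

```python
def heun(fprime, x0, t0, s, n):
    """Approximate f(t0 + s) for the ODE f'(t) = fprime(t, f(t)),
    starting from f(t0) = x0, using n predictor-corrector steps of size s/n."""
    h = s / n
    t, x = t0, x0
    for _ in range(n):
        k1 = fprime(t, x)              # derivative at the start of the step
        x_tilde = x + h * k1           # Euler predictor for f(t + h)
        k2 = fprime(t + h, x_tilde)    # derivative estimated at the end of the step
        x = x + 0.5 * h * (k1 + k2)    # corrector: average of the two slopes
        t += h
    return x

# Population example from the text: f'(t) = 0.1 f(t), exact solution x0 * e^(0.1 s)
approx = heun(lambda t, x: 0.1 * x, x0=1.0, t0=0.0, s=10.0, n=100)
```

With n = 100 steps the result is already very close to the exact value e^1 ≈ 2.71828; the global error of Heun's method shrinks quadratically with the step size s/n.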
Up to now, we have investigated ordinary differential equations (ODEs), but the differential equation employed for the non-linear transformation in the reservoir is a delayed differential equation (DDE). The difference between an ordinary differential equation and a delayed differential equation is that the derivative at time t of a DDE is not only dependent on the value of the function at time t but also on delayed value(s) of the function. DDEs can be employed where the cause-and-effect relation is delayed, for example in a better model of the population size of a species in which the sexual maturity of the individuals of the species is accounted for. Only an adult cat can create new kittens :). The delayed differential equation responsible for the non-linearity in the reservoir depends on a single delayed function value, thus it is of the form
f′(t) = g(f(t), f(t − τ)). The phase space of such a DDE is infinite-dimensional because it depends on the continuous initial history in the interval t0 − τ ≤ t ≤ t0. Since past values of f(t) are known in the context of this application, the initial value problem can also be solved by Heun's method. The DDE responsible for the non-linear projections in the reservoir is the Mackey-Glass equation, but with a few changes:

f′(t) = ( η·(f(t − τ) + γ·J(t)) / (1 + (f(t − τ) + γ·J(t))^p) − f(t) ) / T,

where J(t) is the external reservoir input at time t from the definition earlier, and η, γ, T and p are adjustable parameters. The non-linear transformations of a single virtual node are induced by "following" the Mackey-Glass equation for a time θ (virtual node distance).

3.2.4 How are virtual nodes interconnected?

Figure 7: The characteristic topology of a PHOCUS LSM. Strong weights form a ring structure, but basically all neurons are interconnected. Additionally, all neurons exhibit a self-loop.

In order to understand how the virtual nodes are interconnected, what it means that two neurons are connected should be clarified first: neuron i is connected to neuron j if the activation of neuron i somehow influences the activation of neuron j. Applying Heun's method gives a very easy to understand explanation of how the virtual nodes are interconnected by the inertia of the dynamical system: recall that the single computational node executes the computations of every virtual node during one τ. One τ is divided into N θs, and the state of the ith virtual node is equal to the state of
the computational node at the end of θ_i. The computational node follows the Mackey-Glass equation. Let us assume that the computational node has followed the Mackey-Glass equation until the end of θ_i (thus, t = nτ + iθ with n ∈ ℕ) and suppose we now apply the simplest (Euler-like) stage of Heun's method for following the Mackey-Glass equation for θ, namely f(t + θ) = f(t) + θ·f′(t). The fact that the (i + 1)th virtual node is coupled to the ith virtual node becomes apparent when examining f(t + θ) = f(t) + θ·f′(t), since f(t) and f(t + θ) are in this case the state of the computational node at the end of θ_i and θ_{i+1}, i.e. the activation of the ith and (i + 1)th virtual node respectively. The external reservoir input J(t) is injected by influencing f′(t) (see the definition of f′(t)). One might now get the impression that the (i + 1)th virtual node is connected to the ith virtual node of the previous time step, since f(t − τ) is also part of f′(t), but one should consider that f(t − τ + ε) with 0 < ε ≤ θ denotes the activation of the (i + 1)th virtual node in the previous time step, and a more fine-grained way to compute f(t + θ) would predominantly take those values into account. Thus, the (i + 1)th virtual node actually exhibits a self-loop. Consider now that the computational node computes the state of the (i + 2)th virtual node. The (i + 2)th virtual node is connected to the (i + 1)th virtual node, but because the (i + 1)th virtual node is connected to the ith virtual node, the (i + 2)th virtual node is also connected to the ith virtual node, although to a much smaller degree. To sum up, the topology of a PHOCUS LSM is very restricted, namely to a very specific ring structure where every neuron is connected to itself and basically all neurons are interconnected, but strong connections are present from the ith to the (i + 1)th virtual node. Analytically solving the differential equation and bringing the activation into the form of classical LSMs confirms these considerations.
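The time multiplexing described above can be sketched numerically. The following is my own simplified illustration, not the thesis implementation: it uses plain Euler steps of length θ, p = 1, an arbitrary constant initial history, and illustrative parameter values; the delayed term f(t − τ) is simply the node state one τ (i.e. N bins) earlier, held in a circular buffer.

```python
def run_reservoir(J_stream, N=10, theta=0.2, eta=0.4, gamma=0.05, p=1, T=1.0):
    """Euler-integrate the modified Mackey-Glass delay equation with one step
    per virtual node. J_stream holds one masked input value J per theta-bin."""
    buf = [0.1] * N            # circular buffer: node states of the last tau
    f = 0.1                    # state of the computational node
    states = []
    for i, J in enumerate(J_stream):
        f_delayed = buf[i % N]                         # f(t - tau)
        drive = f_delayed + gamma * J                  # delayed state plus input
        fprime = (eta * drive / (1.0 + drive ** p) - f) / T
        f = f + theta * fprime                         # one Euler step over theta
        buf[i % N] = f                                 # becomes f(t - tau) next tau
        states.append(f)                               # activation of virtual node i mod N
    return states
```

Each entry of `states` is the activation of one virtual node; consecutive entries are coupled through the shared state f (the ring structure), while the buffer supplies the delayed self-coupling.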
Let x_i(k) be the activation of the ith neuron at time k; then the activation is given by

x_i(k) = e^{−iθ} x_N(k − 1) + Σ_{j=1}^{i} Δ_{ij} f(x_j(k − 1), u(k)),   with Δ_{ij} = (1 − θ) e^{−(i−j)θ}.

This equation shows that the weight between neurons decreases exponentially with their distance. Furthermore, it shows that θ is of great importance for the coupling. In addition to θ, T (the timescale of the MG equation) seems to be of great importance. T scales the differential equation: if T is big, the step the dynamical system takes in each time step is small, and vice versa. Let us again consider the simplest way to estimate the state of the dynamical system over one θ: f(t + θ) = f(t) + θ·f′(t). With a growing θ, the influence of f′(t) on f(t + θ) increases, and thus the degree of coupling to the previous node (the influence of f(t) on f(t + θ)) decreases. In this case, the influence of the self-loop and the external input, which are embedded in f′(t), overshadows the coupling
with the previous node. If θ is too small, the external input and the auto-coupling lose their influence and the coupling to the previous node becomes too strong, which results in very similar virtual node states since, colloquially speaking, the differential equation did not have enough time to evolve. T can be used to counter those effects. Empirical investigations back up these considerations and suggest that T = 5θ, θ = 0.2, τ = 80, γ = 0.05, η = 0.4 and N = 400 is a good setup which provides, in addition to reasonable coupling to the previous node and a reasonable self-loop, also good integration of the external input.

3.2.5 How is the output generated and how are the output weights trained?

The states during one τ make up one reservoir state, and the activation of the ith virtual node is equal to the state of the computational node after θ_i. Thus, if x_i(k) is again the activation of the ith node at time k and f(t) is the reservoir state at time t, then x_i(k) = f((k − 1)τ + iθ − ε), where ε is very small compared to θ. ε accounts for the fact that the computations at iθ are already responsible for the activation of the (i + 1)th node. The activations of all reservoir nodes at time k can be seen as a vector x(k), and it has to be mapped to the desired or teacher output y(k). In most applications a linear transformation is employed. There are numerous linear models which can in principle be used. The model used in this thesis is called the General Linear Model [1]. General Linear Models can be used under the assumption that the target output y(k) is deterministically related to the reservoir state x(k) with some error ε, thus y(k) = l(x(k), w) + ε. ε is in this case assumed to be Gaussian noise, i.e. it has zero mean and variance σ² (precision β = 1/σ²), and l is a linear function of the form l(x(k), w) = w_0 + Σ_{i=1}^{N} w_i·x_i(k); thus it is dependent on the model parameter vector w.
Because of the fact that the error ε stems from a Gaussian distribution, we can write for the probability of a target y(k) given model parameters w, reservoir state x(k) and precision β:

P(y(k)|w, x(k), β) = N(y(k) | l(x(k), w), β⁻¹),   with N(y|µ, σ²) = (1/√(2πσ²)) exp{ −(1/(2σ²)) (y − µ)² }.

It immediately follows that for the expectation of the teacher output given a reservoir state it holds that E(y(k)|x(k)) = l(x(k), w), which will be helpful when determining the output of the system once the parameters w are learned.
The main question remains how w is extracted from the data. Imagine the different teacher outputs y(k) are grouped into a column vector y with y_k = y(k), so the kth component of y equals y(k), and imagine that the reservoir states are grouped into a matrix x, where the kth row is x(k). We want to maximize the probability of y given x, w and β. Given that the y_k are independent and identically distributed, the joint probability is given by the product of the marginal probabilities, thus:

P(y|x, w, β) = Π_{i=1}^{N} P(y_i|x_i, w, β).

P(y|x, w, β) is called the likelihood function, and what we are trying to find is w = arg max_w P(y|x, w, β). Since the logarithm is a monotonic function, maximizing the logarithm of the likelihood equals maximizing the likelihood:

arg max_w P(y|x, w, β) = arg max_w ln P(y|x, w, β) = arg max_w ln Π_{i=1}^{N} P(y_i|x_i, w, β) = arg max_w Σ_{i=1}^{N} ln P(y_i|x_i, w, β).

Plugging in the definition of P(y_i|x_i, w, β) yields

w = arg max_w Σ_{i=1}^{N} ln( √(β/(2π)) exp{ −(β/2) (y_i − l(w, x_i))² } ).

Simplification results in:

w = arg max_w ( (N/2) ln β − (N/2) ln(2π) − Σ_{i=1}^{N} (β/2) (y_i − l(w, x_i))² ).

It is obvious now that we can maximize this expression with respect to w by minimizing E(w) = Σ_{i=1}^{N} (y_i − l(w, x_i))² = Σ_{i=1}^{N} (y_i − wᵀx_i)² (we are omitting w_0 in this case for simplicity, but one could imagine adding a leading column of 1s to x, in such a way that x_0·w_0 = w_0). This expression is also called the sum-of-squares error function.

In order to compute the minimum of E(w), we apply well-known techniques to find the minimum analytically, namely we calculate the gradient and set it to 0. We begin by expanding E(w):

E(w) = Σ_{i=1}^{N} (y_i − wᵀx_i)² = Σ_{i=1}^{N} (y_i − Σ_{j} w_j x_{i,j})² = Σ_{i=1}^{N} ( y_i² − 2 y_i Σ_{j} w_j x_{i,j} + (Σ_{j} w_j x_{i,j})² ).

For the gradient in the kth direction it now holds that:

∂E(w)/∂w_k = Σ_{i=1}^{N} ( −2 y_i x_{i,k} + 2 x_{i,k} Σ_{j} w_j x_{i,j} ).
Setting this to 0 and dividing by −2 yields:

0 = Σ_{i=1}^{N} ( y_i x_{i,k} − x_{i,k} Σ_{j} w_j x_{i,j} ) = Σ_{i=1}^{N} y_i x_{i,k} − Σ_{j} x_{∗,j}ᵀ x_{∗,k} w_j.

Since the formula for each component of w is the same, we can rewrite the equation above as:

0 = xᵀy − xᵀxw.

This step can easily be verified by deriving what holds for the kth row of 0 = xᵀy − xᵀxw. Solving for w now yields

w = (xᵀx)⁻¹ xᵀy.

(ΦᵀΦ)⁻¹Φᵀ is also called the Moore-Penrose pseudo-inverse of the matrix Φ and can be seen as a generalization of the notion of the matrix inverse to non-square matrices.

3.2.6 Why is weak stationarity a necessary condition for learnability with LSMs?

The question why detrending is essential for being able to model time series with LSMs has not yet been addressed. The answer can be found by looking at the assumption which was made when deriving the General Linear Model. In order to state the likelihood of the conditional probability of y as the product of the marginal probabilities, the assumption that the y_k are independent and identically distributed was made. If you now recall the definition of weak stationarity, namely that the mean, variance and covariance functions are constant over time, it becomes obvious why this is a necessary condition: if the time series is not stationary, either the mean, variance or covariance function cannot be constant, thus the y_k cannot be identically distributed, which is a necessity for the General Linear Model to be applied.

4 Detrending techniques

4.1 What happens when no detrending is employed?

In order to establish a baseline for the performance of the detrending techniques, the performance of the model without detrending has to be established. The performance will be evaluated by four criteria: the normalized root mean square error (NRMSE), the correlation, and the averaged potential gain (apg) and rapg, which were defined in chapter 1.
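As a concrete illustration of the least-squares training described in section 3.2.5, which is what estimating w on the training set amounts to throughout this chapter, here is a toy NumPy sketch on synthetic data (the data and names are invented for the example, not taken from the thesis experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "reservoir states": 200 samples of 3 features, plus a leading column of 1s
# so that the bias w0 is absorbed into w, as described in the text.
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])
w_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ w_true + 0.01 * rng.normal(size=200)   # teacher output with Gaussian noise

# Normal equations w = (X^T X)^{-1} X^T y via the Moore-Penrose pseudo-inverse
w_hat = np.linalg.pinv(X) @ y
```

In practice `np.linalg.lstsq` is numerically preferable to forming the pseudo-inverse explicitly, but `pinv` mirrors the closed-form expression derived above; with this low noise level `w_hat` recovers `w_true` to within about 10⁻³ per component.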
For the stock price prediction, the desired output (target) is equal to the input stream with a time shift such that y_t = u_{t+1}, where y_t is the target and u_t is the input stream. Thus, the task of the General Linear Model is to find a relation between the reservoir state after injecting
u_t and u_{t+1}. For the artificial tasks, the mapping between u_t and y_t is approximated. See chapter 2 for the definitions of y_t. The meta-parameters of the system are set to N = 400, τ = 80, θ = 0.2, γ = 0.05, η = 0.4, T = 1, and these settings are kept the same when evaluating the different detrending techniques. The time series is separated into a training set (first 80%) and an evaluation set (last 20%). The model parameters w are estimated during the training phase and are based solely on the training set. The above-mentioned performance criteria are computed solely on the basis of the evaluation set.

4.1.1 Results - no detrending - non-stationarity in mean

Figure 8: The evaluation set (blue graph) and its prediction without detrending (red graph) of a target with non-stationarity in mean.

Figure 8 shows the evaluation set and its prediction of the target with non-stationarity in mean. One can easily see that the two graphs look similar but that the predicted values seem to be shifted downwards. These visual cues are supported by the performance measures: although the prediction and the actual values operate on two different levels, the correlation between them is quite high at 0.77. This supports the impression that the two graphs oscillate around their means in tune. The huge NRMSE of 3.73 as well as the small rapg of 3.7 · 10⁻⁴ reveal that the two graphs nevertheless operate on two different levels. Apparently, their means are
not equal. Intuitively, it is possible to interpret the results in the following way: the mean of the target increases steadily (see Figure 1) and the model parameters were estimated on the first 80% of the target, where the mean was still small. When predicting the last 20% of the target, the level of the values is underestimated. This theory is backed up by comparing the mean of the training set (0.5326) with the mean of the prediction (0.5314). The mean of the desired output (0.888) is dramatically underestimated.

4.1.2 Results - no detrending - non-stationarity in variance

Figure 9: Section of the evaluation set (blue graph) and its prediction (red graph) of a target with non-stationarity in variance and no detrending.

Similar results are found when modelling a target process with non-stationary variance. Figure 9 shows a section of the desired and actual output of the system. The actual values seem to vary more strongly than their prediction, but the directions of the variations seem to be in tune. Again, these considerations are backed up by the performance measures: the correlation between the prediction and the actual values is fairly high at 0.67, whereas the NRMSE with a value of 0.78 indicates shortcomings of the model. The high rapg value of 0.77 is very surprising but can probably be explained in the following way: since the time series has a stationary mean, the prediction (mean of ∼0) seems to share the mean of the actual target (mean of ∼0), and because the oscillations are generally in tune (reinforced by the fairly high correlation), the predictions of the directions (sign of y_{t+1} − y_t) are pretty accurate. The rapg basically measures the accuracy of the prediction of the direction of the
difference y_{t+1} − y_t, weighted by the actual distance between y_{t+1} and y_t. The difference in variances can be interpreted intuitively, similarly to the difference in means of the time series with non-stationary mean: during the model estimation phase, the variance was a lot smaller than in the evaluation phase. The system has no means to learn the time-dependency of the mean or variance.

4.1.3 Results - no detrending - non-stationarity in mean and variance

Figure 10: Section of the evaluation set (blue graph) and its prediction (red graph) of a target with non-stationarity in variance and mean, without detrending.

The results of modelling a target with non-stationarity in mean and variance are not surprising. The mean as well as the variance seem to be underestimated, while the general direction of the variations of the predicted time series seems to be in accordance with the desired target. A fairly good correlation between the predicted and actual values of 0.56, a big NRMSE of 2.2 and an rapg of 0.117 reinforce these considerations. Figure 10 shows a plot of a section of the predicted target (red graph) and the desired target (blue graph).

4.1.4 Results - no detrending - natural economic time series

The results when modelling the natural economic time series are more surprising: Figure 11 shows a section of the prediction and the actual values of the Google stock. One can see that the prediction seems to lag behind the
Figure 11: Prediction without detrending of the Google stock (red graph) and the actual Google stock (blue graph).

actual values. This intuition is reinforced when looking at the cross-correlation: the highest cross-correlation is found when shifting the prediction one time step into the future. But what does that mean? If one recalls that the input and desired output of the system are the same time series shifted by one time step, such that y_t = u_{t+1}, and the prediction ŷ lags one time step behind the actual values, such that ŷ_{t+1} ≈ y_t, one can easily see that the system is basically just recreating the input series, since ŷ_t ≈ u_t. Despite the fact that the system is basically just recreating the input signal, the classical performance measures do not indicate shortcomings of the model: the correlation between the actual and predicted values is, at 0.9978, very close to 1, whereas the NRMSE, at 0.067, is close to 0. This extraordinary characteristic is explainable by the huge autocorrelation the time series exhibits. Note that if ŷ_t ≈ y_{t−1}, the correlation between ŷ_t and y_t basically measures the autocorrelation of y_t with a time lag of 1. A plot of the autocorrelation function reveals that the autocorrelation decays very slowly. A slowly decaying autocorrelation function is a well-known trend indicator [2], and detrending seems to be an admissible effort to overcome this extraordinary behavior of the system. Also, the rapg of 0.37 seems to be quite high in comparison to the other models, but still only 37% of the possible profits are made, and the apg of 0.176 shows that the model does not seem to be profitable in a stock exchange environment. On average, a profit of €0.17 is made in every time step per stock. Considering that the transaction fee of most depots
is approximately 0.23% [12] of the volume traded, the average price per stock is €506, and a stock can on average be held for 1.63 time steps before it must be sold, one loses €0.87 per stock exchanged. Detrending has to show whether the model performance can be significantly improved and whether betting on the prediction can be made profitable. A vague, intuitive explanation of these results requires more knowledge about the results of the different detrending techniques and is therefore moved to chapter 5.

4.1.5 Summary results - no detrending

Table 1 shows a summary of the performance of the system without detrending. One can easily see that the model performances for the different tasks are unsatisfactory and that the investigation of detrending techniques is justified.

            ns mean        ns variance   ns mean & var   google
  corr      0.77           0.67          0.56            0.9978
  NRMSE     3.73           0.78          2.2             0.067
  apg       −3.7 · 10⁻⁵    0.115         0.0191          0.176
  rapg      −4.7 · 10⁻⁴    0.77          0.1172          0.37

Table 1: Comparison of the performance of the system when no detrending was employed.

4.2 On the expressiveness of the results

In the remainder of this chapter, the performance of the different detrending techniques will be evaluated and then compared to the case where no detrending was employed. The results can vary from trial to trial, and there are basically two characteristics that cause variability in the performance results. The first source of variability is the random mask, which corresponds to the input weights in classical LSMs. These input weights are drawn from a probability distribution, thus they vary from trial to trial. In order to account for this variability, the experiments would have to be redone with different input masks. The problem with this approach is the fact that the input weights influence the reservoir dynamics, and recalculating the reservoir states is very time consuming. Computing the reservoir states of 7000 data points takes approximately 20 minutes on an Intel Core i3-2310M machine with 4GB of RAM.
Experience has shown that different instantiations of bipolar input masks have only a very small influence on the performance of the system
anyway. For the sake of computational tractability, the influence of the random input mask on the performance is assumed to be 0.

The second source of variability stems from the partition of the time series into the training and evaluation set. Typically, cross-validation is employed when evaluating the performance of machine learning systems, i.e. numerous partitions are made and the different results are averaged in order to cancel out the effects of arbitrarily dividing the time series. This would not require recomputing the reservoir states. But still, in the context of these experiments this cannot be done, for the following reason: if we assume a stock exchange scenario where the trend may have an influence and we divide the time series in a way that the evaluation set is not the last part of the time series, we are basically incorporating future information which in a real-world scenario would not be accessible. Note that the characteristics of the time series change over time and the gist of this thesis is to somehow get rid of the time dependence of the time series! We are not incorporating future information in a way that actual information about future values somehow influences past values (like filtering the time series with an acausal filter would do) but in a more subtle way. This concept is probably best understood with an example: imagine a time series with a linear trend is divided in such a way that the first and last 40% of the time series belong to the training set and the remaining 20% are used to evaluate the model. In the previous section of this thesis, we got the impression that the system seems to assume that the variability and mean it has "experienced" in the training set generalize to the evaluation set.
In order to minimize the overall quadratic error on the training set, the system will have to treat the first part of the training set (first 40% of the time series) equally to the second part (last 40% of the time series) and thus assume the mean of the time series to be somewhere between the means of the first and second part of the training set. When the performance is now evaluated, although the time series still exhibits a trend and the system has no means to learn the time dependence of the time series, the performance will be quite good, because the mean of the evaluation set actually is somewhere between that of the first and second part of the training set, due to the linearity of the trend.

All in all, there is no tractable way to account for the variability of the results, and thus we have no approach backed up by statistics to see whether the performance of a detrending model actually lies outside of the variability of the performance of the non-detrending model. What we can do is quantitatively compare the different models and assume declined or improved model performance if the model performances are drastically different.
In general, the methodology of this thesis allows only for existentially quantified statements. Propositions of the kind "this detrending technique can be used to improve the model performance for all time series with non-stationarity in mean" are, by the nature of an empirical instead of an analytical investigation, not possible. But the complexity of such systems makes analytical investigations very hard, and considering the fact that all natural sciences are based on empirical studies and induction, this approach is justified in the opinion of the author. On top of that, the interpretation of the empirical results can often be backed up by analytical considerations that are generalizable to other scenarios.

4.3 Bipolarized target

In the previous chapter, we have learned that in order to enable an LSM to model time series, the y_k (the target) have to be independent and identically distributed. If we assume a stock exchange scenario, the information whether the stock value will rise or fall is sufficient to place a bet; thus a very straightforward approach is to, in a way, bipolarize the target. A target value of 1 denotes that the time series will rise in the next time step, 0 denotes no change of the time series, and a target value of −1 denotes that the target will fall. Mathematically speaking, the detrended target stream y becomes y_t = sgn(y_{t+1} − y_t). If we assume that the statistics of whether the time series will increase or decrease are constant over time, the mean and variance will be constant and thus we have successfully detrended the time series. We cannot expect the predictions to be exactly −1, 0 or 1, thus they have to be mapped to those values in order to evaluate the model: an arbitrary threshold ε is introduced, and if a predicted value is below −ε it is said to be −1, if it is greater than ε it is set to 1, and all other values are mapped to 0. In this case, ε = 0.25 was chosen.
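The bipolarizing step and the ε-threshold mapping can be sketched as follows (the function names are illustrative, not from the thesis):

```python
def bipolarize(y):
    """Detrended target: sign of the next-step change, y'_t = sgn(y_{t+1} - y_t)."""
    def sgn(d):
        return (d > 0) - (d < 0)   # returns 1, 0 or -1
    return [sgn(y[t + 1] - y[t]) for t in range(len(y) - 1)]

def threshold(predictions, eps=0.25):
    """Map raw readout values to -1, 0 or 1 using the certainty threshold eps;
    values inside (-eps, eps) are treated as 'uncertain' and mapped to 0."""
    return [1 if p > eps else -1 if p < -eps else 0 for p in predictions]
```

For example, `bipolarize([1.0, 1.5, 1.5, 0.2])` gives `[1, 0, -1]`, and `threshold([0.6, 0.1, -0.3])` gives `[1, 0, -1]`; the fraction of 0s after thresholding is the uncertainty measure reported in Table 2.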
The threshold ε for the mapping could, in a real-world stock exchange application, be obtained by reinforcement learning techniques: a value function over different values of ε could be learned by observing which apg values the different ε produce. So, in a way, this detrending technique could still be considered a parameter-free technique.

4.3.1 Results - bipolarized target

Table 2 shows the resulting performances with a bipolarized target for the different tasks. The last row depicts the percentage of 0's in the prediction after it was mapped to −1, 0 or 1 with ε = 0.25. This value can in some way be interpreted as a measure of uncertainty. The actual target streams
very rarely contain 0's (around 0.1%), and the more 0's are contained in a prediction, the more uncertain the system is about the prediction.

          ns mean         ns variance      ns mean & var     google
  apg     0.05 (+0.05)    0.043 (−0.072)   0.047 (+0.028)    −6.37 · 10⁻⁶ (−0.176)
  rapg    0.64 (+0.64)    0.29 (−0.45)     0.29 (+0.17)      −1.34 · 10⁻⁵ (−0.37)
  0's     30.7%           54%              51%               99.4%

Table 2: Comparison of the performance of the system when the target was bipolarized. The numbers in brackets denote the difference in performance compared to no detrending. The last row denotes the ratio of 0's after the prediction was rounded.

In the case of a target which is non-stationary in mean, the performance dramatically increases. 64% of all possible profits are made, while in 30% of the cases the prediction could not exceed the certainty threshold. The problem that the mean was drastically underestimated seems to be overcome. The percentage of 0's, and thus the level of uncertainty, increases even more with a target with non-stationary variance. The overall performance drops severely in comparison to no detrending, which may be due to the high uncertainty of 54%: it is hard to make profits when you do not make bets. The results of modelling a target with non-stationary mean and variance are comparable. The apg increases, probably because the mean is not underestimated anymore, but the high level of uncertainty most likely encumbers high apg values. Bipolarizing the target when predicting the economic time series yields disastrous results. The huge uncertainty of 99.4% averts any profits.

4.4 Differencing

Differencing is a well-known detrending technique which is often employed when modelling time series with classical statistical models [8]. The concept behind differencing is very easy to understand and can be thought of as an extension of bipolarizing. Imagine two pairs of consecutive time points.
Imagine that in both cases the time series rises, but in one case it only increases by a small amount, while in the other case the time series makes a huge jump. After bipolarizing, both pairs would be represented by a 1. The reservoir states, which are of course also dependent on past inputs, might be very diverse; nevertheless the linear readout neuron tries to map both of them to 1. In order to minimize the overall error, one of the two actual outputs
may have to be pushed below the certainty threshold, and in an unfortunate scenario this happens to the one which would generate a huge profit. When bipolarizing, the linear readout tries to maximize the probability of identifying whether or not the time series will rise or fall (see 3.2.5). But why not weigh this probability with the potential profit which could be generated when it is predicted correctly? Thus, we multiply the bipolarized target with the potential profit:

y^d_t = sgn(y_{t+1} − y_t)·|y_{t+1} − y_t| = y_{t+1} − y_t.

4.4.1 Results - differencing

           ns mean          ns variance     ns mean & var    google
  corr     0.72 (−0.05)     0.3 (−0.37)     0.293 (−0.26)    0.997 (−0.008)
  NRMSE    0.70 (−3.03)     1.09 (+0.31)    1.09 (−1.11)     0.074 (+0.007)
  apg      0.058 (+0.058)   0.058 (−0.05)   0.06 (+0.04)     0.0245 (−0.15)
  rapg     0.73 (+0.73)     0.39 (−0.38)    0.367 (+0.25)    0.052 (−0.32)

Table 3: Comparison of the performance of the system when the target was differenced. The numbers in brackets denote the difference in performance compared to no detrending.

In order to make the correlation and the NRMSE comparable, the differenced values were used to reconstruct the resulting time series in a way that previous errors do not accumulate, i.e. for the reconstructed time series y^r_t it holds that y^r_t = y_t + y^d_t (and not y^r_t = y_0 + Σ_{i=0}^{t} y^d_i).

Differencing yields fairly good results when trying to predict a target with non-stationary mean. 73% of all possible profits could have been made in a stock exchange scenario. The NRMSE dropped by a factor of approximately 5 in comparison to no detrending but is still fairly high at 0.7. The pretty good results can be explained by recalling how the non-stationary mean was induced (see 2.2.1 for a definition). A linear time dependency was added to the existing NARMA10 task. The time dependency responsible for the linear trend vanishes by differencing, since:

y¹_{t+1} − y¹_t = y_{t+1} + 0.0001(t + 1) − (y_t + 0.0001t) = y_{t+1} − y_t + 0.0001.
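Differencing and the non-accumulating reconstruction just described can be sketched as follows (illustrative names, not the thesis code):

```python
def difference(y):
    """First difference of the target: y^d_t = y_{t+1} - y_t."""
    return [y[t + 1] - y[t] for t in range(len(y) - 1)]

def reconstruct(y, yd_pred):
    """Rebuild the predicted series from predicted differences without letting
    earlier errors accumulate: y^r_t = y_t + y^d_t (not a cumulative sum)."""
    return [y[t] + yd_pred[t] for t in range(len(yd_pred))]
```

For example, `difference([1.0, 2.0, 4.0, 7.0])` gives `[1.0, 2.0, 3.0]`, and reconstructing with perfectly predicted differences returns the shifted original series `[2.0, 4.0, 7.0]`. Applied to a series with an additive linear trend, the trend term reduces to a constant shift in the differences, exactly as in the derivation above.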
What should be pointed out here is that the first difference of the target with non-stationary mean is equal to the first difference of an ordinary NARMA10 task with a small shift in mean. The NRMSE for the ordinary NARMA10 task is reported to be 0.15 [3]. Thus, the representation of the target as the difference seems to be harder to learn, since the NRMSE more than quadruples.

When the time dependency is multiplicative, i.e. when the corresponding time series exhibits (at least) non-stationarity in variance, it is not possible to overcome it by differencing: $y^2_{t+1} - y^2_t = y_{t+1}(t+1) - y_t t$ (scaling coefficient omitted for simplicity). It is easy to see that the dependence on $t$ cannot be abolished in this case. This explains the bad performance for the target with non-stationary variance seen in table 3. The impression that differencing makes learning the NARMA10 task harder is reinforced by the fact that the performance actually drops and does not merely remain the same.

The overall performance of learning a target with non-stationary mean and variance increases in terms of NRMSE and rapg but is impaired in terms of correlation. These findings are not very surprising, considering that the additive time dependency is abolished by differencing: $y^3_{t+1} - y^3_t = y_{t+1}(t+1) - y_t t + 0.0001$ (scaling coefficient again omitted for simplicity). The resulting time series is stationary in mean; thus the problem of underestimating the mean is overcome, which improves the NRMSE drastically, as we have seen earlier. The fact that the correlation decreases is again in accordance with the impression that the differenced NARMA10 task is harder for the system to learn than the non-differenced one.

Differencing the natural economic time series yields very unsatisfactory results. The performance drops with respect to all performance measurements. The correlation and NRMSE still suggest an admissible model, but the rapg reveals that betting on the model would generate almost no profits. Note that these results emphasize the importance of the introduction of the additional performance measurement rapg, since the classical performance measurements fail to reveal shortcomings of the model in a stock exchange scenario. Interpreting these results is very hard since little is known about the characteristics of the time series. One should recognize, however, that the system was unable to find a relationship between past values and the difference of future values ($y_{t+1} - y_t$). This fact will be useful in a later argument.

4.4.2 Implicit assumptions when differencing

Classical statistical approaches suggest using the second difference, i.e. the difference of the differenced series $y^{2d}_{t+1} = (y_{t+1} - y_t) - (y_t - y_{t-1})$, if a time series cannot be made stationary by taking the first difference [8]. We have seen above that one implicitly assumes a linear additive trend when taking the first difference. But what do we implicitly assume when using a second difference? It is easy to see that the second difference successfully abolishes a quadratic additive trend:

$y^{2d}_{t+1} = \big((y_{t+1} + (t+1)^2) - (y_t + t^2)\big) - \big((y_t + t^2) - (y_{t-1} + (t-1)^2)\big) = y_{t+1} - 2y_t + y_{t-1} + 2.$
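The second-difference computation can be verified with a small sketch; the base series is an illustrative stand-in, not real NARMA10 output.

```python
# Sketch: the second difference cancels a quadratic additive trend t^2,
# leaving only the constant (t+1)^2 - 2t^2 + (t-1)^2 = 2.
# The base series is an illustrative stand-in, not real NARMA10 output.

def difference(series):
    return [b - a for a, b in zip(series, series[1:])]

base = [0.2, 0.25, 0.21, 0.3, 0.28, 0.33, 0.31]
trended = [y + t ** 2 for t, y in enumerate(base)]

second_base = difference(difference(base))
second_trended = difference(difference(trended))

# The quadratic time dependency collapses to the constant 2.
residuals = [st - sb for sb, st in zip(second_base, second_trended)]
print(all(abs(r - 2.0) < 1e-12 for r in residuals))  # True
```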
  • 34. To sum up, differencing is not assumption free and it is only guaranteed to work if there is a either a linear (first difference) or quadratic trend (second difference). In order to successfully detrend a time series by differencing, the time series at hand has to be analyzed and the type of non-stationarity has to be identified. 4.5 Log differencing Classical stastical approaches often transform the time series by taking the logarithm first and then differencing is applied [8]. For the resulting target stream holds then: yld t+1 = log(yt+1) − log(yt). This approach bears a fun- damental problem: the logarithm of a negative value is not defined in the real numbers and since the PHOCUS LSM operates in the reals, the log- difference is only defined for positive time series. The resulting performance of log-differencing can therefore only be evaluated by the natural economic time series, since it is the only strictly positive time series. When applying log-differencing the identification of the implicit assumption is not such an easy endeavor. What we can show is that the time dependency has to vanish by dividing instead of subtracting two successive data points in comparison to normal differencing. If we assume a linear multiplicative trend (→ (at least) non-stationarity in variance): yld t+1 = log(yt+1(t+1))−log(ytt) = log( yt+1 yt + yt+1 ytt ), the time dependency is not fully abolished but the influence of t declines with a growing t. 4.5.1 Results - log-differencing The reults for predicting the log-differenced target of the natural economic time series are disillusioning: the NRMSE and correlation still misleadingly indicate a very good model whereas the rapg of -0.33 reveals the desastrous ef- fects betting on that model would have generated. Note that in order to make the performance measurements comparable the actual time series was recon- structed by reversing the detrending operations. 
To sum up, log-differencing requires the time series to be positive and it is not easily comprehensible when log-differencing is guaranteed to abolish a time dependency. 34
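A minimal sketch of the operation, with illustrative price values; it also shows why strict positivity is required:

```python
# Sketch: log-differencing equals the log return log(y_{t+1}/y_t) and is
# only defined for strictly positive series. Values are illustrative.
import math

def log_difference(series):
    if any(y <= 0 for y in series):
        raise ValueError("log-differencing requires a strictly positive series")
    return [math.log(b) - math.log(a) for a, b in zip(series, series[1:])]

prices = [100.0, 102.0, 101.0, 105.0, 104.0]
returns = log_difference(prices)

# log(y_{t+1}) - log(y_t) == log(y_{t+1} / y_t)
print(all(abs(r - math.log(b / a)) < 1e-12
          for r, (a, b) in zip(returns, zip(prices, prices[1:]))))  # True
```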
          google
cor       0.997 (-0.008)
NRMSE     0.074 (+0.007)
apg       -0.15 (-0.32)
rapg      -0.33 (-0.7)

Table 4: Comparison of the performance of the system when the target was log-differenced. The numbers in brackets denote the difference in performance to no detrending.

4.6 High-pass filter

4.6.1 What kind of non-stationarities are induced by low-frequency oscillations?

The next detrending technique is inspired by signal processing. If we assume that non-stationarities are induced by low-frequency oscillations, high-pass filtering the time series could be an admissible technique to detrend it. A high-pass filter cuts off frequencies below a certain threshold; frequencies above that threshold pass the filter almost undamped. But how sound is the assumption that low-frequency oscillations induce non-stationarities? In order to investigate the effect of high-pass filtering, the following steps are taken:

1. Transformation of the time series into the frequency domain.
2. Cut-off of low frequencies.
3. Investigation of the effect of cutting off low frequencies.

The Fourier transform and the inverse Fourier transform are employed to transform the series into the frequency and time domain respectively. Imagine a stationary time series superposed by an additive trend of the form $at^n$, such that $y^{non\,stat}_t = y^{stat}_t + at^n$. In the following it will be shown that an additive trend of the form $at^n$ for $n \in \mathbb{N}$ and $n < 4$ predominantly affects the amplitude of the low-frequency spectrum, and that high-pass filtering is therefore a sound approach to abolish the time dependency of the time series. Since the trend component is additive, the two components can be Fourier transformed and high-pass filtered independently. For the amplitude of frequency $\omega$ holds $A(\omega) = |F(\omega)|$ [14].

Case 1: $n = 1$. Since $t \geq 0$, we can write for the Fourier transform of $at$: $F_{n=1}(\omega) = \int_0^\infty at\,e^{-2\pi i t \omega}\,dt = -\frac{a}{(2\pi\omega)^2}$ [13]. With growing $\omega$, $|F(\omega)|$ decreases; thus after high-pass filtering with a suitable cut-off frequency, $\forall\omega\, F^{hp}_{n=1}(\omega) \approx 0$, because the amplitudes of the low-frequency spectrum are set to 0, and for big values of $\omega$, $F_{n=1}(\omega) \approx 0$ anyway. Let $F^{hp}_{y^{stat}}(\omega)$ be the high-pass filtered Fourier image of the stationary component; then $F^{hp}_{y^{non\,stat}}(\omega) = F^{hp}_{y^{stat}}(\omega) + F^{hp}_{n=1}(\omega) \approx F^{hp}_{y^{stat}}(\omega)$. Hence, the time-dependent component vanishes by high-pass filtering, and therefore high-pass filtering is an admissible approach to diminish the effect of $at$ on $y^{non\,stat}_t$. The argumentation for $2 \leq n < 4$ is analogous, and only the Fourier transform of $at^n$ will be given:

Case 2: $n = 2$. $F_{n=2}(\omega) = \int_0^\infty at^2 e^{-2\pi i t \omega}\,dt = -\frac{ia}{4\pi^3\omega^3}$ [13]. Thus $A_{n=2}(\omega) = \frac{a}{4\pi^3\omega^3}$ decreases strongly with growing $\omega$.

Case 3: $n = 3$. $F_{n=3}(\omega) = \int_0^\infty at^3 e^{-2\pi i t \omega}\,dt = -\frac{3a}{8\pi^4\omega^4}$ [13]. Again, the greater $\omega$, the smaller $A_{n=3}(\omega)$.

Thus the additive trend $at^3$ predominantly affects the low-frequency spectrum of $y^{non\,stat}_t$, which is cut off by high-pass filtering. These considerations can probably be extended to all $n$, which is left as an exercise to the reader (the proof involves solving $\int_0^\infty at^n e^{-2\pi i t \omega}\,dt$).

Figure 12: High-pass filtered target with non-stationary mean.

These theoretical considerations are backed up by investigating the plot of a high-pass filtered target with non-stationary mean. Figure 12 shows the plot of the high-pass filtered target with non-stationary mean defined in 2.2.1. The linear upward trend seems to have vanished.

Let us now consider what happens if a target with a multiplicative time dependency of the form $at^n$ is high-pass filtered. Multiplication in the time domain is equal to convolution in the frequency domain [14], and this poses a direct problem: although the time-dependent component is approximately 0 in the frequency domain ($\forall\omega\, F^{hp}_n(\omega) \approx 0$), it holds that $F^{hp}_{y^{non\,stat}}(\tau) = \int_{-\infty}^{\infty} F^{hp}_{y^{stat}}(\omega)\, F^{hp}_n(\tau - \omega)\,d\omega \neq F^{hp}_{y^{stat}}(\tau)$; thus the time-dependent component still has an influence and the detrending seems to have failed. These considerations can again be backed up by investigating the plot of the high-pass filtered target with non-stationary variance defined in 2.2.2.

Figure 13: High-pass filtered target with non-stationary variance.

It is clearly visible that the variance of the high-pass filtered target with non-stationary variance still seems to increase over time and that detrending has failed. These findings are in accordance with the earlier theoretical considerations.

4.6.2 Characteristics of the high-pass filter at hand

The high-pass filter employed for detrending in this thesis was created using the fdatool included in MATLAB. A causal equiripple finite impulse response filter with cutoff and pass frequencies of 10% and 26% of the sampling frequency respectively was chosen. fdatool is used to compute the filter coefficients $h_i$. The filtered signal is equal to the discrete convolution of the time series and the filter coefficients, such that $y^{filtered}_t = \sum_{i=0}^{n} h_i y_{t-i}$. Note that the high-pass filtering takes place in the time domain and that the cutoff and pass frequencies are parameters which were in this case chosen by hand. Thus high-pass filtering is neither parameter- nor assumption-free: we assume an additive trend of some form, and the cutoff and pass frequencies have to be chosen in accordance with the steepness of the additive trend. The cutoff frequency $\omega_{cut\,off}$ should be chosen in such a way that $\forall \omega > \omega_{cut\,off}: F_n(\omega) \approx 0$. A rule of thumb seems to be that the steeper the trend, the lower the cutoff frequency. This relation can be verified by investigating the definitions of the frequency spectra of the trend components.
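The discrete convolution in 4.6.2 can be sketched with a toy 3-tap high-pass kernel (coefficients summing to 0, hence zero gain at DC); this is not the thesis's 35-tap equiripple design, just an illustration of how a constant (trend-like) component is suppressed:

```python
# Sketch: the discrete convolution y_f[t] = sum_i h[i] * y[t-i] from 4.6.2,
# applied to a constant series. The 3-tap kernel is a toy high-pass filter
# (coefficients sum to 0), not the thesis's 35-tap equiripple design.

h = [0.5, -1.0, 0.5]          # zero gain at frequency 0 (sum(h) == 0)
y = [3.0] * 10                # a pure "trend"/DC component

y_f = [sum(h[i] * y[t - i] for i in range(len(h)) if t - i >= 0)
       for t in range(len(y))]

# Once t >= len(h) - 1 the filter has settled and the constant component
# is suppressed entirely.
print(y_f[len(h) - 1:])  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```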
The order of the filter (the number of filter coefficients) is 35. Thus, after filtering, the target to predict incorporates information of the last 35 time steps. On top of that, since the artificial tasks are derivations of the NARMA10 task, they incorporate information of the last 10 inputs; thus, in order for the system to be able to predict the target correctly, the information of the last 45 inputs must be available in the reservoir. Experimental results suggest that this exceeds the memory capacity of the reservoir. In order to overcome this problem, the input as well as the target are high-pass filtered. Doing so significantly improved the performance, because by high-pass filtering the input series, information of the last 35 inputs is made present in the current input.

4.6.3 Reconstructing the time series

After the high-pass filtered target is predicted, the corresponding time series (or at least the information whether the time series will rise) has to be reconstructed from that prediction. For this purpose, the filter operations are inverted by solving $y^{filtered}_t = \sum_{i=0}^{n} h_i y_{t-i}$ for $y_t$, which yields:

$y_t = \left(y^{filtered}_t - \sum_{i=1}^{n} h_i y_{t-i}\right) / h_0,$

with $y^{filtered}_t$ being the prediction of the filtered target. In order to avoid accumulating errors, the actual values of past time steps $y_{t-i}$ (which are available at time $t$) are used instead of the reconstructed predictions.

4.6.4 Results - high-pass filtering

In order to make the correlation and NRMSE comparable, table 5 shows the performance measures after reconstructing the corresponding time series from the high-pass filtered prediction.
          ns mean          ns variance      ns mean & var    google
cor       0.05 (-0.72)     -0.03 (-0.7)     -0.027 (-0.58)   0.358 (-0.64)
NRMSE     31.97 (+28.24)   83.1 (+81.41)    84.76 (+82.5)    3.32 (+3.25)
apg       0.005 (+0.005)   -0.004 (-0.119)  -0.0034 (-0.02)  0.011 (-0.16)
rapg      0.062 (+0.062)   -0.027 (-0.79)   -0.02 (-0.14)    0.023 (-0.35)

Table 5: Comparison of the performance of the system when the target was high-pass filtered and reconstructed afterwards. The numbers in brackets denote the difference in performance to no detrending after the corresponding time series was reconstructed.

Predicting the high-pass filtered target led to very poor performance on all tasks. What is peculiar is that for all artificial tasks the NRMSE exploded and the correlation dropped to approximately 0. The performance for predicting the economic time series is better in comparison to the artificial tasks, but still bad in comparison to other detrending techniques. How can these results be interpreted? Why did this detrending technique fail so badly? Let us have a look at the performance of the system before the actual time series has been reconstructed, displayed in table 6.

          ns mean   ns variance   ns mean & var   google
cor       0.925     0.438         0.45            0.658
NRMSE     0.38      0.908         0.93            0.77

Table 6: Performance when predicting the high-pass filtered target before the actual time series is reconstructed by inverting the filter operations.

The performance of predicting the filtered target before reconstructing the actual time series is a lot better. Especially predicting the filtered target with non-stationary mean seems to work well. This impression is in accordance with the earlier theoretical considerations, namely that an additive trend of the form $at^n$ can be abolished by high-pass filtering.

The reason for the bad performance seems to reside not in predicting the filtered target but in reconstructing the actual time series. When reconstructing the time series, the error is scaled and thus amplified. Consider that the prediction of the target equals the desired target plus some error $\epsilon$: $\hat{y}^{filtered}_t = y^{filtered}_t + \epsilon$. If we now recall how the actual time series was reconstructed and plug in the definition of $\hat{y}^{filtered}_t$, it is easy to see that the error is amplified:

$\hat{y}_t = \left((y^{filtered}_t + \epsilon) - \sum_{i=1}^{n} h_i y_{t-i}\right) / h_0 = y^{filtered}_t / h_0 + \epsilon / h_0 - \sum_{i=1}^{n} h_i y_{t-i} / h_0.$

The only source of error in this case is $\epsilon$, and for the high-pass filter used it holds that $h_0 = -0.0084$; thus the error is amplified by a factor of approximately 120.
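The reconstruction step of 4.6.3 and the error amplification by $1/h_0$ can be sketched together. The 3-tap coefficients and series below are toy values (the thesis's filter has 35 coefficients with $h_0 = -0.0084$, i.e. a factor of roughly 119); for the toy filter, $1/h_0 = 2$.

```python
# Sketch: reconstructing y_t from the filtered prediction, as in 4.6.3,
# and how an error eps on that prediction is scaled by 1/h_0.
# Toy coefficients and series; the thesis's h_0 = -0.0084 gives ~119x.

def fir_filter(h, series):
    return [sum(h[i] * series[t - i] for i in range(len(h)) if t - i >= 0)
            for t in range(len(series))]

def reconstruct(h, filtered_value, past):
    """Solve y_f[t] = sum_i h[i]*y[t-i] for y[t], using true past values."""
    acc = sum(h[i] * past[-i] for i in range(1, len(h)) if i <= len(past))
    return (filtered_value - acc) / h[0]

h = [0.5, -1.0, 0.5]
y = [0.2, 0.4, 0.1, 0.3, 0.25]
y_f = fir_filter(h, y)

# An exact filtered value yields an exact reconstruction of y[4]:
exact = reconstruct(h, y_f[4], y[:4])
# A small prediction error eps is amplified by 1 / h[0]:
eps = 0.001
noisy = reconstruct(h, y_f[4] + eps, y[:4])
print(abs(exact - y[4]) < 1e-12, round((noisy - exact) / eps, 6))  # True 2.0
```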
The question of how to reconstruct the actual time series, or how to extract the information whether the time series will increase, is a fundamental problem when predicting a high-pass filtered target. A straightforward idea would be to also predict the residuals of the filtered time series: $y^{res}_t = y_t - y^{filtered}_t$. This idea is flawed because the residuals contain the time-dependent component of the time series and can therefore, of course, not be predicted by the PHOCUS LSM. Another straightforward approach would be to create a separate model of the residuals. If we, for example, assume a linear additive trend, $y^{res}_t = ct + b + \epsilon$, the parameters $c$ and $b$ could be estimated by linear regression in the evaluation phase. In theory this seems a sound approach, but one cannot expect the residuals to solely contain the time-dependent component. Figure 14 shows a plot of the residuals of the high-pass filtered artificial task with non-stationary mean. One can see that a linear model for the residuals is very coarse and would not lead to satisfying results.

Figure 14: Residuals of a high-pass filtered target with a linear additive trend.

5 Conclusion

5.1 Summary

This thesis investigated three kinds of detrending techniques, namely bipolarizing, differencing and high-pass filtering. None of the investigated detrending techniques is assumption- and parameter-free. Bipolarizing can be considered assumption- and parameter-free if means to obtain the decision threshold are available (e.g. reinforcement learning). Still, this thesis was not able to find a detrending technique which, in conjunction with a PHOCUS LSM, gives good results in every scenario. There seems to be no fire-and-forget detrending technique that can handle non-stationary mean or variance or both. The type of non-stationarity has to be identified in order to choose a suitable detrending technique.

Because a detrending technique has to be chosen depending on the type of non-stationarity at hand, the findings of this thesis are summarized in the following paragraph and a recommendation which detrending technique to employ is given for each type of non-stationarity. These considerations are based on the assumption that the incentive in modelling is ultimately betting on the target, and thus that the rapg is the most informative performance measure. If one encounters a target which can safely be assumed to exhibit a linear or quadratic additive trend, first and second differencing respectively seem to be the most promising detrending techniques (experimental rapg of 0.73 for a linear additive trend). If the time dependence is multiplicative and the mean is constant, it is recommended not to detrend at all: the performance without detrending is already good (experimental rapg of 0.77) and detrending seems only to do harm. However, if a time series exhibits a multiplicative as well as an additive time-dependent component, differencing is recommended in order to at least get rid of the additive sub-component (experimental rapg of 0.36).

5.2 On the natural time series

None of the detrending techniques was able to improve the profit that betting on the prediction of the natural time series would have generated. Considering that detrending was able to improve the performance for the artificial tasks (non-stationary mean and mean & variance), or that the performance was already quite good (non-stationary variance), the assumption that detrending caused the unsatisfying performance is questionable. Of course, the time dependence could also be of some non-linear form which was not considered in this thesis. On top of that, the extraordinary behavior when predicting the natural time series, namely that the input series seems to be recreated (i.e. the actual output is one time step behind the desired output), is still unexplained. One core statistical model for time series is the autoregressive model (AR): $x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t$ [2]. The $\phi_i$ are called model parameters, $p$ is the order of the AR model, the $\epsilon_t$ are white noise (mean 0, standard deviation $\sigma_\epsilon$) and $c$ is a constant.
Thus, an autoregressive model can be seen as a filter. In 3.1.3, we have learned that reservoir computers are in principle capable of approximating every time-invariant filter with fading memory. Consider now an AR(1) (order 1) process $y_t = \phi y_{t-1} + \epsilon_t$ and imagine an LSM which perfectly predicts this process, i.e. the LSM predicts the expectation of $y_t$ with perfect knowledge of $\phi$: $E[y_t] = E[\phi y_{t-1} + \epsilon_t] = E[\phi y_{t-1}] + E[\epsilon_t] = \phi y_{t-1}$, because $E[\epsilon_t] = 0$ (white noise). The perfect prediction of $y_t$ is a scaled version of $y_{t-1}$! If the natural time series follows an AR(1) process, then its perfect prediction is a scaled version of the current value, i.e. the perfect prediction seems to reconstruct the input series. There is not enough structure in the time series for it to be properly predicted, i.e. the ratio of $\epsilon_t$ to $y_t$ is too large. This theory explains why the prediction seems to lag behind the actual values and why detrending did not bring about the desired effects.
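A short simulation can back this up: for an AR(1) process, even the optimal predictor $\phi y_{t-1}$ is only marginally better than simply repeating the previous value, which is exactly the "lagging" behavior observed. The value of $\phi$, the noise level and the series length are illustrative choices, not estimates from the Google series.

```python
# Sketch: for an AR(1) process y_t = phi*y_{t-1} + eps_t, the best possible
# one-step prediction is phi*y_{t-1} -- a scaled copy of the previous value.
# phi, noise level and length are illustrative assumptions.
import random

random.seed(0)
phi = 0.9
y = [0.0]
for _ in range(5000):
    y.append(phi * y[-1] + random.gauss(0.0, 1.0))

# "Perfect" model prediction vs. a naive copy of the previous value:
pred_model = [phi * y[t - 1] for t in range(1, len(y))]
pred_copy = [y[t - 1] for t in range(1, len(y))]

mse_model = sum((a - b) ** 2 for a, b in zip(pred_model, y[1:])) / len(pred_model)
mse_copy = sum((a - b) ** 2 for a, b in zip(pred_copy, y[1:])) / len(pred_copy)

# The optimal prediction's error is the noise variance (about 1 here), and
# lagging the input behind by one step is only slightly worse.
print(mse_model < mse_copy)  # True
```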
A Normalized root mean square error

Let $y$ and $\hat{y}$ be vectors of length $n$ and $\sigma^2_y$ be the variance of $y$; then the normalized root mean square error (NRMSE) is defined as:

$\mathrm{NRMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n\,\sigma^2_y}}$

B On NRMSE and correlation

The fact that the NRMSE and correlation are not always very informative in a stock exchange scenario is best explained by an example. Imagine a time series with a linear upward trend where the first 50% of the time series lie below its mean and the last 50% above. Given a prediction, the correlation between the actual time series and its prediction is a measure of their mutual oscillation around their respective means. Imagine both time series, the actual target and its prediction, share their mean, and the stock decreases from time point $t$ to $t+1$. If the model predicts that the time series will increase by a very tiny amount, the correlation might still be high if the value of the prediction for $t+1$ is still below the mean, although betting on that prediction would generate a loss. At the same time, the NRMSE might also not be very informative, since it is basically a normalized measure of the distance between two time series. Imagine two distinct predictions and a time series which slightly decreases from time $t$ to $t+1$ by $\epsilon$. Prediction 1 might correctly predict a decrease of the time series but dramatically overestimate it, whereas prediction 2 might predict a slight increase. The NRMSE of prediction 2 might be smaller than the NRMSE of prediction 1, although in this case dramatically overestimating the decrease would still generate a profit, whereas wrongly predicting a slight increase would generate a loss.
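The definition in appendix A can be written as a short function. The example values are illustrative; note that always predicting the target mean yields an NRMSE of exactly 1, which is why values near or above 1 indicate an uninformative model.

```python
# Sketch of the NRMSE from appendix A: root mean squared error of a
# prediction, normalized by the variance of the target. Example values
# are illustrative.
import math

def nrmse(target, prediction):
    n = len(target)
    mean = sum(target) / n
    var = sum((y - mean) ** 2 for y in target) / n
    mse = sum((y, p) is None or (y - p) ** 2 for y, p in zip(target, prediction)) / n if False else \
          sum((y - p) ** 2 for y, p in zip(target, prediction)) / n
    return math.sqrt(mse / var)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = target[:]
constant = [3.0] * 5              # always predicting the target mean

print(nrmse(target, perfect))   # 0.0
print(nrmse(target, constant))  # 1.0
```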
C Signum function

$\mathrm{sgn}(x) := \begin{cases} +1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$ [15]

References

[1] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

[2] T. Mills, N. Markellos, The Econometric Modelling of Financial Time Series, Cambridge University Press, 2008

[3] Towards a PHOtonic liquid state machine based on delay-CoUpled Systems, Deliverable D4, 2010

[4] Tom M. Mitchell, Machine Learning, McGraw-Hill Science/Engineering/Math, 1997

[5] M.D. Mauk, D.V. Buonomano, The neural basis of temporal processing, Annu. Rev. Neurosci. 27, 2004

[6] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Transactions on Electronic Computers EC-14, 1965

[7] D. Buonomano, W. Maass, State-dependent computations: Spatiotemporal processing in cortical networks, Nature Reviews Neuroscience, Volume 10, 2009

[8] Alan Pankratz, Forecasting With Univariate Box-Jenkins Models, John Wiley & Sons, 1983

[9] Steven H. Strogatz, Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering, Westview Press, 1994

[10] A. F. Atiya, A. G. Parlos, New results on recurrent network training: Unifying the algorithms and accelerating convergence, IEEE Trans. Neural Networks, vol. 11, 2000

[11] Wolfgang Maass, Henry Markram, On the computational power of circuits of spiking neurons, 2004

[12] http://www.online-broker-vergleich.de/vergleich.php, 14.12.2011, 15:00

[13] http://www.wolframalpha.com/, 15.12.2011, 15:00

[14] http://en.wikipedia.org/wiki/Fourier_transform, 15.12.2011, 15:00

[15] http://de.wikipedia.org/wiki/Signum_(Mathematik), 15.12.2011, 15:00

[16] H. Jaeger, The "echo state" approach to analysing and training recurrent neural networks, GMD Report 148, German National Research Center for Information Technology, 2001

[17] W. Maass, T. Natschläger, H. Markram, A model for real-time computation in generic neural microcircuits, Proc. of NIPS 2002, Advances in Neural Information Processing Systems, MIT Press, 2003
Hereby I confirm that I wrote this thesis independently and that I have not made use of any other resources or means than those indicated.

Hiermit bestätige ich, dass ich die vorliegende Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel verwendet habe.

(Ort, Datum)                                    (Unterschrift)