Computational Intelligence Methods
for Time Series Prediction
Second European Business Intelligence Summer
School (eBISS 2012)
Gianluca Bontempi
Département d’Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di
Computational Intelligence Methods for Prediction – p. 1/83
Outline
• Introduction (5 mins)
• Notions of time series (30 mins)
• Machine learning for prediction (60 mins)
• bias/variance
• parametric and structural identification
• validation
• model selection
• feature selection
• Local learning
• Forecasting: one-step and multi-step-ahead (30 mins)
• Some applications (30 mins)
• time series competitions
• chaotic time series
• wireless sensor
• biomedical
• Future directions and perspectives (15 mins)
Computational Intelligence Methods for Prediction – p. 2/83
ULB Machine Learning Group (MLG)
• 8 researchers (2 prof, 5 PhD students, 4 postdocs).
• Research topics: Knowledge discovery from data, Classification, Computational
statistics, Data mining, Regression, Time series prediction, Sensor networks,
Bioinformatics, Network inference.
• Computing facilities: high-performance cluster for analysis of massive datasets,
Wireless Sensor Lab.
• Website: mlg.ulb.ac.be.
• Scientific collaborations in ULB: Bioinformatique des génomes et des réseaux
(IBMM), CENOLI (Sciences), Microarray Unit (Hôpital Jules Bordet), Laboratoire
de Médecine expérimentale, Laboratoire d’Anatomie, Biomécanique et
Organogénèse (LABO), Service d’Anesthésie (ERASME).
• Scientific collaborations outside ULB: Harvard Dana Farber (US), UCL Machine
Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), Helsinki
Institute of Technology (FIN).
Computational Intelligence Methods for Prediction – p. 3/83
ULB-MLG: recent projects
1. Adaptive real-time machine learning for credit card fraud detection
2. Discovery of the molecular pathways regulating pancreatic beta cell dysfunction
and apoptosis in diabetes using functional genomics and bioinformatics: ARC
(2010-2015)
3. ICT4REHAB - Advanced ICT Platform for Rehabilitation (2011-2013)
4. Integrating experimental and theoretical approaches to decipher the molecular
networks of nitrogen utilisation in yeast: ARC (2006-2010).
5. TANIA - Système d’aide à la conduite de l’anesthésie (decision support for the
conduct of anaesthesia). WALEO II project funded by the Région Wallonne (2006-2010)
6. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems),
a MARIE CURIE Early Stage Research Training project funded by the EU (2004-2008).
Computational Intelligence Methods for Prediction – p. 4/83
Time series
Definition A time series is a sequence of observations, usually ordered in
time.
Examples of time series
• Weather variables, like temperature, pressure
• Economic factors.
• Traffic.
• Activity of business.
• Electric load, power consumption.
• Financial index.
• Voltage.
Computational Intelligence Methods for Prediction – p. 5/83
Why study time series?
There are various reasons:
Prediction of the future based on the past.
Control of the process producing the series.
Understanding of the mechanism generating the series.
Description of the salient features of the series.
Computational Intelligence Methods for Prediction – p. 6/83
Univariate discrete time series
• Quantities, like temperature and voltage, change in a continuous way.
• In practice, however, the digital recording is made discretely in time.
• We shall confine ourselves to discrete time series (which however take
continuous values).
• Moreover we will consider univariate time series, where one type of
measurement is made repeatedly on the same object or individual.
Computational Intelligence Methods for Prediction – p. 7/83
A general model
Let an observed discrete univariate time series be y1, . . . , yT . This means
that we have T numbers which are observations on some variable made at T
equally distant time points, which for convenience we label 1, 2, . . . , T.
A fairly general model for the time series can be written
yt = g(t) + ϕt t = 1, . . . , T
The observed series is made of two components
Systematic part: g(t), also called signal or trend, which is a deterministic
function of time
Stochastic sequence: a residual term ϕt, also called noise, which follows a
probability law.
Computational Intelligence Methods for Prediction – p. 8/83
Types of variation
Traditional methods of time-series analysis are mainly concerned with
decomposing the variation of a series yt into:
Trend : this is a long-term change in the mean level, e.g. an increasing trend.
Seasonal effect : many time series (sales figures, temperature readings) exhibit
variation which is seasonal (e.g. annual) in period. Measuring and
removing such variation yields deseasonalized data.
Irregular fluctuations : after trend and cyclic variations have been removed
from a set of data, we are left with a series of residuals, which may or
may not be completely random.
We will assume here that once we have detrended and deseasonalized the
series, we can still extract information about the dependency between the
past and the future. Henceforth ϕt will denote the detrended and
deseasonalized series.
Computational Intelligence Methods for Prediction – p. 9/83
[Figure: “Decomposition of additive time series” showing the observed series, trend, seasonal and random components over 1960-1990.]
Decomposition returned by the R package forecast.
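The plot above was produced in R; purely as an illustration, here is a hedged numpy sketch of the classical additive decomposition. The synthetic monthly series, the period of 12 and the simple moving-average trend are all assumptions (a 2×12 centred average is the textbook refinement for even periods).

```python
import numpy as np

rng = np.random.default_rng(0)
period = 12
t = np.arange(240)                                    # 20 years of synthetic monthly data
y = 0.05 * t + 2 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.5, t.size)

# Trend: simple moving average of length equal to the period.
trend = np.convolve(y, np.ones(period) / period, mode="same")

# Seasonal effect: average of the detrended series at each position in the cycle,
# forced to sum to zero over one period.
detrended = y - trend
seasonal = np.array([detrended[i::period].mean() for i in range(period)])
seasonal -= seasonal.mean()
seasonal_full = np.tile(seasonal, t.size // period + 1)[: t.size]

residual = y - trend - seasonal_full                  # irregular (random) component
print(residual.std())
```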
Computational Intelligence Methods for Prediction – p. 10/83
Probability and dependency
• Forecasting a time series is possible because the future depends on the past or,
analogously, because there is a relationship between the future and the
past. However this relation is not deterministic and can hardly be written
in an analytical form.
• An effective way to describe a nondeterministic relation between two
variables is provided by the probability formalism.
• Consider two continuous random variables ϕ1 and ϕ2 representing for
instance the temperature today and tomorrow. We tend to believe that
ϕ1 could be used as a predictor of ϕ2 with some degree of uncertainty.
• The stochastic dependency between ϕ1 and ϕ2 is summarized by the joint
density p(ϕ1, ϕ2) or equivalently by the conditional probability
p(ϕ2|ϕ1) = p(ϕ1, ϕ2) / p(ϕ1)
• If p(ϕ2|ϕ1) ≠ p(ϕ2) then ϕ1 and ϕ2 are not independent or, equivalently,
the knowledge of the value of ϕ1 reduces the uncertainty about ϕ2.
Computational Intelligence Methods for Prediction – p. 11/83
Stochastic processes
• A discrete-time stochastic process is a collection of random variables ϕt,
t = 1, . . . , T defined by a joint density
p(ϕ1, . . . , ϕT )
• Statistical time-series analysis is concerned with evaluating the
properties of the probability model which generated the observed time
series.
• Statistical time-series modeling is concerned with inferring the
properties of the probability model which generated the observed time
series from a limited set of observations.
Computational Intelligence Methods for Prediction – p. 12/83
Strictly stationary processes
• Definition A stochastic process is said to be strictly stationary if the joint
distribution of ϕt1, ϕt2, . . . , ϕtn is the same as the joint distribution of
ϕt1+k, ϕt2+k, . . . , ϕtn+k for all n, t1, . . . , tn and k.
• In other words shifting the time origin by an amount k has no effect on
the joint distribution which depends only on the intervals between
t1, . . . , tn.
• This implies that the distribution of ϕt is the same for all t.
• The definition holds for any value of n.
• Let us see what this means in practice for n = 1 and n = 2.
Computational Intelligence Methods for Prediction – p. 13/83
Properties
n=1 : If ϕt is strictly stationary and its first two moments are finite, we have
µt = µ, σ²t = σ²
n=2 : Furthermore the autocovariance function γ(t1, t2) depends only on the
lag k = t2 − t1 and may be written as
γ(k) = Cov[ϕt, ϕt+k]
In order to avoid scaling effects, it is useful to introduce the
autocorrelation function
ρ(k) = γ(k)/σ² = γ(k)/γ(0)
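A minimal numpy sketch of the sample versions of these quantities, assuming a stationary 1-D array x; dividing by N (rather than N − k) is one common convention.

```python
import numpy as np

def autocovariance(x, k):
    """Sample autocovariance gamma(k) of a 1-D series x at lag k."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    return np.sum((x[: len(x) - k] - mu) * (x[k:] - mu)) / len(x)

def autocorrelation(x, k):
    """Sample autocorrelation rho(k) = gamma(k) / gamma(0)."""
    return autocovariance(x, k) / autocovariance(x, 0)

x = np.random.default_rng(1).normal(size=500)
print([round(autocorrelation(x, k), 3) for k in range(4)])  # roughly [1, 0, 0, 0] for white noise
```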
Computational Intelligence Methods for Prediction – p. 14/83
Weak stationarity
• A less restricted definition of stationarity concerns only the first two
moments of ϕt
Definition A process is called second-order stationary or weakly stationary
if its mean is constant and its autocovariance function depends only on
the lag.
• No assumptions are made about higher moments than those of second
order.
• Strict stationarity implies weak stationarity.
• In the special case of normal processes, weak stationarity implies strict
stationarity.
Definition A process is called normal if the joint distribution of
ϕt1, ϕt2, . . . , ϕtn is multivariate normal for all t1, . . . , tn.
Computational Intelligence Methods for Prediction – p. 15/83
Purely random processes
• A purely random process consists of a sequence of random variables ϕt which are
mutually independent and identically distributed. For each t and k
p(ϕt+k|ϕt) = p(ϕt+k)
• It follows that this process has constant mean and variance. Also
γ(k) = Cov[ϕt, ϕt+k] = 0
for k = ±1, ±2, . . . .
• A purely random process is strictly stationary.
• A purely random process is sometimes called white noise particularly by
engineers.
Computational Intelligence Methods for Prediction – p. 16/83
Example: Gaussian purely random
[Plot: realization of a Gaussian purely random process over 100 time steps.]
Computational Intelligence Methods for Prediction – p. 17/83
Example: Uniform purely random
[Plot: realization of a uniform purely random process on [0, 1] over 100 time steps.]
Computational Intelligence Methods for Prediction – p. 18/83
Random walk
• Suppose that wt is a discrete, purely random process with mean µ and
variance σ²w.
• A process ϕt is said to be a random walk if
ϕt = ϕt−1 + wt
• If ϕ0 = 0 then
ϕt = Σ_{i=1}^{t} wi
• E[ϕt] = tµ and Var[ϕt] = tσ²w.
• As the mean and variance change with t the process is non-stationary.
Computational Intelligence Methods for Prediction – p. 19/83
Random walk (II)
• The first differences of a random walk given by
∇ϕt = ϕt − ϕt−1
form a purely random process, which is stationary.
• The best-known examples of time series which behave like random
walks are share prices on successive days.
Computational Intelligence Methods for Prediction – p. 20/83
Ten random walks
Let w ∼ N(0, 1).
[Plot: ten realizations of a random walk with standard Gaussian increments over 500 time steps.]
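A minimal sketch reproducing the setup of the figure: ten independent random walks built by cumulatively summing standard Gaussian increments (w ~ N(0, 1)), so that mean and variance grow linearly with t.

```python
import numpy as np

rng = np.random.default_rng(42)
T, n_walks = 500, 10
increments = rng.normal(0.0, 1.0, size=(n_walks, T))   # w_t ~ N(0, 1)
walks = np.cumsum(increments, axis=1)                   # phi_t = sum of w_1, ..., w_t (phi_0 = 0)

# Rough empirical check that Var[phi_T] is close to T * sigma_w^2 (noisy with only 10 walks).
print(walks[:, -1].var(), "vs theoretical", T * 1.0)
```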
Computational Intelligence Methods for Prediction – p. 21/83
Autoregressive processes
Suppose that wt is a purely random process with mean zero and variance σ²w.
A process ϕt is said to be an autoregressive process of order p (also an
AR(p) process) if
ϕt = α1ϕt−1 + · · · + αpϕt−p + wt
Note that this is like a multiple regression model where ϕ is regressed not on
independent variables but on its past values (hence the prefix “auto”).
Computational Intelligence Methods for Prediction – p. 22/83
First order AR(1) process
If p = 1, we have the so-called Markov process AR(1)
ϕt = αϕt−1 + wt
By repeated substitution it can be shown that
ϕt = α(αϕt−2 + wt−1) + wt = α²(αϕt−3 + wt−2) + αwt−1 + wt = · · ·
= wt + αwt−1 + α²wt−2 + . . .
Then
E[ϕt] = 0, Var[ϕt] = σ²w (1 + α² + α⁴ + . . . )
Then if |α| < 1 the variance is finite and equals
Var[ϕt] = σ²ϕ = σ²w / (1 − α²)
and the autocorrelation is
ρ(k) = α^k, k = 0, 1, 2, . . .
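A minimal simulation check of these formulas, assuming standard Gaussian noise (σ²w = 1) and α = 0.8: the empirical variance and autocorrelations of a simulated AR(1) path should match σ²w/(1 − α²) and α^k.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, T = 0.8, 20000
w = rng.normal(size=T)

phi = np.zeros(T)
for t in range(1, T):                       # phi_t = alpha * phi_{t-1} + w_t
    phi[t] = alpha * phi[t - 1] + w[t]

# Empirical variance vs sigma_w^2 / (1 - alpha^2), and rho(k) vs alpha^k.
print(round(phi.var(), 3), "vs", round(1.0 / (1 - alpha ** 2), 3))
for k in (1, 2, 3):
    rho_k = np.corrcoef(phi[:-k], phi[k:])[0, 1]
    print(k, round(rho_k, 3), "vs", round(alpha ** k, 3))
```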
Computational Intelligence Methods for Prediction – p. 23/83
General order AR(p) process
We can find again the duality between AR and infinite-order MA processes. Using
the backward shift operator B, the AR(p) process is
(1 − α1B − · · · − αpB^p) ϕt = wt
or equivalently
ϕt = wt / (1 − α1B − · · · − αpB^p) = f(B) wt
where
f(B) = (1 − α1B − · · · − αpB^p)⁻¹ = 1 + β1B + β2B² + . . .
It can be shown that a necessary and sufficient condition for stationarity
is that the roots of the equation
φ(B) = 1 − α1B − · · · − αpB^p = 0
lie outside the unit circle.
Computational Intelligence Methods for Prediction – p. 24/83
Autocorrelation in AR(p)
Unlike the autocorrelation function of an MA(q) process, which cuts off at lag q, the
autocorrelation of an AR(p) process attenuates slowly.
[Plot: absolute value of the autocorrelation of an AR(16) process over 80 lags, decaying slowly.]
Computational Intelligence Methods for Prediction – p. 25/83
Fitting an autoregressive process
Fitting an autoregressive process to a set of data
DT = {ϕ1, . . . , ϕT } requires the resolution of two problems:
1. The estimation of the order p of the process.
2. The estimation of the set of parameters {α1, . . . , αp}.
Computational Intelligence Methods for Prediction – p. 26/83
Estimation of AR(p) parameters
Suppose we have an AR(p) process
ϕt = α1ϕt−1 + · · · + αpϕt−p + wt
Given T observations, the parameters may be estimated by least squares as
α̂ = arg min_α Σ_{t=p+1}^{T} [ϕt − (α1ϕt−1 + · · · + αpϕt−p)]²
In matrix form this amounts to solving the multiple least-squares problem where
Computational Intelligence Methods for Prediction – p. 27/83
Estimation of AR(p) parameters (II)
X =
[ ϕT−1   ϕT−2   · · ·   ϕT−p
  ϕT−2   ϕT−3   · · ·   ϕT−p−1
   ...     ...            ...
  ϕp     ϕp−1   · · ·   ϕ1    ]

Y = [ϕT , ϕT−1 , . . . , ϕp+1]ᵀ      (1)
Most of the considerations made for multiple linear regression still hold in this
case.
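A minimal numpy sketch of this least-squares fit, building X and Y as in equation (1); the simulated AR(2) process used for the sanity check is an assumption.

```python
import numpy as np

def fit_ar(phi, p):
    """Least-squares estimate of AR(p) coefficients from a 1-D series phi."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    # Row for time t contains (phi_{t-1}, ..., phi_{t-p}); target is phi_t, t = p+1, ..., T.
    X = np.column_stack([phi[p - j - 1 : T - j - 1] for j in range(p)])
    Y = phi[p:]
    alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return alpha_hat

# Sanity check on a simulated AR(2) process (true coefficients are an assumption).
rng = np.random.default_rng(3)
true_alpha = np.array([0.6, -0.3])
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = true_alpha @ x[t - 2 : t][::-1] + rng.normal()
print(fit_ar(x, 2))   # should be close to [0.6, -0.3]
```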
Computational Intelligence Methods for Prediction – p. 28/83
The NAR representation
• AR models assume that the relation between past and future is linear
• Once we assume that the linear assumption does not hold, we may
extend the AR formulation to a Nonlinear Auto Regressive (NAR)
formulation
ϕt = f(ϕt−d, ϕt−d−1, . . . , ϕt−d−n+1) + wt
where the missing information is lumped into a noise term w.
• In what follows we will consider this relationship as a particular instance
of a dependence
y = f(x) + w
between a multidimensional input x ∈ X ⊂ R^n and a scalar output y ∈ R.
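A minimal sketch of how the NAR formulation turns a series into a supervised learning dataset: each input x is a window of n lagged values starting at delay d (the toy series and the values n = 3, d = 1 are assumptions), and the output y is the corresponding observation.

```python
import numpy as np

def make_embedding(series, n, d=1):
    """Return (X, y) pairs for phi_t = f(phi_{t-d}, ..., phi_{t-d-n+1}) + w_t."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(d + n - 1, len(series)):
        X.append(series[t - d - n + 1 : t - d + 1][::-1])   # (phi_{t-d}, ..., phi_{t-d-n+1})
        y.append(series[t])                                  # phi_t
    return np.array(X), np.array(y)

X, y = make_embedding(np.sin(np.arange(200) / 5.0), n=3, d=1)
print(X.shape, y.shape)   # (197, 3) (197,)
```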
Computational Intelligence Methods for Prediction – p. 29/83
Nonlinear vs. linear
• AR models assume that the relation between past and future is linear
• The advantages of linear models are numerous:
• the least-squares ˆβ estimate can be expressed in an analytical form
• the least-squares ˆβ estimate can be easily calculated through matrix
computation.
• statistical properties of the estimator can be easily defined.
• recursive formulations for sequential updating are available.
• the relation between empirical and generalization error is known.
• Unfortunately, in real problems it is extremely unlikely that the input and
output variables are linked by a linear relation.
• Moreover, the form of the relation is often unknown and only a limited
amount of samples is available.
Computational Intelligence Methods for Prediction – p. 30/83
Supervised learning
[Diagram: an unknown dependency maps inputs to outputs; a prediction model trained on a dataset of input/output pairs produces predictions whose deviation from the observed outputs defines the prediction error.]
The prediction problem is also known as the supervised learning problem,
because of the presence of the outcome variable which guides the learning
process.
Collecting a set of training data is analogous to the situation where a teacher
suggests the correct answer for each input configuration.
Computational Intelligence Methods for Prediction – p. 31/83
The regression plus noise form
• A typical way of representing the unknown input/output relation is the
regression plus noise form
y = f(x) + w
where f(·) is a deterministic function and the term w represents the
noise or random error. It is typically assumed that w is independent of x
and E[w] = 0.
• Suppose that we have available a training set { xi, yi : i = 1, . . . , N},
where xi = (xi1, . . . , xin), generated according to the previous model.
• The goal of a learning procedure is to find a model h(x) which is able to
give a good approximation of the unknown function f(x).
• But how can we choose h, if we do not know the probability distribution
underlying the data and we only have a limited training set?
Computational Intelligence Methods for Prediction – p. 32/83
A simple example
[Plot: noisy training samples (x, Y) used in the following fits.]
Computational Intelligence Methods for Prediction – p. 33/83
Model 1
[Plot: degree-1 polynomial fit to the training samples.]
Training error = 2, degree = 1
Computational Intelligence Methods for Prediction – p. 34/83
Model 2
[Plot: degree-3 polynomial fit to the training samples.]
Training error = 0.92, degree = 3
Computational Intelligence Methods for Prediction – p. 35/83
Model 3
[Plot: degree-18 polynomial fit to the training samples.]
Training error = 0.4, degree = 18
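A minimal sketch in the spirit of these three slides, assuming a synthetic data-generating function (the true curve behind the original plots is not given): polynomials of degree 1, 3 and 18 are fitted by least squares, and the training error shrinks monotonically with the degree, which is precisely why it cannot be trusted on its own.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-2, 2, 30)
y = x ** 3 + rng.normal(0, 1.0, x.size)        # assumed data-generating process

for degree in (1, 3, 18):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)              # (very high degrees may trigger a RankWarning)
    mse_train = np.mean((y - y_hat) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse_train:.3f}")
```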
Computational Intelligence Methods for Prediction – p. 36/83
Generalization and overfitting
• How to estimate the quality of a model? Is the training error a good
measure of the quality?
• The goal of learning is to find a model which is able to generalize, i.e. able
to return good predictions for input sets independent of the training set.
• In a nonlinear setting, it is possible to find models with such a complicated
structure that they have a null empirical risk. Are these models good?
• Typically NOT, since doing very well on the training set could mean
doing badly on new data.
• This is the phenomenon of overfitting.
• Using the same data for training a model and assessing it is typically a
wrong procedure, since it returns an over-optimistic assessment of the
model’s generalization capability.
Computational Intelligence Methods for Prediction – p. 37/83
Bias and variance of a model
A result of estimation theory shows that the mean squared error, i.e. a
measure of the generalization quality of an estimator, can be decomposed
into three terms:
MSE = σ²w + squared bias + variance
where the intrinsic noise term reflects the target alone, the bias reflects the
target’s relation with the learning algorithm and the variance term reflects the
learning algorithm alone.
This result is theoretical since these quantities cannot be measured on the
basis of a finite amount of data. However, it provides insight into
what makes a learning process accurate.
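Although these terms cannot be measured on real data, they can be estimated in a simulation where the true f is known. Below is a hedged Monte Carlo sketch (the target function, noise level and polynomial learners are assumptions): many training sets are drawn, a model is fitted on each, and squared bias and variance of the predictions are averaged over a grid of test inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                    # assumed (known) regression function
sigma_w, N, runs = 0.3, 30, 300
x_test = np.linspace(-2, 2, 50)

for degree in (1, 3, 9):
    preds = np.empty((runs, x_test.size))
    for r in range(runs):                      # draw many training sets D_N
        x_tr = rng.uniform(-2, 2, N)
        y_tr = f(x_tr) + rng.normal(0, sigma_w, N)
        coeffs = np.polyfit(x_tr, y_tr, degree)
        preds[r] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```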
Computational Intelligence Methods for Prediction – p. 38/83
The bias/variance trade-off
• The first term is the variance of y around its true mean f(x) and cannot
be avoided no matter how well we estimate f(x), unless σ²w = 0.
• The bias measures the difference in x between the average of the
outputs of the hypothesis functions over the set of possible training sets DN and the
regression function value f(x).
• The variance reflects the variability of the guessed h(x, αN ) as one
varies over training sets of fixed dimension N. This quantity measures
how sensitive the algorithm is to changes in the data set, regardless of
the target.
Computational Intelligence Methods for Prediction – p. 39/83
The bias/variance dilemma
• The designer of a learning machine has no access to the term MSE but
can only estimate it on the basis of the training set. Hence, the
bias/variance decomposition is relevant in practical learning since it
provides a useful hint about the features to control in order to make the
error MSE small.
• The bias term measures the lack of representational power of the class
of hypotheses. To reduce the bias term we should consider complex
hypotheses which can approximate a large number of input/output
mappings.
• The variance term warns us against an excessive complexity of the
approximator. This means that a class of too powerful hypotheses runs
the risk of being excessively sensitive to the noise affecting the training
set; therefore, our class could contain the target but it could be
practically impossible to find it out on the basis of the available dataset.
Computational Intelligence Methods for Prediction – p. 40/83
• In other terms, it is commonly said that an hypothesis with large bias but
low variance underfits the data while an hypothesis with low bias but
large variance overfits the data.
• In both cases, the hypothesis gives a poor representation of the target
and a reasonable trade-off needs to be found.
• The task of the model designer is to search for the optimal trade-off
between the variance and the bias term, on the basis of the available
training set.
Computational Intelligence Methods for Prediction – p. 41/83
Bias/variance trade-off
[Plot: generalization error vs. model complexity; bias decreases and variance increases with complexity, with underfitting at low complexity and overfitting at high complexity.]
Computational Intelligence Methods for Prediction – p. 42/83
The learning procedure
A learning procedure aims at two main goals:
1. to choose a parametric family of hypotheses h(x, α) which contains or
gives a good approximation of the unknown function f (structural
identification).
2. within the family h(x, α), to estimate on the basis of the training set DN
the parameter αN which best approximates f (parametric identification).
In order to accomplish that, a learning procedure is made of two nested
loops:
1. an external structural identification loop which goes through different
model structures
2. an inner parametric identification loop which searches for the best
parameter vector within the family structure.
Computational Intelligence Methods for Prediction – p. 43/83
Parametric identification
The parametric identification of the hypothesis is done according to ERM
(Empirical Risk Minimization) principle where
αN = α(DN ) = arg min_{α∈Λ} MISE_emp(α)
minimizes the empirical risk or training error
MISE_emp(α) = (1/N) Σ_{i=1}^{N} (yi − h(xi, α))²
constructed on the basis of the data set DN .
Computational Intelligence Methods for Prediction – p. 44/83
Parametric identification (II)
• The computation of αN requires a procedure of multivariate optimization
in the space of parameters.
• The complexity of the optimization depends on the form of h(·).
• In some cases the parametric identification problem may be an NP-hard
problem.
• Thus, we must resort to some form of heuristic search.
• Examples of parametric identification procedure are linear least-squares
for linear models and backpropagated gradient-descent for feedforward
neural networks.
Computational Intelligence Methods for Prediction – p. 45/83
Validation techniques
How can we measure MISE in a reliable way on a finite dataset? The most
common techniques to return an estimate of MISE are
Testing: a testing sequence independent of DN and distributed according to
the same probability distribution is used to assess the quality. In
practice, unfortunately, an additional set of input/output observations is
rarely available.
Holdout: The holdout method, sometimes called test sample estimation,
partitions the data DN into two mutually exclusive subsets, the training
set Dtr and the holdout or test set Dts.
k-fold Cross-validation: the set DN is randomly divided into k mutually
exclusive test partitions of approximately equal size. The cases not
found in each test partition are independently used for selecting the
hypothesis which will be tested on the partition itself. The average error
over all the k partitions is the cross-validated error rate.
Computational Intelligence Methods for Prediction – p. 46/83
The K-fold cross-validation
This is the algorithm in detail:
1. split the dataset DN into K roughly equal-sized parts.
2. For the kth part, k = 1, . . . , K, fit the model to the other K − 1 parts of the
data, and calculate the prediction error of the fitted model when
predicting the kth part of the data.
3. Do the above for k = 1, . . . , K and average the K estimates of prediction
error.
Let k(i) be the part of DN containing the ith sample. Then the
cross-validation estimate of the MISE prediction error is
MISE_CV = (1/N) Σ_{i=1}^{N} (yi − ŷi^{−k(i)})² = (1/N) Σ_{i=1}^{N} (yi − h(xi, α^{−k(i)}))²
where ŷi^{−k(i)} denotes the fitted value for the ith observation returned by the
model estimated with the k(i)th part of the data removed.
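A minimal numpy sketch of the K-fold procedure and of the MISE_CV estimate above, with a least-squares polynomial standing in for h(·, α) (the model family, the synthetic data and K = 10 are assumptions).

```python
import numpy as np

def kfold_mise(x, y, degree, K=10, seed=0):
    """K-fold cross-validation estimate of the prediction error for a polynomial fit."""
    N = len(x)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)                      # K roughly equal-sized parts
    sq_err = np.empty(N)
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)   # fit on the other K-1 parts
        sq_err[test_idx] = (y[test_idx] - np.polyval(coeffs, x[test_idx])) ** 2
    return sq_err.mean()                                # average squared error over all N points

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 200)
y = np.sin(2 * x) + rng.normal(0, 0.3, 200)
print({d: round(kfold_mise(x, y, d), 4) for d in (1, 3, 9)})
```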
Computational Intelligence Methods for Prediction – p. 47/83
10-fold cross-validation
K = 10: at each iteration 90% of data are used for training and the remaining
10% for the test.
Computational Intelligence Methods for Prediction – p. 48/83
Leave-one-out cross validation
• The cross-validation algorithm where K = N is also called the
leave-one-out algorithm.
• This means that for each ith sample, i = 1, . . . , N,
1. we carry out the parametric identification, leaving that observation
out of the training set,
2. we compute the predicted value for the ith observation, denoted by ŷi^{−i}.
The corresponding estimate of the MISE prediction error is
MISE_LOO = (1/N) Σ_{i=1}^{N} (yi − ŷi^{−i})² = (1/N) Σ_{i=1}^{N} (yi − h(xi, α^{−i}))²
where α^{−i} is the set of parameters returned by the parametric identification
performed on the training set with the ith sample set aside.
Computational Intelligence Methods for Prediction – p. 49/83
Model selection
• Model selection concerns the final choice of the model structure in the
set that has been proposed by model generation and assessed by
model validation.
• In real problems, this choice is typically a subjective issue and is often
the result of a compromise between different factors, like the quantitative
measures, the personal experience of the designer and the effort
required to implement a particular model in practice.
• Here we will consider only quantitative criteria. There are two possible
approaches:
1. the winner-takes-all approach
2. the combination of estimators approach.
Computational Intelligence Methods for Prediction – p. 50/83
Model selection
[Diagram: from a realization of the stochastic process, a training set of size N is built; for each class of hypotheses Λ1, Λ2, . . . , ΛS a parametric identification returns the parameters αN^1, . . . , αN^S, validation produces the assessments GN^1, . . . , GN^S, and structural identification (model selection) picks the final α.]
Computational Intelligence Methods for Prediction – p. 51/83
Winner-takes-all
The best hypothesis is selected in the set {αN^s}, with s = 1, . . . , S, according
to
s̃ = arg min_{s=1,...,S} MISE_s
A model with complexity s̃ is trained on the whole dataset DN and used for
future predictions.
Computational Intelligence Methods for Prediction – p. 52/83
Winner-takes-all pseudo-code
1. for s = 1, . . . , S: (structural loop)
   • for j = 1, . . . , N
     (a) Inner parametric identification (for l-o-o):
         α_{N−1}^s = arg min_{α∈Λs} Σ_{i=1:N, i≠j} (yi − h(xi, α))²
     (b) ej = yj − h(xj, α_{N−1}^s)
   • MISE_LOO(s) = (1/N) Σ_{j=1}^{N} ej²
2. Model selection: s̃ = arg min_{s=1,...,S} MISE_LOO(s)
3. Final parametric identification: α_N^{s̃} = arg min_{α∈Λs̃} Σ_{i=1}^{N} (yi − h(xi, α))²
4. The output prediction model is h(·, α_N^{s̃})
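A runnable sketch of the pseudo-code above, where the structures Λs are polynomial degrees and h(·, α) a least-squares polynomial (both assumptions): for each degree the leave-one-out error is computed explicitly, the degree with the smallest MISE_LOO wins, and the winning structure is refitted on the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 60)
y = np.sin(2 * x) + rng.normal(0, 0.3, x.size)
degrees = (1, 2, 3, 5, 9)                       # the candidate structures Lambda_s

def loo_mise(x, y, degree):
    errors = []
    for j in range(len(x)):                     # leave sample j out, fit, predict it
        mask = np.arange(len(x)) != j
        coeffs = np.polyfit(x[mask], y[mask], degree)
        errors.append((y[j] - np.polyval(coeffs, x[j])) ** 2)
    return np.mean(errors)

scores = {s: loo_mise(x, y, s) for s in degrees}
best = min(scores, key=scores.get)              # winner-takes-all selection
final_model = np.polyfit(x, y, best)            # final parametric identification on all data
print(best, {s: round(v, 3) for s, v in scores.items()})
```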
Computational Intelligence Methods for Prediction – p. 53/83
Model combination
• The winner-takes-all approach is intuitively the approach which should
work the best.
• However, recent results in machine learning show that the performance
of the final model can be improved not by choosing the model structure
which is expected to predict the best but by creating a model whose
output is the combination of the output of models having different
structures.
• The reason is that in reality any chosen hypothesis h(·, αN ) is only an
estimate of the real target and, like any estimate, is affected by a bias
and a variance term.
• Theoretical results on the combination of estimators show that the
combination of unbiased estimators leads to an unbiased estimator with
reduced variance.
• This principle is at the basis of approaches like bagging or boosting.
Computational Intelligence Methods for Prediction – p. 54/83
Local modeling procedure
The learning of a local model in a query point xq ∈ R^n
can be summarized in these steps:
1. Compute the distance between the query xq and the training samples
according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the k nearest neighbors according to the bandwidth
which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear,...).
Each of the local approaches has one or more structural (or smoothing)
parameters that control the amount of smoothing performed.
Let us focus on the bandwidth selection.
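A minimal sketch of the four steps for a local constant model (a locally weighted linear fit would follow the same skeleton); the Euclidean metric and the bandwidth expressed as a number of neighbours k are assumptions.

```python
import numpy as np

def local_constant_predict(X_train, y_train, x_query, k):
    """Predict at x_query by averaging the outputs of its k nearest neighbours."""
    dist = np.linalg.norm(X_train - x_query, axis=1)   # 1. distances to the query
    order = np.argsort(dist)                           # 2. rank the neighbours
    neighbours = order[:k]                             # 3. keep the k nearest (bandwidth)
    return y_train[neighbours].mean()                  # 4. fit a local constant model

rng = np.random.default_rng(4)
X = rng.uniform(-2, 2, (200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)
print(local_constant_predict(X, y, np.array([0.5, -0.5]), k=15))
```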
Computational Intelligence Methods for Prediction – p. 55/83
The bandwidth trade-off: overfit
[Figure: local fits around a query point q with a very narrow bandwidth; the local model chases the noisy samples and the prediction error e at the query is large.]
Too narrow bandwidth ⇒ overfitting ⇒ large prediction error e.
In terms of bias/variance trade-off, this is typically a situation of high variance.
Computational Intelligence Methods for Prediction – p. 56/83
The bandwidth trade-off: underfit
[Figure: local fits around a query point q with a very large bandwidth; the local model smooths out the underlying structure and the prediction error e at the query is large.]
Too large bandwidth ⇒ underfitting ⇒ large prediction error e
In terms of bias/variance trade-off, this is typically a situation of high bias.
Computational Intelligence Methods for Prediction – p. 57/83
Bandwidth and bias/variance trade-off
[Plot: mean squared error vs. 1/bandwidth; bias grows with the number of neighbours (underfitting on the many-neighbours side) while variance grows as the neighbourhood shrinks (overfitting on the few-neighbours side).]
Computational Intelligence Methods for Prediction – p. 58/83
The PRESS statistic
• Cross-validation can provide a reliable estimate of the algorithm
generalization error but it requires the training process to be repeated K
times, which sometimes means a large computational effort.
• In the case of linear models there exists a powerful statistical procedure
to compute the leave-one-out cross-validation measure at a reduced
computational cost
• It is the PRESS (Prediction Sum of Squares) statistic, a simple formula
which returns the leave-one-out (l-o-o) as a by-product of the
least-squares.
Computational Intelligence Methods for Prediction – p. 59/83
Leave-one-out for linear models
[Diagram: two equivalent routes from the training set to the leave-one-out error: repeating N times the put-aside / train-on-N−1 / test cycle, or performing a single parametric identification on the N samples followed by the PRESS statistic.]
The leave-one-out error can be computed in two equivalent ways: the slowest
way (on the right) which repeats N times the training and the test procedure;
the fastest way (on the left) which performs only once the parametric
identification and the computation of the PRESS statistic.
Computational Intelligence Methods for Prediction – p. 60/83
The PRESS statistic
• This allows a fast cross-validation without repeating N times the
leave-one-out procedure. The PRESS procedure can be described as
follows:
1. we use the whole training set to estimate the linear regression
coefficients
β̂ = (XᵀX)⁻¹ Xᵀ Y
2. this procedure is performed only once on the N samples and
returns as a by-product the Hat matrix
H = X (XᵀX)⁻¹ Xᵀ
3. we compute the residual vector e, whose jth term is ej = yj − xjᵀ β̂,
4. we use the PRESS statistic to compute ej^loo as
ej^loo = ej / (1 − Hjj)
where Hjj is the jth diagonal term of the matrix H.
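A minimal numpy sketch of the four steps above, which also verifies that the PRESS residuals coincide with those obtained by explicitly refitting N times (the linear model, the synthetic data and the absence of an intercept column are assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, N)

beta = np.linalg.solve(X.T @ X, X.T @ Y)        # 1. least-squares fit on all N samples
H = X @ np.linalg.solve(X.T @ X, X.T)           # 2. hat matrix H = X (X'X)^-1 X'
e = Y - X @ beta                                # 3. ordinary residuals
e_loo = e / (1 - np.diag(H))                    # 4. PRESS leave-one-out residuals

# Brute-force check: refit N times with one sample left out.
e_check = np.array([
    Y[j] - X[j] @ np.linalg.solve(np.delete(X, j, 0).T @ np.delete(X, j, 0),
                                  np.delete(X, j, 0).T @ np.delete(Y, j, 0))
    for j in range(N)
])
print(np.allclose(e_loo, e_check))              # True: PRESS is exact, not an approximation
```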
Computational Intelligence Methods for Prediction – p. 61/83
The PRESS statistic
Thus, the leave-one-out estimate of the local mean integrated squared error
is:
MISE_LOO = (1/N) Σ_{i=1}^{N} [ (yi − ŷi) / (1 − Hii) ]²
Note that PRESS is not an approximation of the loo error but simply a faster
way of computing it.
Computational Intelligence Methods for Prediction – p. 62/83
Selection of the number of neighbours
• For a given query point xq, we can compute a set of predictions
ŷq(k) = xqᵀ β̂(k), together with a set of associated leave-one-out error
measures MISE_LOO(k), for a number of neighbors ranging in [kmin, kmax].
• If the selection paradigm, frequently called winner-takes-all, is adopted,
the most natural way to extract a final prediction ŷq consists in
comparing the predictions obtained for each value of k on the basis of the
classical mean square error criterion:
ŷq = xqᵀ β̂(k̂), with k̂ = arg min_k MISE_LOO(k)
Computational Intelligence Methods for Prediction – p. 63/83
Local Model combination
• As an alternative to the winner-takes-all paradigm, we can use a
combination of estimates.
• The final prediction of the value yq is obtained as a weighted average of
the best b models, where b is a parameter of the algorithm.
• Suppose the predictions ŷq(k) and the loo errors MISE_LOO(k) have been
ordered creating a sequence of integers {ki} so that
MISE_LOO(ki) ≤ MISE_LOO(kj), ∀i < j. The prediction ŷq is given by
ŷq = Σ_{i=1}^{b} ζi ŷq(ki) / Σ_{i=1}^{b} ζi,
where the weights are the inverse of the mean square errors:
ζi = 1/MISE_LOO(ki).
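A minimal sketch of the combination rule above, assuming the candidate predictions ŷq(k) and their leave-one-out errors have already been computed for a range of k (the numbers below are purely hypothetical).

```python
import numpy as np

def combine_predictions(y_hats, mise_loo, b):
    """Weighted average of the b best local models, with weights 1 / MISE_LOO."""
    y_hats, mise_loo = np.asarray(y_hats, float), np.asarray(mise_loo, float)
    best = np.argsort(mise_loo)[:b]              # order by increasing loo error, keep the b best
    zeta = 1.0 / mise_loo[best]
    return np.sum(zeta * y_hats[best]) / np.sum(zeta)

# Hypothetical predictions for k = 3, ..., 9 and their loo errors.
print(combine_predictions([2.1, 2.0, 1.9, 2.2, 2.4, 2.3, 2.5],
                          [0.9, 0.5, 0.4, 0.6, 1.1, 1.3, 1.5], b=3))
```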
Computational Intelligence Methods for Prediction – p. 64/83
One step-ahead and iterated prediction
• Once a model of the embedding mapping is available, it can be used for
two objectives: one-step-ahead prediction and iterated prediction.
• In one-step-ahead prediction, the n previous values of the series are
available and the forecasting problem can be cast in the form of a
generic regression problem
• In the literature a number of supervised learning approaches have been
used with success to perform one-step-ahead forecasting on the basis
of historical data.
Computational Intelligence Methods for Prediction – p. 65/83
One step-ahead prediction
[Diagram: the n previous values ϕt−1, ϕt−2, . . . , ϕt−n, obtained through a chain of unit-delay operators z⁻¹, are fed to the approximator f̂.]
The approximator f̂ returns the prediction of the value of the time series at
time t + 1 as a function of the n previous values (the rectangular box
containing z⁻¹ represents a unit delay operator, i.e., ϕt−1 = z⁻¹ ϕt).
Computational Intelligence Methods for Prediction – p. 66/83
Multi-step ahead prediction
• The prediction of the value of a time series H > 1 steps ahead is called
H-step-ahead prediction.
• We classify the methods for H-step-ahead prediction according to two
features: the horizon of the training criterion and the single-output or
multi-output nature of the predictor.
Computational Intelligence Methods for Prediction – p. 67/83
Multi-step ahead prediction strategies
The most common strategies are
1. Iterated: the model predicts h steps ahead by iterating a one-step-ahead
predictor whose parameters are optimized to minimize the training error
on one-step-ahead forecast (one-step-ahead training criterion).
2. Iterated strategy where parameters are optimized to minimize the
training error on the iterated htr-step-ahead forecast (htr-step-ahead
training criterion) where 1 < htr ≤ H.
3. Direct: the model makes a direct forecast at time t + h − 1, h = 1, . . . , H
by modeling the time series in a multi-input single-output form
4. DirRec: direct forecast but the input vector is extended at each step with
predicted values.
5. MIMO: the model returns a vectorial forecast by modeling the time
series in a multi-input multi-output form
Computational Intelligence Methods for Prediction – p. 68/83
Iterated (or recursive) prediction
• In the case of iterated prediction, the predicted output is fed back as
input for the next prediction.
• Here, the inputs consist of predicted values as opposed to actual
observations of the original time series.
• As the feedback values are typically distorted by the errors made by the
predictor in previous steps, the iterative procedure may produce
undesired effects of accumulation of the error.
• Low performance is expected in long horizon tasks. This is due to the
fact that they are essentially models tuned with a one-step-ahead
criterion which is not capable of taking temporal behavior into account.
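A minimal sketch of the recursive scheme: a one-step model (here a linear AR predictor fitted by least squares, an assumption) is applied H times, each time feeding its own prediction back into the input window.

```python
import numpy as np

def fit_one_step(series, n):
    """Least-squares one-step-ahead linear predictor on an embedding of order n."""
    X = np.array([series[t - n:t][::-1] for t in range(n, len(series))])
    y = series[n:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def iterated_forecast(series, coeffs, H):
    n = len(coeffs)
    window = list(series[-n:][::-1])            # most recent value first
    forecasts = []
    for _ in range(H):
        y_hat = float(np.dot(coeffs, window))   # one-step prediction
        forecasts.append(y_hat)
        window = [y_hat] + window[:-1]          # feed the prediction back as an input
    return forecasts

s = np.sin(np.arange(300) / 8.0)
coeffs = fit_one_step(s, n=4)
print(iterated_forecast(s, coeffs, H=5))
```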
Computational Intelligence Methods for Prediction – p. 69/83
Iterated prediction
[Diagram: the predicted values ϕ̂t−1, ϕ̂t−2, . . . , ϕ̂t−n are fed back through unit-delay operators z⁻¹ as inputs to the approximator f̂.]
The approximator f̂ returns the prediction of the value of the time series at
time t + 1 by iterating the predictions obtained in the previous steps (the
rectangular box containing z⁻¹ represents a unit delay operator, i.e.,
ϕ̂t−1 = z⁻¹ ϕ̂t).
Computational Intelligence Methods for Prediction – p. 70/83
Iterated with h-step training criterion
• This strategy adopts one-step-ahead predictors but adapts the model
selection criterion in order to take into account the multi-step-ahead
objective.
• Methods like Recurrent Neural Networks belong to such class. Their
recurrent architecture and the associated training algorithm (temporal
backpropagation) are suitable to handle the time-dependent nature of
the data.
• In [?] we proposed an adaptation of the Lazy Learning algorithm where
the number of neighbors is optimized in order to minimize the
leave-one-out error over an horizon larger than one. This technique
ranked second in the 1998 KULeuven Time Series competition.
• A similar technique has been proposed by [?] who won the competition.
Computational Intelligence Methods for Prediction – p. 71/83
Direct strategy
• The Direct strategy [?, ?, ?] learns independently H models fh
ϕt+h = fh(ϕt, . . . , ϕt−n+1) + wt+h
with t ∈ {n, . . . , N − H} and h ∈ {1, . . . , H} and returns a multi-step
forecast by concatenating the H predictions.
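A minimal sketch of the Direct strategy under the same linear-model assumption as above: one least-squares model per horizon h, each trained to map the window (ϕt, . . . , ϕt−n+1) directly to ϕt+h.

```python
import numpy as np

def fit_direct(series, n, H):
    """Fit H independent least-squares models, one per forecasting horizon."""
    models = []
    for h in range(1, H + 1):
        X = np.array([series[t - n + 1:t + 1][::-1] for t in range(n - 1, len(series) - h)])
        y = series[n - 1 + h:]                  # target h steps ahead of each window
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        models.append(coeffs)
    return models

def direct_forecast(series, models, n):
    window = series[-n:][::-1]                  # (phi_t, ..., phi_{t-n+1})
    return [float(np.dot(c, window)) for c in models]

s = np.sin(np.arange(300) / 8.0)
models = fit_direct(s, n=4, H=5)
print(direct_forecast(s, models, n=4))
```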
Computational Intelligence Methods for Prediction – p. 72/83
Direct strategy
• Since the Direct strategy does not use any approximated values to
compute the forecasts, it is not prone to accumulation of errors:
each model is tailored to the horizon it is supposed to predict.
Notwithstanding, it has some weaknesses.
• Since the H models are learned independently, no statistical
dependency between the predictions ŷN+h [?, ?, ?] is considered.
• Direct methods often require higher functional complexity [?] than
iterated ones in order to model the stochastic dependency between two
series values at two distant instants [?].
• This strategy demands a large computational time since the number of
models to learn is equal to the size of the horizon.
• Different machine learning models have been used to implement the
Direct strategy for multi-step forecasting tasks, for instance neural
networks [?], nearest neighbors [?] and decision trees [?].
Computational Intelligence Methods for Prediction – p. 73/83
DirRec strategy
• The DirRec strategy [?] combines the architectures and the principles
underlying the Direct and the Recursive strategies.
• DirRec computes the forecasts with different models for every
horizon (like the Direct strategy) and, at each time step, it enlarges the
set of inputs by adding variables corresponding to the forecasts of the
previous step (like the Recursive strategy).
• Unlike the previous strategies, the embedding size n is not the same for
all the horizons. In other terms, the DirRec strategy learns H models fh
from the time series [y1, . . . , yN ] where
yt+h = fh(yt+h−1, . . . , yt−n+1) + wt+h
with t ∈ {n, . . . , N − H} and h ∈ {1, . . . , H}.
• The technique is prone to the curse of dimensionality. The use of feature
selection is recommended for large h.
Computational Intelligence Methods for Prediction – p. 74/83
MIMO strategy
• This strategy [?, ?] (also known as Joint strategy [?]) avoids the simplistic
assumption of conditional independence between future values made by
the Direct strategy [?, ?] by learning a single multiple-output model
[yt+H , . . . , yt+1] = F(yt, . . . , yt−n+1) + w
where t ∈ {n, . . . , N − H}, F : R^n → R^H is a vector-valued function [?],
and w ∈ R^H is a noise vector with a covariance that is not necessarily
diagonal [?].
• The forecasts are returned in one step by a multiple-output model F̂
where
[ŷt+H , . . . , ŷt+1] = F̂(yN , . . . , yN−n+1)
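A minimal sketch of the MIMO idea with a multi-output linear least-squares model (the linear F and the toy series are assumptions): a single model maps the input window to all H future values at once, so the H outputs share the same fitted structure.

```python
import numpy as np

def fit_mimo(series, n, H):
    """Single multi-output least-squares model mapping n past values to H future values."""
    rows = range(n - 1, len(series) - H)
    X = np.array([series[t - n + 1:t + 1][::-1] for t in rows])      # (y_t, ..., y_{t-n+1})
    Y = np.array([series[t + 1:t + H + 1] for t in rows])            # (y_{t+1}, ..., y_{t+H})
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)                        # one coefficient matrix
    return B

s = np.sin(np.arange(300) / 8.0)
B = fit_mimo(s, n=4, H=5)
window = s[-4:][::-1]
print(window @ B)        # joint forecast of the next 5 values
```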
Computational Intelligence Methods for Prediction – p. 75/83
MIMO strategy
• The rationale of the MIMO strategy is to model the stochastic
dependency between the predicted values, i.e. the dependency characterizing
the time series. This strategy avoids the conditional independence assumption
made by the Direct strategy as well as the accumulation of errors which plagues
the Recursive strategy. So far, this strategy has been successfully applied to
several real-world multi-step time series forecasting tasks [?, ?, ?, ?].
• However, the wish to preserve the stochastic dependencies constrains
all the horizons to be forecasted with the same model structure. Since
this constraint could reduce the flexibility of the forecasting approach [?],
a variant of the MIMO strategy has been proposed in [?, ?] .
Computational Intelligence Methods for Prediction – p. 76/83
Validation of time series methods
• The huge variety of strategies and algorithms that can be used to infer a
predictor from observed data calls for a rigorous procedure of
comparison and assessment.
• Assessment demands benchmarks and a benchmarking procedure.
• Benchmarks can be defined by using
• Simulated data obtained by simulating AR, NAR and other stochastic
processes. This is particularly useful for validating theoretical
properties in terms of bias/variance.
• Public domain benchmarks, like the one provided by Time Series
Competitions.
• Real measured data
Computational Intelligence Methods for Prediction – p. 77/83
Competitions
• Santa Fe Time Series Prediction and Analysis Competition (1994) [?]:
• International Workshop on Advanced Black-box techniques for nonlinear
modeling Competition (Leuven, Belgium; 1998)
• NN3 competition [?]: 111 monthly time series drawn from a homogeneous
population of empirical business time series.
• NN5 competition [?]: 111 time series of daily withdrawal amounts
from independent cash machines at different, randomly selected
locations across England.
Computational Intelligence Methods for Prediction – p. 78/83
MLG projects on forecasting
Wireless sensor
Anesthesia
Car market prediction
Computational Intelligence Methods for Prediction – p. 79/83
Open-source software
Computational Intelligence Methods for Prediction – p. 80/83
All that we didn’t have time to discuss
Computational Intelligence Methods for Prediction – p. 81/83
Perspectives
Computational Intelligence Methods for Prediction – p. 82/83
Suggestions for PhD topics
Computational Intelligence Methods for Prediction – p. 83/83
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
 

Recently uploaded

DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptTanveerAhmed817946
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证zifhagzkk
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjadimosmejiaslendon
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...Amara arora$V15
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444saurabvyas476
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证pwgnohujw
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?RemarkSemacio
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Klinik Aborsi
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...Bertram Ludäscher
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeBoston Institute of Analytics
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样jk0tkvfv
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIf6x4zqzk86
 

Recently uploaded (20)

DS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .pptDS Lecture-1 about discrete structure .ppt
DS Lecture-1 about discrete structure .ppt
 
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get CytotecAbortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
Abortion pills in Doha {{ QATAR }} +966572737505) Get Cytotec
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
ℂall Girls In Navi Mumbai Hire Me Neha 9910780858 Top Class ℂall Girl Serviℂe...
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Pentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AIPentesting_AI and security challenges of AI
Pentesting_AI and security challenges of AI
 

Computational Intelligence for Time Series Prediction

  • 1. Computational Intelligence Methods for Time Series Prediction Second European Business Intelligence Summer School (eBISS 2012) Gianluca Bontempi Département d’Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Computational Intelligence Methods for Prediction – p. 1/83
  • 2. Outline • Introduction (5 mins) • Notions of time series (30 mins) • Machine learning for prediction (60 mins) • bias/variance • parametric and structural identification • validation • model selection • feature selection • Local learning • Forecasting: one-step and multi-step-ahead (30 mins) • Some applications (30 mins) • time series competitions • chaotic time series • wireless sensor • biomedical • Future directions and perspectives (15 mins) Computational Intelligence Methods for Prediction – p. 2/83
  • 3. ULB Machine Learning Group (MLG) • 8 researchers (2 prof, 5 PhD students, 4 postdocs). • Research topics: Knowledge discovery from data, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks, Bioinformatics, Network inference. • Computing facilities: high-performing cluster for analysis of massive datasets, Wireless Sensor Lab. • Website: mlg.ulb.ac.be. • Scientific collaborations in ULB: Bioinformatique des génomes et des réseaux (IBMM), CENOLI (Sciences), Microarray Unit (Hopital Jules Bordet), Laboratoire de Médecine experimentale, Laboratoire d’Anatomie, Biomécanique et Organogénèse (LABO), Service d’Anesthesie (ERASME). • Scientific collaborations outside ULB: Harvard Dana Farber (US), UCL Machine Learning Group (B), Politecnico di Milano (I), Universitá del Sannio (I), Helsinki Institute of Technology (FIN). Computational Intelligence Methods for Prediction – p. 3/83
  • 4. ULB-MLG: recent projects 1. Adaptive real-time machine learning for credit card fraud detection 2. Discovery of the molecular pathways regulating pancreatic beta cell dysfunction and apoptosis in diabetes using functional genomics and bioinformatics: ARC (2010-2015) 3. ICT4REHAB - Advanced ICT Platform for Rehabilitation (2011-2013) 4. Integrating experimental and theoretical approaches to decipher the molecular networks of nitrogen utilisation in yeast: ARC (2006-2010). 5. TANIA - Système d’aide à la conduite de l’anesthésie. WALEO II project funded by the Région Wallonne (2006-2010) 6. "COMP2 SYS" (COMPutational intelligence methods for COMPlex SYStems) MARIE CURIE Early Stage Research Training funded by the EU (2004-2008). Computational Intelligence Methods for Prediction – p. 4/83
  • 5. Time series Definition A time series is a sequence of observations, usually ordered in time. Examples of time series • Weather variables, like temperature, pressure • Economic factors. • Traffic. • Activity of business. • Electric load, power consumption. • Financial index. • Voltage. Computational Intelligence Methods for Prediction – p. 5/83
  • 6. Why studying time series? There are various reasons: Prediction of the future based on the past. Control of the process producing the series. Understanding of the mechanism generating the series. Description of the salient features of the series. Computational Intelligence Methods for Prediction – p. 6/83
  • 7. Univariate discrete time series • Quantities, like temperature and voltage, change in a continuous way. • In practice, however, the digital recording is made discretely in time. • We shall confine ourselves to discrete time series (which however take continuous values). • Moreover we will consider univariate time series, where one type of measurement is made repeatedly on the same object or individual. Computational Intelligence Methods for Prediction – p. 7/83
  • 8. A general model Let an observed discrete univariate time series be y1, . . . , yT . This means that we have T numbers which are observations on some variable made at T equally distant time points, which for convenience we label 1, 2, . . . , T. A fairly general model for the time series can be written yt = g(t) + ϕt, t = 1, . . . , T. The observed series is made of two components: Systematic part: g(t), also called signal or trend, which is a deterministic function of time; Stochastic sequence: a residual term ϕt, also called noise, which follows a probability law. Computational Intelligence Methods for Prediction – p. 8/83
  • 9. Types of variation Traditional methods of time-series analysis are mainly concerned with decomposing the variation of a series yt into: Trend: this is a long-term change in the mean level, e.g. an increasing trend. Seasonal effect: many time series (sales figures, temperature readings) exhibit variation which is seasonal (e.g. annual) in period. Measuring and removing such variation yields deseasonalized data. Irregular fluctuations: after trend and cyclic variations have been removed from a set of data, we are left with a series of residuals, which may or may not be completely random. We will assume here that once we have detrended and deseasonalized the series, we can still extract information about the dependency between the past and the future. Henceforth ϕt will denote the detrended and deseasonalized series. Computational Intelligence Methods for Prediction – p. 9/83
  • 10. Decomposition of an additive time series into observed, trend, seasonal and random components; decomposition returned by the R package forecast. Computational Intelligence Methods for Prediction – p. 10/83
  • 11. Probability and dependency • Forecasting a time series is possible since the future depends on the past or, analogously, because there is a relationship between the future and the past. However this relation is not deterministic and can hardly be written in an analytical form. • An effective way to describe a nondeterministic relation between two variables is provided by the probability formalism. • Consider two continuous random variables ϕ1 and ϕ2 representing for instance the temperature today and tomorrow. We tend to believe that ϕ1 could be used as a predictor of ϕ2 with some degree of uncertainty. • The stochastic dependency between ϕ1 and ϕ2 is summarized by the joint density p(ϕ1, ϕ2) or equivalently by the conditional probability p(ϕ2|ϕ1) = p(ϕ1, ϕ2)/p(ϕ1) • If p(ϕ2|ϕ1) ≠ p(ϕ2) then ϕ1 and ϕ2 are not independent or, equivalently, the knowledge of the value of ϕ1 reduces the uncertainty about ϕ2. Computational Intelligence Methods for Prediction – p. 11/83
  • 12. Stochastic processes • A discrete-time stochastic process is a collection of random variables ϕt, t = 1, . . . , T defined by a joint density p(ϕ1, . . . , ϕT ) • Statistical time-series analysis is concerned with evaluating the properties of the probability model which generated the observed time series. • Statistical time-series modeling is concerned with inferring the properties of the probability model which generated the observed time series from a limited set of observations. Computational Intelligence Methods for Prediction – p. 12/83
  • 13. Strictly stationary processes • Definition A stochastic process is said to be strictly stationary if the joint distribution of ϕt1 , ϕt2 , . . . , ϕtn is the same as the joint distribution of ϕt1+k, ϕt2+k, . . . , ϕtn+k for all n, t1, . . . , tn and k. • In other words, shifting the time origin by an amount k has no effect on the joint distribution, which depends only on the intervals between t1, . . . , tn. • This implies that the distribution of ϕt is the same for all t. • The definition holds for any value of n. • Let us see what it means in practice for n = 1 and n = 2. Computational Intelligence Methods for Prediction – p. 13/83
  • 14. Properties n=1 : If ϕt is strictly stationary and its first two moments are finite, we have µt = µ, σ2 t = σ2. n=2 : Furthermore the autocovariance function γ(t1, t2) depends only on the lag k = t2 − t1 and may be written as γ(k) = Cov[ϕt, ϕt+k]. In order to avoid scaling effects, it is useful to introduce the autocorrelation function ρ(k) = γ(k)/σ2 = γ(k)/γ(0). Computational Intelligence Methods for Prediction – p. 14/83
  • 15. Weak stationarity • A less restrictive definition of stationarity concerns only the first two moments of ϕt Definition A process is called second-order stationary or weakly stationary if its mean is constant and its autocovariance function depends only on the lag. • No assumptions are made about moments higher than the second order. • Strict stationarity implies weak stationarity. • In the special case of normal processes, weak stationarity implies strict stationarity. Definition A process is called normal if the joint distribution of ϕt1 , ϕt2 , . . . , ϕtn is multivariate normal for all t1, . . . , tn. Computational Intelligence Methods for Prediction – p. 15/83
  • 16. Purely random processes • It consists of a sequence of random variables ϕt which are mutually independent and identically distributed. For each t and k p(ϕt+k|ϕt) = p(ϕt+k) • It follows that this process has constant mean and variance. Also γ(k) = Cov[ϕt, ϕt+k] = 0 for k = ±1, ±2, . . . . • A purely random process is strictly stationary. • A purely random process is sometimes called white noise particularly by engineers. Computational Intelligence Methods for Prediction – p. 16/83
  • 17. Example: Gaussian purely random (plot of a sample path of a Gaussian purely random process). Computational Intelligence Methods for Prediction – p. 17/83
  • 18. Example: Uniform purely random (plot of a sample path of a uniform purely random process). Computational Intelligence Methods for Prediction – p. 18/83
  • 19. Random walk • Suppose that wt is a discrete, purely random process with mean µ and variance σ2 w. • A process ϕt is said to be a random walk if ϕt = ϕt−1 + wt • If ϕ0 = 0 then ϕt = w1 + w2 + · · · + wt • E[ϕt] = tµ and Var [ϕt] = tσ2 w. • As the mean and variance change with t, the process is non-stationary. Computational Intelligence Methods for Prediction – p. 19/83
  • 20. Random walk (II) • The first differences of a random walk given by ∇ϕt = ϕt − ϕt−1 form a purely random process, which is stationary. • The best-known examples of time series which behave like random walks are share prices on successive days. Computational Intelligence Methods for Prediction – p. 20/83
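To see these properties numerically, here is a minimal Python/numpy sketch (an illustration of mine, not part of the slides): it simulates many independent random walks, checks that the variance across realizations grows roughly like t, and shows that the first differences recover the stationary purely random process.

import numpy as np

rng = np.random.default_rng(7)
n_walks, T = 2000, 500
w = rng.normal(0.0, 1.0, (n_walks, T))     # purely random increments w_t
phi = np.cumsum(w, axis=1)                 # each row: phi_t = phi_{t-1} + w_t

# across realizations, Var[phi_t] grows roughly like t * sigma_w^2 (non-stationary)
for t in (10, 100, 499):
    print(t + 1, phi[:, t].var())

# the first differences recover the stationary purely random process
diff = np.diff(phi, axis=1)
print(diff.mean(), diff.var())             # close to 0 and 1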
  • 21. Ten random walks Plot of ten random walks with w ∼ N(0, 1). Computational Intelligence Methods for Prediction – p. 21/83
  • 22. Autoregressive processes Suppose that wt is a purely random process with mean zero and variance σ2 w. A process ϕt is said to be an autoregressive process of order p (also an AR(p) process) if ϕt = α1ϕt−1 + · · · + αpϕt−p + wt Note that this is like a multiple regression model where ϕ is regressed not on independent variables but on its past values (hence the prefix “auto”). Computational Intelligence Methods for Prediction – p. 22/83
  • 23. First order AR(1) process If p = 1, we have the so-called Markov process AR(1) ϕt = αϕt−1 + wt By substitution it can be shown that ϕt = α(αϕt−2 + wt−1) + wt = α2 (αϕt−3 + wt−2) + αwt−1 + wt = · · · = wt + αwt−1 + α2 wt−2 + . . . Then E[ϕt] = 0 and Var [ϕt] = σ2 w(1 + α2 + α4 + . . . ). Then if |α| < 1 the variance is finite and equals Var [ϕt] = σ2 ϕ = σ2 w/(1 − α2 ) and the autocorrelation is ρ(k) = αk for k = 0, 1, 2, . . . Computational Intelligence Methods for Prediction – p. 23/83
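As a concrete illustration (a sketch of mine, not taken from the slides), the following numpy fragment simulates an AR(1) process and checks that the sample variance and autocorrelation are close to the theoretical values σ2 w/(1 − α2) and αk.

import numpy as np

rng = np.random.default_rng(0)
alpha, sigma_w, T = 0.8, 1.0, 20000

# simulate phi_t = alpha * phi_{t-1} + w_t
phi = np.zeros(T)
w = rng.normal(0.0, sigma_w, T)
for t in range(1, T):
    phi[t] = alpha * phi[t - 1] + w[t]

# sample variance vs. theoretical sigma_w^2 / (1 - alpha^2)
print(phi.var(), sigma_w**2 / (1 - alpha**2))

# sample autocorrelation rho(k) vs. theoretical alpha^k
for k in (1, 2, 5):
    rho_k = np.corrcoef(phi[:-k], phi[k:])[0, 1]
    print(k, rho_k, alpha**k)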
  • 24. General order AR(p) process We find again the duality between AR and infinite-order MA processes. By using the B operator, the AR(p) process is (1 − α1B − · · · − αpBp )ϕt = zt or equivalently ϕt = zt/(1 − α1B − · · · − αpBp ) = f(B)zt where f(B) = (1 − α1B − · · · − αpBp )−1 = (1 + β1B + β2B2 + . . . ) It has been shown that a necessary and sufficient condition for stationarity is that the roots of the equation φ(B) = 1 − α1B − · · · − αpBp = 0 lie outside the unit circle. Computational Intelligence Methods for Prediction – p. 24/83
  • 25. Autocorrelation in AR(p) Unlike the autocorrelation function of an MA(q) process, which cuts off at lag q, the autocorrelation of an AR(p) process attenuates slowly. Plot: AR(16) correlation (absolute value). Computational Intelligence Methods for Prediction – p. 25/83
  • 26. Fitting an autoregressive process The estimation of an autoregressive process to a set of data DT = {ϕ1, . . . , ϕT } demands the resolution of two problems: 1. The estimation of the order p of the process. 2. The estimation of the set of parameters {α1, . . . , αp}. Computational Intelligence Methods for Prediction – p. 26/83
  • 27. Estimation of AR(p) parameters Suppose we have an AR(p) process of order p ϕt = α1ϕt−1 + · · · + αpϕt−p + wt Given T observations, the parameters may be estimated by least-squares by minimizing ˆα = arg minα Σt=p+1,...,T [ϕt − (α1ϕt−1 + · · · + αpϕt−p)]2 In matrix form this amounts to solving the multiple least-squares problem where Computational Intelligence Methods for Prediction – p. 27/83
  • 28. Estimation of AR(p) parameters (II) X is the matrix whose rows contain the lagged regressors and Y is the vector of targets: the first row of X is (ϕT −1 , ϕT −2 , . . . , ϕT −p ) with target ϕT , the second row is (ϕT −2 , ϕT −3 , . . . , ϕT −p−1 ) with target ϕT −1 , and so on down to the last row (ϕp , ϕp−1 , . . . , ϕ1 ) with target ϕp+1 ; hence Y = (ϕT , ϕT −1 , . . . , ϕp+1 ). Most of the considerations made for multiple linear regression still hold in this case. Computational Intelligence Methods for Prediction – p. 28/83
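A minimal numpy sketch of this least-squares estimation (the function name fit_ar_ls and the toy AR(2) series are my own; the design matrix is built from lagged values as on the slide above):

import numpy as np

def fit_ar_ls(series, p):
    """Least-squares estimate of the AR(p) coefficients (illustrative sketch)."""
    series = np.asarray(series, dtype=float)
    T = len(series)
    # row for time t: regressors (phi_{t-1}, ..., phi_{t-p}); target: phi_t
    X = np.column_stack([series[p - j - 1:T - j - 1] for j in range(p)])
    Y = series[p:]
    alpha, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return alpha

# quick check on a simulated AR(2) series
rng = np.random.default_rng(1)
true_alpha = np.array([0.6, -0.3])
phi = np.zeros(5000)
for t in range(2, len(phi)):
    phi[t] = true_alpha @ phi[t - 2:t][::-1] + rng.normal()
print(fit_ar_ls(phi, 2))   # should be close to [0.6, -0.3]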
  • 29. The NAR representation • AR models assume that the relation between past and future is linear • Once we admit that the linearity assumption may not hold, we may extend the AR formulation to a Nonlinear Auto Regressive (NAR) formulation ϕt = f(ϕt−d , ϕt−d−1 , . . . , ϕt−d−n+1 ) + wt where the missing information is lumped into a noise term w. • In what follows we will consider this relationship as a particular instance of a dependence y = f(x) + w between a multidimensional input x ∈ X ⊂ Rn and a scalar output y ∈ R. Computational Intelligence Methods for Prediction – p. 29/83
  • 30. Nonlinear vs. linear • AR models assume that the relation between past and future is linear • The advantages of linear models are numerous: • the least-squares ˆβ estimate can be expressed in an analytical form • the least-squares ˆβ estimate can be easily calculated through matrix computation. • statistical properties of the estimator can be easily defined. • recursive formulations for sequential updating are available. • the relation between empirical and generalization error is known. • Unfortunately in real problems it is extremely unlikely that the input and output variables are linked by a linear relation. • Moreover, the form of the relation is often unknown and only a limited amount of samples is available. Computational Intelligence Methods for Prediction – p. 30/83
  • 31. Supervised learning (Block diagram: an unknown dependency links inputs to outputs; the observed input/output pairs form the training dataset; the prediction model is fitted so as to reduce the prediction error.) The prediction problem is also known as the supervised learning problem, because of the presence of the outcome variable which guides the learning process. Collecting a set of training data is analogous to the situation where a teacher suggests the correct answer for each input configuration. Computational Intelligence Methods for Prediction – p. 31/83
  • 32. The regression plus noise form • A typical way of representing the unknown input/output relation is the regression plus noise form y = f(x) + w where f(·) is a deterministic function and the term w represents the noise or random error. It is typically assumed that w is independent of x and E[w] = 0. • Suppose that we have available a training set {(xi, yi) : i = 1, . . . , N}, where xi = (xi1, . . . , xin), generated according to the previous model. • The goal of a learning procedure is to find a model h(x) which is able to give a good approximation of the unknown function f(x). • But how can we choose h, if we do not know the probability distribution underlying the data and we have only a limited training set? Computational Intelligence Methods for Prediction – p. 32/83
  • 33. A simple example (scatter plot of a noisy input/output dataset to be fitted). Computational Intelligence Methods for Prediction – p. 33/83
  • 34. Model 1 (fitted model of degree 1, training error = 2). Computational Intelligence Methods for Prediction – p. 34/83
  • 35. Model 2 (fitted model of degree 3, training error = 0.92). Computational Intelligence Methods for Prediction – p. 35/83
  • 36. Model 3 (fitted model of degree 18, training error = 0.4). Computational Intelligence Methods for Prediction – p. 36/83
  • 37. Generalization and overfitting • How to estimate the quality of a model? Is the training error a good measure of the quality? • The goal of learning is to find a model which is able to generalize, i.e. able to return good predictions for input sets independent of the training set. • In a nonlinear setting, it is possible to find models with such a complicated structure that they have a null empirical risk. Are these models good? • Typically NOT, since doing very well on the training set could mean doing badly on new data. • This is the phenomenon of overfitting. • Using the same data for training a model and assessing it is typically a wrong procedure, since this returns an over-optimistic assessment of the model generalization capability. Computational Intelligence Methods for Prediction – p. 37/83
  • 38. Bias and variance of a model A result of estimation theory shows that the mean squared error, i.e. a measure of the generalization quality of an estimator, can be decomposed into three terms: MSE = σ2 w + squared bias + variance where the intrinsic noise term reflects the target alone, the bias reflects the target’s relation with the learning algorithm and the variance term reflects the learning algorithm alone. This result is theoretical since these quantities cannot be measured on the basis of a finite amount of data. However, this result provides insight into what makes a learning process accurate. Computational Intelligence Methods for Prediction – p. 38/83
  • 39. The bias/variance trade-off • The first term is the variance of y around its true mean f(x) and cannot be avoided no matter how well we estimate f(x), unless σ2 w = 0. • The bias measures the difference in x between the average of the outputs of the hypothesis functions over the set of possible DN and the regression function value f(x) • The variance reflects the variability of the guessed h(x, αN ) as one varies over training sets of fixed dimension N. This quantity measures how sensitive the algorithm is to changes in the data set, regardless of the target. Computational Intelligence Methods for Prediction – p. 39/83
  • 40. The bias/variance dilemma • The designer of a learning machine has no access to the term MSE but can only estimate it on the basis of the training set. Hence, the bias/variance decomposition is relevant in practical learning since it provides a useful hint about the features to control in order to make the error MSE small. • The bias term measures the lack of representational power of the class of hypotheses. To reduce the bias term we should consider complex hypotheses which can approximate a large number of input/output mappings. • The variance term warns us against an excessive complexity of the approximator. This means that a class of too powerful hypotheses runs the risk of being excessively sensitive to the noise affecting the training set; therefore, our class could contain the target but it could be practically impossible to find it out on the basis of the available dataset. Computational Intelligence Methods for Prediction – p. 40/83
  • 41. • In other terms, it is commonly said that a hypothesis with large bias but low variance underfits the data while a hypothesis with low bias but large variance overfits the data. • In both cases, the hypothesis gives a poor representation of the target and a reasonable trade-off needs to be found. • The task of the model designer is to search for the optimal trade-off between the variance and the bias term, on the basis of the available training set. Computational Intelligence Methods for Prediction – p. 41/83
  • 43. The learning procedure A learning procedure aims at two main goals: 1. to choose a parametric family of hypotheses h(x, α) which contains or gives a good approximation of the unknown function f (structural identification). 2. within the family h(x, α), to estimate on the basis of the training set DN the parameter αN which best approximates f (parametric identification). In order to accomplish that, a learning procedure is made of two nested loops: 1. an external structural identification loop which goes through different model structures 2. an inner parametric identification loop which searches for the best parameter vector within the family structure. Computational Intelligence Methods for Prediction – p. 43/83
  • 44. Parametric identification The parametric identification of the hypothesis is done according to the ERM (Empirical Risk Minimization) principle where αN = α(DN ) = arg min α∈Λ MISEemp(α) minimizes the empirical risk or training error MISEemp(α) = (1/N) Σi=1,...,N (yi − h(xi, α))2 constructed on the basis of the data set DN . Computational Intelligence Methods for Prediction – p. 44/83
  • 45. Parametric identification (II) • The computation of αN requires a procedure of multivariate optimization in the space of parameters. • The complexity of the optimization depends on the form of h(·). • In some cases the parametric identification problem may be an NP-hard problem. • Thus, we must resort to some form of heuristic search. • Examples of parametric identification procedure are linear least-squares for linear models and backpropagated gradient-descent for feedforward neural networks. Computational Intelligence Methods for Prediction – p. 45/83
  • 46. Validation techniques How to measure MISE in a reliable way on a finite dataset? The most common techniques to return an estimate of MISE are Testing: a testing sequence independent of DN and distributed according to the same probability distribution is used to assess the quality. In practice, unfortunately, an additional set of input/output observations is rarely available. Holdout: The holdout method, sometimes called test sample estimation, partitions the data DN into two mutually exclusive subsets, the training set Dtr and the holdout or test set DNts . k-fold Cross-validation: the set DN is randomly divided into k mutually exclusive test partitions of approximately equal size. The cases not found in each test partition are independently used for selecting the hypothesis which will be tested on the partition itself. The average error over all the k partitions is the cross-validated error rate. Computational Intelligence Methods for Prediction – p. 46/83
  • 47. The K-fold cross-validation This is the algorithm in detail: 1. split the dataset DN into K roughly equal-sized parts. 2. For the kth part, k = 1, . . . , K, fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the k-th part of the data. 3. Do the above for k = 1, . . . , K and average the K estimates of prediction error. Let k(i) be the part of DN containing the ith sample. Then the cross-validation estimate of the MISE prediction error is MISECV = (1/N) Σi=1,...,N (yi − ˆy −k(i) i )2 = (1/N) Σi=1,...,N (yi − h(xi, α−k(i) ))2 where ˆy −k(i) i denotes the fitted value for the ith observation returned by the model estimated with the k(i)th part of the data removed. Computational Intelligence Methods for Prediction – p. 47/83
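The following sketch implements the K-fold procedure above for a simple regression problem (illustrative only: the polynomial hypothesis family, the toy data and the function name kfold_mise are my own choices, not from the slides).

import numpy as np

def kfold_mise(x, y, degree, K=10, seed=0):
    """K-fold cross-validation estimate of MISE for a polynomial model."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, K)
    sq_err = np.empty(len(x))
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)   # fit on the other K-1 parts
        pred = np.polyval(coeffs, x[test_idx])                    # predict the k-th part
        sq_err[test_idx] = (y[test_idx] - pred) ** 2
    return sq_err.mean()

# toy data: y = f(x) + w
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = x**3 - 2 * x + rng.normal(0, 1, 100)
for d in (1, 3, 9):
    print(d, kfold_mise(x, y, d))

On this toy problem the cross-validated error is typically lowest for an intermediate degree, which is the behavior the bias/variance discussion above leads us to expect.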
  • 48. 10-fold cross-validation K = 10: at each iteration 90% of the data are used for training and the remaining 10% for testing. Computational Intelligence Methods for Prediction – p. 48/83
  • 49. Leave-one-out cross validation • The cross-validation algorithm where K = N is also called the leave-one-out algorithm. • This means that for each ith sample, i = 1, . . . , N, 1. we carry out the parametric identification, leaving that observation out of the training set, 2. we compute the predicted value for the ith observation, denoted by ˆy−i i The corresponding estimate of the MISE prediction error is MISELOO = (1/N) Σi=1,...,N (yi − ˆy−i i )2 = (1/N) Σi=1,...,N (yi − h(xi, α−i ))2 where α−i is the set of parameters returned by the parametric identification performed on the training set with the ith sample set aside. Computational Intelligence Methods for Prediction – p. 49/83
  • 50. Model selection • Model selection concerns the final choice of the model structure in the set that has been proposed by model generation and assessed by model validation. • In real problems, this choice is typically a subjective issue and is often the result of a compromise between different factors, like the quantitative measures, the personal experience of the designer and the effort required to implement a particular model in practice. • Here we will consider only quantitative criteria. There are two possible approaches: 1. the winner-takes-all approach 2. the combination of estimators approach. Computational Intelligence Methods for Prediction – p. 50/83
  • 51. Model selection (Diagram: a realization of the stochastic process provides the training set; structural identification proposes classes of hypotheses Λ1, Λ2, . . . , ΛS; parametric identification returns a candidate model for each class; validation and model selection return the learned model.) Computational Intelligence Methods for Prediction – p. 51/83
  • 52. Winner-takes-all The best hypothesis is selected in the set {αs N }, with s = 1, . . . , S, according to ˜s = arg min s=1,...,S MISE s A model with complexity ˜s is trained on the whole dataset DN and used for future predictions. Computational Intelligence Methods for Prediction – p. 52/83
  • 53. Winner-takes-all pseudo-code 1. for s = 1, . . . , S (structural loop): for j = 1, . . . , N: (a) inner parametric identification (for l-o-o): αs N−1 = arg min α∈Λs Σi=1:N, i≠j (yi − h(xi, α))2 (b) ej = yj − h(xj, αs N−1) then MISELOO(s) = (1/N) Σj=1,...,N e2 j 2. Model selection: ˜s = arg min s=1,...,S MISELOO(s) 3. Final parametric identification: α˜s N = arg min α∈Λ˜s Σi=1,...,N (yi − h(xi, α))2 4. The output prediction model is h(·, α˜s N ) Computational Intelligence Methods for Prediction – p. 53/83
  • 54. Model combination • The winner-takes-all approach is intuitively the approach which should work the best. • However, recent results in machine learning show that the performance of the final model can be improved not by choosing the model structure which is expected to predict the best but by creating a model whose output is the combination of the outputs of models having different structures. • The reason is that in reality any chosen hypothesis h(·, αN ) is only an estimate of the real target and, like any estimate, is affected by a bias and a variance term. • Theoretical results on the combination of estimators show that the combination of unbiased estimators leads to an unbiased estimator with reduced variance. • This principle is at the basis of approaches like bagging or boosting. Computational Intelligence Methods for Prediction – p. 54/83
  • 55. Local modeling procedure The learning of a local model in xq ∈ Rn can be summarized in these steps: 1. Compute the distance between the query xq and the training samples according to a predefined metric. 2. Rank the neighbors on the basis of their distance to the query. 3. Select a subset of the k nearest neighbors according to the bandwidth which measures the size of the neighborhood. 4. Fit a local model (e.g. constant, linear,...). Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed. Let us focus on the bandwidth selection. Computational Intelligence Methods for Prediction – p. 55/83
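A compact sketch of these four steps, using a Euclidean metric and either a local constant or a local linear fit (an illustration of mine; function and variable names are made up):

import numpy as np

def local_predict(X, Y, x_query, k=10, model="linear"):
    """Local modeling: rank the training points by distance to the query,
    keep the k nearest ones and fit a simple local model (sketch)."""
    X, Y = np.atleast_2d(X), np.asarray(Y, dtype=float)
    d = np.linalg.norm(X - x_query, axis=1)        # 1. distances to the query
    nn = np.argsort(d)[:k]                         # 2.-3. k nearest neighbors
    if model == "constant":                        # 4a. local constant model
        return Y[nn].mean()
    Xk = np.column_stack([np.ones(k), X[nn]])      # 4b. local linear model
    beta, *_ = np.linalg.lstsq(Xk, Y[nn], rcond=None)
    return np.concatenate(([1.0], np.atleast_1d(x_query))) @ beta

# toy one-dimensional example
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (200, 1))
Y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)
print(local_predict(X, Y, np.array([1.0]), k=20))   # close to sin(1) ≈ 0.84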
  • 56. The bandwidth trade-off: overfit (plot of a local fit around a query point q, with prediction error e) Too narrow a bandwidth ⇒ overfitting ⇒ large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high variance. Computational Intelligence Methods for Prediction – p. 56/83
  • 57. The bandwidth trade-off: underfit (plot of a local fit around a query point q, with prediction error e) Too large a bandwidth ⇒ underfitting ⇒ large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high bias. Computational Intelligence Methods for Prediction – p. 57/83
  • 58. Bandwidth and bias/variance trade-off (plot of mean squared error, bias and variance versus 1/bandwidth: many neighbors ⇒ high bias, underfitting; few neighbors ⇒ high variance, overfitting). Computational Intelligence Methods for Prediction – p. 58/83
  • 59. The PRESS statistic • Cross-validation can provide a reliable estimate of the algorithm generalization error but it requires the training process to be repeated K times, which sometimes means a large computational effort. • In the case of linear models there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a reduced computational cost • It is the PRESS (Prediction Sum of Squares) statistic, a simple formula which returns the leave-one-out (l-o-o) as a by-product of the least-squares. Computational Intelligence Methods for Prediction – p. 59/83
  • 60. Leave-one-out for linear models (Diagram.) The leave-one-out error can be computed in two equivalent ways: the slowest way (on the right), which repeats N times the training and test procedure (put the j-th sample aside, perform the parametric identification on the remaining N − 1 samples, test on the j-th sample); and the fastest way (on the left), which performs the parametric identification only once on the N samples and then computes the PRESS statistic. Computational Intelligence Methods for Prediction – p. 60/83
  • 61. The PRESS statistic • This allows a fast cross-validation without repeating N times the leave-one-out procedure. The PRESS procedure can be described as follows: 1. we use the whole training set to estimate the linear regression coefficients ˆβ = (XT X)−1 XT Y 2. This procedure is performed only once on the N samples and returns as a by-product the Hat matrix H = X(XT X)−1 XT 3. we compute the residual vector e, whose jth term is ej = yj − xT j ˆβ, 4. we use the PRESS statistic to compute eloo j as eloo j = ej /(1 − Hjj ) where Hjj is the jth diagonal term of the matrix H. Computational Intelligence Methods for Prediction – p. 61/83
  • 62. The PRESS statistic Thus, the leave-one-out estimate of the local mean integrated squared error is: MISELOO = (1/N) Σi=1,...,N ((yi − ˆyi)/(1 − Hii))2 Note that PRESS is not an approximation of the loo error but simply a faster way of computing it. Computational Intelligence Methods for Prediction – p. 62/83
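Here is a minimal numpy sketch (my own, on hypothetical data) that computes the leave-one-out residuals through the PRESS formula and verifies that they coincide with those of the explicit leave-one-out loop.

import numpy as np

rng = np.random.default_rng(3)
N, n = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, n))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + rng.normal(0, 0.3, N)

# PRESS: fit once on all N samples, then rescale the residuals
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
e = Y - X @ beta_hat
e_loo = e / (1 - np.diag(H))                   # leave-one-out residuals
mise_loo = np.mean(e_loo**2)

# sanity check against the explicit (slow) leave-one-out loop
e_slow = np.empty(N)
for j in range(N):
    keep = np.arange(N) != j
    b_j = np.linalg.lstsq(X[keep], Y[keep], rcond=None)[0]
    e_slow[j] = Y[j] - X[j] @ b_j
print(np.allclose(e_loo, e_slow), mise_loo)    # True, and the loo estimate of MISE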
  • 63. Selection of the number of neighbours • For a given query point xq, we can compute a set of predictions ˆyq(k) = xT q ˆβ(k) , together with a set of associated leave-one-out error vectors MISELOO(k) for a number of neighbors ranging in [kmin, kmax]. • If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction ˆyq, consists in comparing the prediction obtained for each value of k on the basis of the classical mean square error criterion: ˆyq = xT q ˆβ(ˆk), with ˆk = arg min k MISELOO(k) Computational Intelligence Methods for Prediction – p. 63/83
  • 64. Local Model combination • As an alternative to the winner-takes-all paradigm, we can use a combination of estimates. • The final prediction of the value yq is obtained as a weighted average of the best b models, where b is a parameter of the algorithm. • Suppose the predictions ˆyq(k) and the loo errors MISELOO(k) have been ordered creating a sequence of integers {ki} so that MISELOO(ki) ≤ MISELOO(kj), ∀i < j. The prediction of ˆyq is given by ˆyq = Σi=1,...,b ζi ˆyq(ki) / Σi=1,...,b ζi , where the weights are the inverse of the mean square errors: ζi = 1/MISELOO(ki). Computational Intelligence Methods for Prediction – p. 64/83
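A small sketch of this weighted combination (illustrative; the candidate predictions and loo errors below are made-up numbers standing for the quantities defined above):

import numpy as np

def combine_predictions(preds, mise_loo, b=3):
    """Weighted average of the b predictions with the lowest leave-one-out
    error; the weights are the inverse loo errors (sketch)."""
    preds = np.asarray(preds, dtype=float)
    mise_loo = np.asarray(mise_loo, dtype=float)
    best = np.argsort(mise_loo)[:b]            # indices of the b best models
    zeta = 1.0 / mise_loo[best]
    return np.sum(zeta * preds[best]) / np.sum(zeta)

# e.g. predictions obtained with k = 3, 5, 7, 9, 11 neighbors
preds = [0.81, 0.84, 0.86, 0.90, 0.95]
mise = [0.20, 0.12, 0.10, 0.15, 0.30]
print(combine_predictions(preds, mise, b=3))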
  • 65. One step-ahead and iterated prediction • Once a model of the embedding mapping is available, it can be used for two objectives: one-step-ahead prediction and iterated prediction. • In one-step-ahead prediction, the n previous values of the series are available and the forecasting problem can be cast in the form of a generic regression problem • In the literature a number of supervised learning approaches have been used successfully to perform one-step-ahead forecasting on the basis of historical data. Computational Intelligence Methods for Prediction – p. 65/83
  • 66. One step-ahead prediction (Diagram.) The approximator ˆf returns the prediction of the value of the time series at time t + 1 as a function of the n previous values (the rectangular box containing z−1 represents a unit delay operator, i.e., ϕt−1 = z−1 ϕt ). Computational Intelligence Methods for Prediction – p. 66/83
  • 67. Multi-step ahead prediction • The prediction of the value of a time series H > 1 steps ahead is called H-step-ahead prediction. • We classify the methods for H-step-ahead prediction according to two features: the horizon of the training criterion and the single-output or multi-output nature of the predictor. Computational Intelligence Methods for Prediction – p. 67/83
  • 68. Multi-step ahead prediction strategies The most common strategies are 1. Iterated: the model predicts h steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the one-step-ahead forecast (one-step-ahead training criterion). 2. Iterated strategy where parameters are optimized to minimize the training error on the iterated htr-step-ahead forecast (htr-step-ahead training criterion) where 1 < htr ≤ H. 3. Direct: the model makes a direct forecast at time t + h − 1, h = 1, . . . , H by modeling the time series in a multi-input single-output form 4. DirRec: direct forecast but the input vector is extended at each step with predicted values. 5. MIMO: the model returns a vectorial forecast by modeling the time series in a multi-input multi-output form Computational Intelligence Methods for Prediction – p. 68/83
  • 69. Iterated (or recursive) prediction • In the case of iterated prediction, the predicted output is fed back as input for the next prediction. • Here, the inputs consist of predicted values as opposed to actual observations of the original time series. • As the feedback values are typically distorted by the errors made by the predictor in previous steps, the iterative procedure may produce undesired effects of accumulation of the error. • Low performance is expected in long horizon tasks. This is due to the fact that they are essentially models tuned with a one-step-ahead criterion which is not capable of taking temporal behavior into account. Computational Intelligence Methods for Prediction – p. 69/83
  • 70. Iterated prediction (Diagram.) The approximator ˆf returns the prediction of the value of the time series at time t + 1 by iterating the predictions obtained in the previous steps (the rectangular box containing z−1 represents a unit delay operator, i.e., ˆϕt−1 = z−1 ˆϕt ). Computational Intelligence Methods for Prediction – p. 70/83
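A minimal sketch of the iterated (recursive) strategy, assuming a linear AR(p) one-step model fitted by least squares (the function name and the toy series are mine; any one-step learner could play the same role):

import numpy as np

def recursive_forecast(series, p, H):
    """Iterated (recursive) multi-step forecast with a linear AR(p)
    one-step model fitted by least squares (illustrative sketch)."""
    series = np.asarray(series, dtype=float)
    T = len(series)
    X = np.column_stack([series[p - j - 1:T - j - 1] for j in range(p)])
    Y = series[p:]
    alpha = np.linalg.lstsq(X, Y, rcond=None)[0]

    history = list(series[-p:])                # last p observed values
    forecasts = []
    for _ in range(H):
        x = np.array(history[::-1][:p])        # (phi_t, ..., phi_{t-p+1})
        y_hat = alpha @ x
        forecasts.append(y_hat)
        history.append(y_hat)                  # feed the prediction back as input
    return np.array(forecasts)

rng = np.random.default_rng(4)
y = np.sin(np.arange(300) * 0.2) + rng.normal(0, 0.05, 300)
print(recursive_forecast(y, p=5, H=10))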
  • 71. Iterated with h-step training criterion • This strategy adopts one-step-ahead predictors but adapts the model selection criterion in order to take into account the multi-step-ahead objective. • Methods like Recurrent Neural Networks belong to such class. Their recurrent architecture and the associated training algorithm (temporal backpropagation) are suitable to handle the time-dependent nature of the data. • In [?] we proposed an adaptation of the Lazy Learning algorithm where the number of neighbors is optimized in order to minimize the leave-one-out error over an horizon larger than one. This technique ranked second in the 1998 KULeuven Time Series competition. • A similar technique has been proposed by [?] who won the competition. Computational Intelligence Methods for Prediction – p. 71/83
  • 72. Direct strategy • The Direct strategy [?, ?, ?] learns independently H models fh ϕt+h = fh(ϕt, . . . , ϕt−n+1) + wt+h with t ∈ {n, . . . , N − H} and h ∈ {1, . . . , H} and returns a multi-step forecast by concatenating the H predictions. Computational Intelligence Methods for Prediction – p. 72/83
  • 73. Direct strategy • Since the Direct strategy does not use any approximated values to compute the forecasts, it is not prone to any accumulation of errors, since each model is tailored for the horizon it is supposed to predict. Notwithstanding, it has some weaknesses. • Since the H models are learned independently, no statistical dependency between the predictions ˆyN+h is considered [?, ?, ?]. • Direct methods often require higher functional complexity [?] than iterated ones in order to model the stochastic dependency between two series values at two distant instants [?]. • This strategy demands a large computational time since the number of models to learn is equal to the size of the horizon. • Different machine learning models have been used to implement the Direct strategy for multi-step forecasting tasks, for instance neural networks [?], nearest neighbors [?] and decision trees [?]. Computational Intelligence Methods for Prediction – p. 73/83
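A sketch of the Direct strategy with linear models fh fitted by least squares (names and toy data are mine; in practice any learner can play the role of fh):

import numpy as np

def direct_forecast(series, n, H):
    """Direct strategy: one least-squares linear model per horizon h
    (illustrative sketch with linear f_h)."""
    series = np.asarray(series, dtype=float)
    N = len(series)
    x_last = series[-n:][::-1]                       # (y_N, ..., y_{N-n+1})
    forecasts = []
    for h in range(1, H + 1):
        rows, targets = [], []
        for t in range(n - 1, N - h):                # t indexes y_t (0-based)
            rows.append(series[t - n + 1:t + 1][::-1])   # (y_t, ..., y_{t-n+1})
            targets.append(series[t + h])                # y_{t+h}
        beta = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)[0]
        forecasts.append(beta @ x_last)
    return np.array(forecasts)

rng = np.random.default_rng(5)
y = np.sin(np.arange(300) * 0.2) + rng.normal(0, 0.05, 300)
print(direct_forecast(y, n=5, H=10))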
  • 74. DirRec strategy • The DirRec strategy [?] combines the architectures and the principles underlying the Direct and the Recursive strategies. • DirRec computes the forecasts with different models for every horizon (like the Direct strategy) and, at each time step, it enlarges the set of inputs by adding variables corresponding to the forecasts of the previous step (like the Recursive strategy). • Unlike the previous strategies, the embedding size n is not the same for all the horizons. In other terms, the DirRec strategy learns H models fh from the time series [y1, . . . , yN ] where yt+h = fh(yt+h−1, . . . , yt−n+1) + wt+h with t ∈ {n, . . . , N − H} and h ∈ {1, . . . , H}. • The technique is prone to the curse of dimensionality. The use of feature selection is recommended for large h. Computational Intelligence Methods for Prediction – p. 74/83
  • 75. MIMO strategy • This strategy [?, ?] (also known as Joint strategy [?]) avoids the simplistic assumption of conditional independence between future values made by the Direct strategy [?, ?] by learning a single multiple-output model [yt+H, . . . , yt+1] = F(yt, . . . , yt−n+1) + w where t ∈ {n, . . . , N − H}, F : Rd → RH is a vector-valued function [?], and w ∈ RH is a noise vector with a covariance that is not necessarily diagonal [?]. • The forecasts are returned in one step by a multiple-output model ˆF where [ˆyt+H , . . . , ˆyt+1] = ˆF(yN , . . . , yN−n+1) Computational Intelligence Methods for Prediction – p. 75/83
  • 76. MIMO strategy • The rationale of the MIMO strategy is to model the stochastic dependency characterizing the time series also between the predicted values. This strategy avoids the conditional independence assumption made by the Direct strategy as well as the accumulation of errors which plagues the Recursive strategy. So far, this strategy has been successfully applied to several real-world multi-step time series forecasting tasks [?, ?, ?, ?]. • However, the wish to preserve the stochastic dependencies constrains all the horizons to be forecasted with the same model structure. Since this constraint could reduce the flexibility of the forecasting approach [?], a variant of the MIMO strategy has been proposed in [?, ?]. Computational Intelligence Methods for Prediction – p. 76/83
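A minimal sketch of the MIMO idea using a single multi-output linear least-squares model (an illustration of mine, not the models used in the cited works):

import numpy as np

def mimo_forecast(series, n, H):
    """MIMO strategy: a single multi-output least-squares model returning
    the whole H-step forecast vector at once (illustrative sketch)."""
    series = np.asarray(series, dtype=float)
    N = len(series)
    X, Y = [], []
    for t in range(n - 1, N - H):                    # t indexes y_t (0-based)
        X.append(series[t - n + 1:t + 1][::-1])      # (y_t, ..., y_{t-n+1})
        Y.append(series[t + 1:t + H + 1])            # (y_{t+1}, ..., y_{t+H})
    B = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)[0]   # shape (n, H)
    x_last = series[-n:][::-1]
    return x_last @ B                                # vector of the H forecasts

rng = np.random.default_rng(6)
y = np.sin(np.arange(300) * 0.2) + rng.normal(0, 0.05, 300)
print(mimo_forecast(y, n=5, H=10))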
  • 77. Validation of time series methods • The huge variety of strategies and algorithms that can be used to infer a predictor from observed data asks for a rigorous procedure of comparison and assessment. • Assessment demands benchmarks and benchmarking procedures. • Benchmarks can be defined by using • Simulated data obtained by simulating AR, NAR and other stochastic processes. This is particularly useful for validating theoretical properties in terms of bias/variance. • Public domain benchmarks, like the ones provided by Time Series Competitions. • Real measured data Computational Intelligence Methods for Prediction – p. 77/83
  • 78. Competitions • Santa Fe Time Series Prediction and Analysis Competition (1994) [?]: • International Workshop on Advanced Black-box techniques for nonlinear modeling Competition (Leuven, Belgium; 1998) • NN3 competition [?]: 111 monthly time series drawn from a homogeneous population of empirical business time series. • NN5 competition [?]: 111 time series of the daily withdrawal amounts from independent cash machines at different, randomly selected locations across England. Computational Intelligence Methods for Prediction – p. 78/83
  • 79. MLG projects on forecasting: wireless sensors, anesthesia, car market prediction. Computational Intelligence Methods for Prediction – p. 79/83
  • 80. Open-source software Computational Intelligence Methods for Prediction – p. 80/83
  • 81. All that we didn’t have time to discuss Computational Intelligence Methods for Prediction – p. 81/83
  • 82. Perspectives Computational Intelligence Methods for Prediction – p. 82/83
  • 83. Suggestions for PhD topics Computational Intelligence Methods for Prediction – p. 83/83