# 10 Time Series and Forecasting
Arthur Charpentier (Université du Québec à Montréal)
Machine Learning & Econometrics
SIDE Summer School - July 2019
@freakonometrics · freakonometrics.hypotheses.org
Time Series
A time series is a sequence of observations $(y_t)$ ordered in time.
Write $y_t = s_t + u_t$, with a systematic part $s_t$ (signal / trend) and a ‘residual’ term $u_t$:
$(u_t)$ is assumed to be a strictly stationary time series
$(s_t)$ might be a ‘linear’ trend, plus a seasonal cycle
Buys-Ballot (1847, Les changements périodiques de température, dépendants de la nature du soleil et de la lune, mis en rapport avec le pronostic du temps, déduits d’observations néerlandaises de 1729 à 1846) - original probably in Dutch.
Time Series
Consider the general prediction ${}_t\widehat{y}_{t+h} = m(\text{information available at time } t)$
hp <- read.csv("http://freakonometrics.free.fr/multiTimeline.csv", skip = 2)   # skip the two header lines of the csv
T <- 86 - 24
trainY <- ts(hp[1:T, 2], frequency = 12, start = c(2012, 6))                   # training sample: first T months
validY <- ts(hp[(T + 1):nrow(hp), 2], frequency = 12, start = c(2017, 8))      # validation sample
Time Series
In $y_t = s_t + u_t$, $s_t$ can be a trend, plus a seasonal cycle
stats::decompose(trainY)   # additive decomposition: trend + seasonal + remainder
Time Series
The Buys-Ballot model is based on $y_t = s_t + u_t$, with
$$s_t = \beta_0 + \beta_1 t + \sum_{h=1}^{12}\gamma_h\,\mathbf 1\big(t \equiv h \ (\mathrm{mod}\ 12)\big)$$
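A minimal sketch (not from the original slides) of a Buys-Ballot regression on the trainY series defined earlier: a linear trend plus monthly dummies, fitted by ordinary least squares.

```r
# Buys-Ballot regression: linear trend + monthly dummies
df <- data.frame(y = as.numeric(trainY),
                 t = seq_along(trainY),
                 month = factor(cycle(trainY)))   # 1..12, the seasonal index
bb <- lm(y ~ t + month, data = df)
summary(bb)
# fitted systematic part s_t, on top of the observed series
plot(as.numeric(trainY), type = "l")
lines(fitted(bb), col = "blue")
```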
Time Series : Linear Trend ?
Various (machine learning) techniques can be used, such as Laurinec (2017, Using
regression trees for forecasting double-seasonal time series with trend in R),
Bontempi et al. (2012, Machine Learning Strategies for Time Series Forecasting),
Dietterich (2002, Machine Learning for Sequential Data: A Review)
Time Series : Exponential Smoothing
“when Gardner (1985) appeared, many believed that exponential smoothing should
be disregarded because it was either a special case of ARIMA modeling or an ad
hoc procedure with no statistical rationale. As McKenzie (1985) observed, this
opinion was expressed in numerous references to my paper. Since 1985, the
special case argument has been turned on its head, and today we know that
exponential smoothing methods are optimal for a very general class of state-space
models that is in fact broader than the ARIMA class.”
from Hyndman et al. (2008, Forecasting with Exponential Smoothing)
Time Series : Exponential Smoothing
Exponential smoothing - Simple
From the time series $(y_t)$, define a smoothed version
$$s_t = \alpha\cdot y_t + (1-\alpha)\cdot s_{t-1} = s_{t-1} + \alpha\cdot(y_t - s_{t-1})$$
for some $\alpha\in(0,1)$ and starting point $s_0 = y_1$. The forecast is ${}_t\widehat y_{t+h} = s_t$.
It is called exponential smoothing since
$$\begin{aligned}
s_t &= \alpha\,y_t + (1-\alpha)\,s_{t-1}\\
&= \alpha\,y_t + \alpha(1-\alpha)\,y_{t-1} + (1-\alpha)^2 s_{t-2}\\
&= \alpha\big(y_t + (1-\alpha)y_{t-1} + (1-\alpha)^2 y_{t-2} + (1-\alpha)^3 y_{t-3} + \cdots + (1-\alpha)^{t-1}y_1\big) + (1-\alpha)^t s_0
\end{aligned}$$
corresponding to an exponentially weighted moving average.
Cross-validation techniques need to be adapted to this time-ordered setting.
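As a hedged illustration (not in the original deck), simple exponential smoothing takes a few lines of R; next to it, forecast::ses() estimates $\alpha$ for us. The name ses_by_hand is an ad hoc choice.

```r
# simple exponential smoothing, written out explicitly
ses_by_hand <- function(y, alpha) {
  s <- numeric(length(y)); s[1] <- y[1]
  for (t in 2:length(y)) s[t] <- alpha * y[t] + (1 - alpha) * s[t - 1]
  s
}
s <- ses_by_hand(as.numeric(trainY), alpha = 0.3)   # flat forecast: tail(s, 1)
library(forecast)
ses_fit <- ses(trainY, h = 12)                      # alpha estimated by the package
ses_fit$model$par["alpha"]
```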
Time Series : Exponential Smoothing
Optimal α ? α ∈ argmin
T
t=2
2(yt − t−1yt) (leave-one-out strategy)
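A small sketch of this criterion, assuming the trainY series from earlier; the recursion and the sum of squared one-step-ahead errors mirror the formulas above.

```r
# choose alpha by minimizing the one-step-ahead sum of squared errors
sse_alpha <- function(alpha, y) {
  s <- y[1]; sse <- 0
  for (t in 2:length(y)) {
    sse <- sse + (y[t] - s)^2          # {t-1}hat y_t = s_{t-1}
    s   <- alpha * y[t] + (1 - alpha) * s
  }
  sse
}
optimize(sse_alpha, interval = c(0, 1), y = as.numeric(trainY))$minimum
```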
See Hyndman et al. (2008, Forecasting with Exponential Smoothing)
Exponential smoothing - Double
From the time series $(y_t)$, define a smoothed version
$$\begin{cases}
s_t = \alpha\,y_t + (1-\alpha)(s_{t-1}+b_{t-1})\\
b_t = \beta\,(s_t - s_{t-1}) + (1-\beta)\,b_{t-1}
\end{cases}$$
for some $\alpha\in(0,1)$, some trend $\beta\in(0,1)$ and starting points $s_0 = y_0$ and $b_0 = y_1 - y_0$. The forecast is ${}_t\widehat y_{t+h} = s_t + h\cdot b_t$.
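A quick sketch of double (Holt) smoothing in R; forecast::holt() and stats::HoltWinters(..., gamma = FALSE) both estimate $\alpha$ and $\beta$ from the data (argument values here are arbitrary choices).

```r
library(forecast)
holt_fit <- holt(trainY, h = 30, damped = FALSE)    # level + trend, no seasonality
plot(holt_fit)
lines(validY, col = "red")
stats::HoltWinters(trainY, gamma = FALSE)           # same idea, base-R implementation
```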
Time Series : Exponential Smoothing
Exponential smoothing - Seasonal with lag L (Holt-Winters)
From the time series $(y_t)$, define a smoothed version
$$\begin{cases}
s_t = \alpha\,\dfrac{y_t}{c_{t-L}} + (1-\alpha)(s_{t-1}+b_{t-1})\\[4pt]
b_t = \beta\,(s_t - s_{t-1}) + (1-\beta)\,b_{t-1}\\[4pt]
c_t = \gamma\,\dfrac{y_t}{s_t} + (1-\gamma)\,c_{t-L}
\end{cases}$$
for some $\alpha\in(0,1)$, some trend $\beta\in(0,1)$, some seasonal smoothing factor $\gamma\in(0,1)$, and starting point $s_0 = y_0$. The forecast is
${}_t\widehat y_{t+h} = (s_t + h\,b_t)\,c_{t-L+1+(h-1)\bmod L}$.
See stats::HoltWinters()
Time Series : Exponential Smoothing
1 hw_fit <- stats :: HoltWinters (trainY)
2 library(forecast)
3 plot(forecast(hw_fit , h=30))
4 lines(validY ,col="red")
Time Series : State Space Models
See De Livera, Hyndman & Snyder (2011, Forecasting Time Series With Complex Seasonal Patterns Using Exponential Smoothing), based on the Box-Cox transformation of $y_t$:
$$y_t^{(\lambda)} = \frac{y_t^{\lambda}-1}{\lambda}\ \text{ if } \lambda\neq 0 \quad (\text{otherwise } \log y_t)$$
See forecast::tbats()
library(forecast)
forecast::tbats(trainY)$lambda     # estimated Box-Cox parameter
[1] 0.2775889
Time Series : State Space Models
library(forecast)
tbats_fit <- tbats(trainY)
plot(forecast(tbats_fit, h = 30))
lines(validY, col = "red")
Time Series : State Space Models
Exponential smoothing state space model with Box-Cox transformation, ARMA
errors, Trend and Seasonal components
$$y_t^{(\lambda)} = \ell_{t-1} + \phi\, b_{t-1} + \sum_{i=1}^{T} s^{(i)}_{t-m_i} + d_t \quad\text{where}$$
• $(\ell_t)$ is some local level, $\ell_t = \ell_{t-1} + \phi\, b_{t-1} + \alpha\, d_t$
• $(b_t)$ is some trend with damping, $b_t = \phi\, b_{t-1} + \beta\, d_t$
• $(d_t)$ is some ARMA process for the stationary component,
$$d_t = \sum_{i=1}^{p}\varphi_i\, d_{t-i} + \varepsilon_t + \sum_{j=1}^{q}\theta_j\, \varepsilon_{t-j}$$
• $(s^{(i)}_t)$ is the $i$-th seasonal component
Time Series : State Space Models
Let $\boldsymbol x_t$ denote the state variables (e.g. level, slope, seasonal).
Classical statistical approach: compute the likelihood from the errors $\varepsilon_1,\cdots,\varepsilon_T$; see forecast::ets()
Innovations state space models
Let $\boldsymbol x_t = (s_t, b_t, c_t)$ and suppose $\varepsilon_t$ i.i.d. $\mathcal N(0,\sigma^2)$
State equation: $\boldsymbol x_t = f(\boldsymbol x_{t-1}) + g(\boldsymbol x_{t-1})\,\varepsilon_t$
Observation equation: $y_t = \mu_t + e_t = h(\boldsymbol x_{t-1}) + \sigma(\boldsymbol x_{t-1})\,\varepsilon_t$
Inference is based on
$$\log L = n\,\log\left(\sum_{t=1}^{T}\frac{\varepsilon_t^2}{\sigma(\boldsymbol x_{t-1})}\right) + 2\sum_{t=1}^{T}\log\big|\sigma(\boldsymbol x_{t-1})\big|$$
One can use time series cross-validation, based on a rolling forecast origin
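For the rolling-forecast-origin idea, here is a minimal sketch with forecast::tsCV(), using an ETS forecaster and a one-step horizon (both arbitrary choices).

```r
library(forecast)
f_ets <- function(y, h) forecast(ets(y), h = h)   # forecaster passed to tsCV
e <- tsCV(trainY, f_ets, h = 1)                   # one-step-ahead CV errors
sqrt(mean(e^2, na.rm = TRUE))                     # rolling-origin RMSE
```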
Time Series : State Space Models
library(forecast)
ets_fit <- forecast::ets(trainY)
plot(forecast(ets_fit, h = 30))
Time Series : Automatic ARIMA
Consider a general seasonal ARIMA process,
$$\Phi_s(L^s)\,\Phi(L)\,(1-L)^d\,(1-L^s)^{d_s}\, y_t = c + \Theta_s(L^s)\,\Theta(L)\,\varepsilon_t$$
See forecast::auto.arima(, include.drift=TRUE); “Automatic algorithms will become
more general - handling a wide variety of time series” (Rob Hyndman)
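A short sketch on the training series from earlier (argument choices are ours):

```r
library(forecast)
aa_fit <- auto.arima(trainY, include.drift = TRUE)
summary(aa_fit)                          # selected (p,d,q)(P,D,Q)[12] orders
plot(forecast(aa_fit, h = 30))
lines(validY, col = "red")
```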
Time Series : RNN (recurrent neural nets)
RNN : Recurrent neural network
Class of neural networks where connections between nodes form a directed
graph along a temporal sequence.
Recurrent neural networks are networks with loops, allowing information to
persist.
Classical neural net: $y_i = m(\boldsymbol x_i)$
Recurrent neural net: $y_t = m(\boldsymbol x_t, y_{t-1}) = m(\boldsymbol x_t, m(\boldsymbol x_{t-1}, y_{t-2})) = \cdots$
Time Series : RNN (recurrent neural nets)
A is the neural net, h is the output (y) and x some covariates.
(source https://colah.github.io/)
See Sutskever (2013, Training Recurrent Neural Networks)
From recurrent networks to LSTM
Time Series : RNN and LSTM
(source Greff et al. (2017, LSTM: A Search Space Odyssey))
see Hochreiter & Schmidhuber (1997, Long Short-Term Memory)
Time Series : RNN and LSTM
A classical RNN (with a single layer) would be
(source https://colah.github.io/)
“In theory, RNNs are absolutely capable of handling such ‘long-term
dependencies’. A human could carefully pick parameters for them to solve toy
problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn
them” see Bengio et al. (1994, Learning long-term dependencies with gradient
descent is difficult)
Time Series : RNN and LSTM
“RNNs can keep track of arbitrary long-term dependencies in the input
sequences. The problem of “vanilla RNNs” is computational (or practical) in
nature: when training a vanilla RNN using back-propagation, the gradients which
are back-propagated can “vanish” (that is, they can tend to zero) or “explode” (that
is, they can tend to infinity), because of the computations involved in the process”
(from Wikipedia)
Time Series : LSTM
$C$ is the long-term (cell) state, $H$ is the short-term (hidden) state
forget gate: $f_t = \operatorname{sigmoid}(A_f[h_{t-1}, x_t] + b_f)$
input gate: $i_t = \operatorname{sigmoid}(A_i[h_{t-1}, x_t] + b_i)$
new memory cell: $\tilde c_t = \tanh(A_c[h_{t-1}, x_t] + b_c)$
Time Series : LSTM
final memory cell: $c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde c_t$
output gate: $o_t = \operatorname{sigmoid}(A_o[h_{t-1}, x_t] + b_o)$
hidden state: $h_t = o_t \cdot \tanh(c_t)$
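For completeness, a hedged sketch (not in the slides) of a tiny LSTM on lagged values of trainY, with the keras R package; the lag length, layer size and number of epochs are arbitrary choices.

```r
library(keras)
y <- as.numeric(scale(trainY))
lag <- 12
X <- t(sapply(seq_len(length(y) - lag), function(i) y[i:(i + lag - 1)]))
Y <- y[(lag + 1):length(y)]
X <- array(X, dim = c(nrow(X), lag, 1))            # samples x timesteps x features
model <- keras_model_sequential() %>%
  layer_lstm(units = 16, input_shape = c(lag, 1)) %>%
  layer_dense(units = 1)
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(X, Y, epochs = 50, batch_size = 8, verbose = 0)
predict(model, X[nrow(X), , , drop = FALSE])       # one-step-ahead prediction (scaled)
```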
Elicitable Measures & Forecasting
“elicitable” means “being a minimizer of a suitable expected score”, see Gneiting
(2011) Making and evaluating point forecasts.
Elicitable function
$T$ is an elicitable function if there exists a scoring function $S:\mathbb R\times\mathbb R\to[0,\infty)$ such that
$$T(Y) = \operatorname*{argmin}_{x\in\mathbb R}\left\{\int_{\mathbb R} S(x,y)\,dF(y)\right\} = \operatorname*{argmin}_{x\in\mathbb R}\big\{\mathbb E[S(x,Y)]\big\}\quad\text{where } Y\sim F$$
Example: the mean, $T(Y) = \mathbb E[Y]$, is elicited by $S(x,y) = \|x-y\|_2^2$
Example: the median, $T(Y) = \operatorname{median}[Y]$, is elicited by $S(x,y) = \|x-y\|_1$
Example: the quantile, $T(Y) = Q_Y(\tau)$, is elicited by $S(x,y) = \tau(y-x)_+ + (1-\tau)(y-x)_-$
Example: the expectile, $T(Y) = E_Y(\tau)$, is elicited by $S(x,y) = \tau(y-x)_+^2 + (1-\tau)(y-x)_-^2$
Forecasts and Predictions
Mathematical statistics is based on inference and testing, using probabilistic
properties.
A model that reproduces past observations well is supposed to provide good predictions.
Why not consider a collection of scenarios likely to occur over a given time horizon (drawing from a (predictive) probability distribution)?
The closer the forecast $\widehat y_t$ is to the observed $y_t$, the better the model, either according to the 1-norm, $|y_t - \widehat y_t|$, or to the 2-norm, $(y_t - \widehat y_t)^2$.
While this gives interesting information about the central tendency, it cannot be used to anticipate extreme events.
Forecasts and Predictions
More formally, we try to compare two very different objects: a function (the predictive probability distribution) and a real number (the observed value).
Natural idea : introduce a score, as in Good (1952, Rational Decisions) or Winkler
(1969, Scoring Rules and the Evaluation of Probability Assessors), used in
meteorology by Murphy & Winkler (1987, A General Framework for Forecast
Verification).
Let F denote the predictive distribution, expressing the uncertainty attributed to
future values, conditional on the available information.
Probabilistic Forecasts
Notion of probabilistic forecasts, see Gneiting & Raftery (2007, Strictly Proper Scoring Rules, Prediction, and Estimation).
In a general setting, we want to predict the value taken by a random variable $Y$.
Let $F$ denote a cumulative distribution function.
Let $\mathcal A$ denote the information available when the forecast is made.
$F$ is the ideal forecast for $Y$ given $\mathcal A$ if the law of $Y|\mathcal A$ has distribution $F$.
Suppose $F$ is continuous. Set $Z_F = F(Y)$, the probability integral transform of $Y$.
$F$ is probabilistically calibrated if $Z_F \sim \mathcal U([0,1])$
$F$ is marginally calibrated if $\mathbb E[F(y)] = \mathbb P[Y\le y]$ for any $y\in\mathbb R$.
Probabilistic Forecasts
Observe that for an ideal forecast, $F(y) = \mathbb P[Y\le y|\mathcal A]$, then
• $\mathbb E[F(y)] = \mathbb E\big[\mathbb P[Y\le y|\mathcal A]\big] = \mathbb P[Y\le y]$: this forecast is marginally calibrated
• $\mathbb P[Z_F\le z] = \mathbb E\big[\mathbb P[Z_F\le z|\mathcal A]\big] = z$: this forecast is probabilistically calibrated
Suppose $\mu\sim\mathcal N(0,1)$ and that the ideal forecast is $Y|\mu\sim\mathcal N(\mu,1)$, e.g. if $Y_t\sim\mathcal N(0,1)$ and $Y_{t+1} = y_t + \varepsilon_t\sim\mathcal N(y_t,1)$.
One can consider $F = \mathcal N(0,2)$ as a naïve forecast. This distribution is marginally calibrated and probabilistically calibrated, but not ideal.
One can consider $F$ a mixture of $\mathcal N(\mu,2)$ and $\mathcal N(\mu\pm 1,2)$, where “$\pm 1$” means $+1$ or $-1$ each with probability $1/2$ (a hesitating forecast). This distribution is probabilistically calibrated, but not marginally calibrated.
Probabilistic Forecasts
Indeed $\mathbb P[F(Y)\le u] = u$,
$$\mathbb P[F(Y)\le u] = \frac{\mathbb P[\Phi(Y)\le u] + \mathbb P[\Phi(Y+1)\le u]}{2} + \frac{\mathbb P[\Phi(Y)\le u] + \mathbb P[\Phi(Y-1)\le u]}{2}$$
One can consider F = N(−µ, 1). This distribution is marginally calibrated, but
not probabilistically calibrated.
In practice, we have a sequence of pairs $(Y_t, F_t)$.
The set of forecasts $F$ is said to perform well if, for all $t$, the predictive distributions $F_t$ are sharp (precise) and well calibrated.
Sharpness is related to the concentration of the predictive density around a central value (the degree of uncertainty).
Calibration is related to the consistency between the predictive distributions $F_t$ and the observations $y_t$.
Probabilistic Forecasts
Calibration is poor if 80%-confidence intervals (implied by the predictive distributions, i.e. $\big[F_t^{-1}(\alpha), F_t^{-1}(1-\alpha)\big]$) do not contain the $y_t$'s about 8 times out of 10.
To test marginal calibration, compare the empirical cumulative distribution function
$$\widehat G(y) = \lim_{n\to\infty}\frac 1n\sum_{t=1}^{n}\mathbf 1_{Y_t\le y}$$
and the average of the predictive distributions
$$\bar F(y) = \lim_{n\to\infty}\frac 1n\sum_{t=1}^{n}F_t(y)$$
To test probabilistic calibration, test whether the sample $\{F_t(Y_t)\}$ has a uniform distribution (the PIT approach), see Dawid (1984, Present Position and Potential Developments: The Prequential Approach).
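A small simulation (not in the slides) of the ideal and naïve forecasters from the previous slides illustrates the PIT approach: both produce roughly uniform PIT histograms, i.e. both are probabilistically calibrated.

```r
# mu ~ N(0,1), Y|mu ~ N(mu,1); ideal forecast N(mu,1), naive forecast N(0,2)
set.seed(1)
n  <- 1e4
mu <- rnorm(n)
y  <- rnorm(n, mean = mu, sd = 1)
pit_ideal <- pnorm(y, mean = mu, sd = 1)        # PIT of the ideal forecast
pit_naive <- pnorm(y, mean = 0, sd = sqrt(2))   # PIT of the naive forecast
par(mfrow = c(1, 2))
hist(pit_ideal, breaks = 20, main = "ideal")    # ~ uniform
hist(pit_naive, breaks = 20, main = "naive")    # ~ uniform as well
```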
Probabilistic Forecasts
One can also consider a score $S(F, y)$ for any distribution $F$ and any observation $y$.
The score is said to be proper if
$$\forall F, G,\qquad \mathbb E[S(G,Y)] \le \mathbb E[S(F,Y)]\ \text{ where } Y\sim G.$$
In practice, this expected value is approximated using
$$\frac 1n\sum_{t=1}^{n}S(F_t, Y_t)$$
One classical rule is the logarithmic score, $S(F,y) = -\log f(y)$ where $f$ denotes the density of $F$, if $F$ is (absolutely) continuous.
Another classical rule is the continuous ranked probability score (CRPS, see Hersbach (2000, Decomposition of the Continuous Ranked Probability Score for Ensemble Prediction Systems))
$$S(F,y) = \int_{-\infty}^{+\infty}\big(F(x) - \mathbf 1_{x\ge y}\big)^2\,dx = \int_{-\infty}^{y}F(x)^2\,dx + \int_{y}^{+\infty}\big(F(x)-1\big)^2\,dx$$
Probabilistic Forecasts
with empirical version
$$\bar S = \frac 1n\sum_{t=1}^{n}S(F_t, y_t) = \frac 1n\sum_{t=1}^{n}\int_{-\infty}^{+\infty}\big(F_t(x) - \mathbf 1_{x\ge y_t}\big)^2\,dx$$
studied in Murphy (1970, The ranked probability score and the probability score: a
comparison).
This rule is proper since
$$\mathbb E[S(F,Y)] = \int_{-\infty}^{\infty}\mathbb E\Big[\big(F(x) - \mathbf 1_{x\ge Y}\big)^2\Big]dx = \int_{-\infty}^{\infty}\Big([F(x)-G(x)]^2 + G(x)[1-G(x)]\Big)dx$$
which is minimal when $F = G$.
Probabilistic Forecasts
If $F$ corresponds to the $\mathcal N(\mu,\sigma^2)$ distribution,
$$S(F,y) = \sigma\left\{\frac{y-\mu}{\sigma}\left[2\Phi\!\left(\frac{y-\mu}{\sigma}\right)-1\right] + 2\,\varphi\!\left(\frac{y-\mu}{\sigma}\right) - \frac{1}{\sqrt{\pi}}\right\}$$
where $\Phi$ and $\varphi$ denote the standard Gaussian cdf and density.
Observe that
$$S(F,y) = \mathbb E\big|X - y\big| - \frac12\,\mathbb E\big|X - X'\big| \quad\text{where } X, X'\sim F$$
(with $X$ and $X'$ independent copies), cf. Gneiting & Raftery (2007, Strictly Proper Scoring Rules, Prediction, and Estimation).
If we use for $F$ the empirical cumulative distribution function
$$\widehat F_n(y) = \frac 1n\sum_{i=1}^{n}\mathbf 1_{y_i\le y},\quad\text{then}\quad S(\widehat F_n, y) = \frac 2n\sum_{i=1}^{n}\big(y_{i:n} - y\big)\left(\mathbf 1_{\,y\le y_{i:n}} - \frac{i - 1/2}{n}\right)$$
where $y_{1:n}\le\cdots\le y_{n:n}$ denote the order statistics.
Probabilistic Forecasts
Consider a Gaussian AR(p) time series,
$$Y_t = c + \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p} + \varepsilon_t, \quad\text{with } \varepsilon_t\sim\mathcal N(0,\sigma^2),$$
then the forecast at horizon 1 yields
$$F_t = \mathcal N\big({}_{t-1}\widehat Y_t,\ \sigma^2\big) \quad\text{where } {}_{t-1}\widehat Y_t = c + \varphi_1 Y_{t-1} + \cdots + \varphi_p Y_{t-p}.$$
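A hedged sketch on the series from earlier: one-step Gaussian predictive distributions from an AR(2) fit (the order is an arbitrary choice), and the corresponding PIT values on the validation period.

```r
library(forecast)
ar_fit  <- Arima(trainY, order = c(2, 0, 0))
full    <- ts(c(trainY, validY), frequency = 12, start = c(2012, 6))
refit   <- Arima(full, model = ar_fit)                 # same coefficients, new data
onestep <- window(fitted(refit), start = c(2017, 8))   # {t-1}hat Y_t over the validation period
pit     <- pnorm(validY, mean = onestep, sd = sqrt(ar_fit$sigma2))   # F_t(y_t)
hist(pit, breaks = 10)                                 # ~uniform if probabilistically calibrated
```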
Probabilistic Forecasts
Suppose that $Y$ can be explained by covariates $\boldsymbol x = (x_1,\cdots,x_m)$. Consider some kernel-based conditional density estimation
$$p(y|\boldsymbol x) = \frac{p(y,\boldsymbol x)}{p(\boldsymbol x)} = \frac{\sum_{i=1}^n K_h(y-y_i)\,K_h(\boldsymbol x-\boldsymbol x_i)}{\sum_{i=1}^n K_h(\boldsymbol x-\boldsymbol x_i)}$$
In the case of a linear model, there exists $\boldsymbol\theta$ such that $p(y|\boldsymbol x) = p(y|\boldsymbol\theta^{\mathsf T}\boldsymbol x)$, and
$$p(y|\boldsymbol\theta^{\mathsf T}\boldsymbol x = s) = \frac{\sum_{i=1}^n K_h(y-y_i)\,K_h(s-\boldsymbol\theta^{\mathsf T}\boldsymbol x_i)}{\sum_{i=1}^n K_h(s-\boldsymbol\theta^{\mathsf T}\boldsymbol x_i)}$$
Parameter $\boldsymbol\theta$ can be estimated using a proxy of the log-likelihood,
$$\widehat{\boldsymbol\theta} = \operatorname*{argmax}\left\{\sum_{i=1}^{n}\log p\big(y_i|\boldsymbol\theta^{\mathsf T}\boldsymbol x_i\big)\right\}$$
Time Series : Stacking
See Clemen (1989, Combining forecasts: A review and annotated bibliography)
See opera::oracle(Y = Y, experts = X, loss.type = 'square', model = 'convex')
(opera stands for Online Prediction by Expert Aggregation)
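A hedged sketch combining the forecasts fitted earlier (Holt-Winters, ETS, TBATS, auto.arima) on the validation period; the object names are the ones used above, and the aggregation rule "MLpol" is one arbitrary choice among those offered by opera.

```r
library(opera)
library(forecast)
h <- length(validY)
experts <- cbind(hw    = forecast(hw_fit,    h = h)$mean,
                 ets   = forecast(ets_fit,   h = h)$mean,
                 tbats = forecast(tbats_fit, h = h)$mean,
                 arima = forecast(auto.arima(trainY), h = h)$mean)
oracle(Y = as.numeric(validY), experts = experts, loss.type = "square", model = "convex")
mix <- mixture(Y = as.numeric(validY), experts = experts, model = "MLpol", loss.type = "square")
summary(mix)                                    # online convex aggregation of the experts
```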
Natural Language Processing & Probabilistic Language Models
Idea: P[today is Wednesday] > P[today Wednesday is]
Idea: P[today is Wednesday] > P[today is Wendy]
E.g. try to predict the missing word in “I grew up in France, I speak fluent ...”
Natural Language Processing & Probabilistic Language Models
Use of the chain rule
$$\mathbb P[A_1, A_2,\cdots,A_n] = \prod_{i=1}^{n}\mathbb P[A_i \mid A_1, A_2, \cdots, A_{i-1}]$$
For instance,
P(the wine is so good) = P(the) · P(wine | the) · P(is | the wine) · P(so | the wine is) · P(good | the wine is so)
Markov assumption & $k$-gram model:
$$\mathbb P[A_1, A_2,\cdots,A_n] \approx \prod_{i=1}^{n}\mathbb P[A_i \mid A_{i-k},\cdots, A_{i-1}]$$
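A toy sketch (not in the slides) estimating bigram probabilities $\mathbb P[A_i\mid A_{i-1}]$ by counting on a tiny invented corpus (sentence boundaries are ignored for simplicity):

```r
corpus <- c("the wine is so good", "the wine is good", "the beer is so so")
tokens <- unlist(strsplit(corpus, " "))
bigrams <- paste(head(tokens, -1), tail(tokens, -1))     # consecutive word pairs
p_next <- function(w, prev) {
  sum(bigrams == paste(prev, w)) / sum(head(tokens, -1) == prev)
}
p_next("wine", "the")   # P(wine | the)
p_next("is", "wine")    # P(is | wine)
```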