Time series deep learning

Time series analysis and prediction in the deep learning era
Alberto Arrigoni, PhD
February 2019

Time series: analysis and prediction
What will the future hold?
Past | Now | Future

Time series applications + context
Time series prediction: e.g.
demand/sales forecasting...
Use prediction for anomaly
detection: e.g. manufacturing
settings...
Counterfactual prediction:
e.g. marketing campaigns...
Show ads
Counterfactual

Time series prediction methods
(non-comprehensive list)
- Classical autoregressive models
- Bayesian AR models
- General machine learning approaches
- Deep learning

Time series prediction and sales forecasting: issues
E.g. retail businesses
(1) Number of time series (~ thousands) [the SCALE problem]
(2) Time series are often highly erratic, intermittent or bursty (...and on highly different scales)
[Figure: sales of Product A and Product B on different scales (~ 10 items vs. 2 items)]

Time series prediction and sales forecasting: issues
E.g. online retailer selling clothes
(3) Time series belong to a hierarchy of products/categories
(4) For new products historical data is missing (the cold-start problem)
[Figure: product hierarchy, Clothes (total sales) > T-shirts total sales > Nike t-shirts, Adidas t-shirts; scale labels ~ 100 and ~ 1000; time marker Now]

Classical autoregressive models
Workflow:
- Variance stabilization by log or Box-Cox transformation
- De-trending by differencing
- Estimate model order (AIC, BIC)
- Fit model parameters (maximum likelihood)
- Test residuals for randomness
[Formula labels: autoregressive component, moving average component]
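The workflow above can be sketched end to end on a synthetic series. This is a minimal illustration under stated assumptions, not the tooling the slides refer to: it uses plain numpy, fixes the order to AR(1) by hand instead of selecting it with AIC/BIC, fits by least squares instead of maximum likelihood, and uses the lag-1 autocorrelation of the residuals as a crude randomness check. The helper names `fit_ar1` and `lag1_autocorr` are made up for this example.

```python
import numpy as np

def fit_ar1(x):
    """Least-squares fit of x[t] = c + phi * x[t-1] + e[t]; returns c, phi, residuals."""
    X = np.column_stack([np.ones(len(x) - 1), x[:-1]])
    coef, *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    resid = x[1:] - X @ coef
    return coef[0], coef[1], resid

def lag1_autocorr(r):
    """Lag-1 autocorrelation, a crude check that residuals look like white noise."""
    r = r - r.mean()
    return float((r[:-1] * r[1:]).sum() / (r * r).sum())

# Synthetic positive series whose log-differences follow an AR(1) with phi = 0.6
rng = np.random.default_rng(0)
n = 500
growth = np.zeros(n)
for t in range(1, n):
    growth[t] = 0.01 + 0.6 * growth[t - 1] + rng.normal(scale=0.02)
y = np.exp(np.cumsum(growth))

z = np.log(y)                  # variance stabilization by log
dz = np.diff(z)                # de-trending by differencing
c, phi, resid = fit_ar1(dz)    # fit model parameters (order fixed to 1 here)
rho1 = lag1_autocorr(resid)    # residuals should be close to uncorrelated
print(f"phi={phi:.2f}, residual lag-1 autocorrelation={rho1:.3f}")
```

In practice a dedicated package would handle order selection and likelihood-based fitting; the point here is only the sequence of steps.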

Classical autoregressive models
THE PROS:
- Good explainability
- Solid theoretical background
- Very explicit model
- A lot of control as it is a manual process
THE CONS:
- Data is seldom stationary: trend, seasonality and cycles need to be modeled as well
- Computationally intensive (one model for
each time series)
- No information sharing across time series
(apart from Hyndman’s hts approach) *
- Historical data are essential for forecasting (no cold-start)
* https://robjhyndman.com/publications/hierarchical/
Tech stack and packages
- Rob Hyndman’s online text:
https://otexts.com/fpp2/
- Infamous auto.arima function (from R's forecast package), ets, tbats, garch, stl...
- Python’s Pyramid (pyramid-arima, since renamed pmdarima)

General machine learning approach for ts prediction
- Turn the time series prediction problem into a supervised learning problem
- Can use any number of methods (linear, trees, neural networks...)
- Easily extendable to support multiple input variables
- Covariates can be easily handled and transformed through feature engineering
E.g. feature engineering:
- Aggregate histograms over time scales
- Transform into Fourier space
- Add low/high pass filters as variables
[Figure: past values Y_t (autoregressive component) and covariates along the time axis t]
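Turning a forecasting problem into a supervised one mostly comes down to building a table of lagged values. The sketch below is one possible way to do it, assuming numpy; the function name `make_supervised` and the day-of-week covariate are illustrative choices, not something taken from the slides.

```python
import numpy as np

def make_supervised(series, n_lags, covariates=None):
    """Build (X, y): each row of X holds the n_lags previous values
    (the autoregressive part) plus optional covariates known at time t."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - n_lags:t] for t in range(n_lags, len(series))])
    if covariates is not None:
        cov = np.asarray(covariates, dtype=float)[n_lags:]
        X = np.hstack([X, cov])
    y = series[n_lags:]
    return X, y

series = np.arange(10.0)                          # toy series 0, 1, ..., 9
day_of_week = (np.arange(10) % 7).reshape(-1, 1)  # toy covariate
X, y = make_supervised(series, n_lags=3, covariates=day_of_week)
print(X[0], y[0])   # lags [0, 1, 2] plus covariate 3, target 3
```

Any regressor (linear model, tree ensemble, neural network) can then be trained on `X` and `y`.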

General machine learning approach for ts prediction
THE PROS:
- Can model non-linear relationships
- Can model the “hierarchical structure” of the time series through categorical variables
- Support for covariates (predictors) + feature engineering
- One model is shared among multiple time series
- Cold-start predictions are possible by iteratively feeding the predictions back to the feature space
THE CONS:
- Feature engineering takes time
- Long-term relationships between data points need to be explicitly modeled (autoregressive features)
Tech stack and packages
- Sklearn, PySpark for feature engineering, data reduction
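Feeding predictions back into the feature space amounts to a recursive multi-step forecast. Here is a minimal sketch, assuming a one-step model that maps a window of lags to the next value; `recursive_forecast` and the stand-in `mean_model` are hypothetical names used only for illustration.

```python
import numpy as np

def recursive_forecast(predict_one, history, n_lags, horizon):
    """Forecast `horizon` steps by appending each prediction to the lag window."""
    window = list(history[-n_lags:])
    forecast = []
    for _ in range(horizon):
        y_hat = float(predict_one(np.array(window)))
        forecast.append(y_hat)
        window = window[1:] + [y_hat]   # the prediction becomes the newest lag
    return forecast

# Hypothetical stand-in for a fitted model: predicts the mean of the lag window
def mean_model(lags):
    return lags.mean()

forecast = recursive_forecast(mean_model, [1.0, 2.0, 3.0], n_lags=3, horizon=2)
print(forecast)
```

Errors compound along the horizon with this scheme, which is one reason uncertainty estimates matter for longer forecasts.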

Bayesian AR models (Facebook Prophet)
Prophet is a Bayesian GAM (Generalized Additive Model) *
Sales(t) = linear trend with changepoints + seasonal component + holiday-specific component
1) Detect changepoints in the time series
2) Fit linear trend parameters (k and delta)
(Piecewise) linear trends: $g(t) = (k + a(t)^\top \delta)\, t$ **, where k is the growth rate and $a(t)^\top \delta$ is the growth rate adjustment
* Implemented using Stan
** An additional ‘offset’ term has been omitted from the formula
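A piecewise linear trend with changepoints can be written in a few lines. The sketch below follows the form given in the Prophet paper (Taylor and Letham) as I understand it, and it includes the offset term that the slide leaves out so that the trend stays continuous at each changepoint. The parameter names `m` (base offset) and `gamma` (offset adjustment), and the function name itself, are assumptions of this example.

```python
import numpy as np

def piecewise_linear_trend(t, k, m, changepoints, delta):
    """g(t) = (k + a(t)^T delta) * t + (m + a(t)^T gamma), with gamma_j = -s_j * delta_j."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(changepoints, dtype=float)
    delta = np.asarray(delta, dtype=float)
    A = (t[:, None] >= s[None, :]).astype(float)  # a(t): changepoints already passed
    gamma = -s * delta                            # keeps the trend continuous
    return (k + A @ delta) * t + (m + A @ gamma)

t = np.arange(6)
g = piecewise_linear_trend(t, k=1.0, m=0.0, changepoints=[2.0], delta=[1.0])
print(g)   # slope 1 before t = 2, slope 2 afterwards
```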

Bayesian AR models (Facebook Prophet)
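Seasonal component as a Fourier series: $s(t) = \sum_{n=1}^{N} \left( a_n \cos(2\pi n t / P) + b_n \sin(2\pi n t / P) \right)$
E.g. P = 365.25 for yearly seasonality (time measured in days)
Need to estimate 2N parameters (a_n and b_n) in Stan (MAP by default, MCMC optionally)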
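The 2N seasonal parameters multiply a fixed set of sine and cosine columns, so the seasonal component is linear in a_n and b_n. A small sketch of that design matrix, assuming numpy; the function name `fourier_features`, the weekly period and the coefficient values are illustrative only.

```python
import numpy as np

def fourier_features(t, period, order):
    """Columns cos(2*pi*n*t/P), sin(2*pi*n*t/P) for n = 1..order (2 * order columns)."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, order + 1):
        angle = 2.0 * np.pi * n * t / period
        cols.append(np.cos(angle))
        cols.append(np.sin(angle))
    return np.column_stack(cols)

t = np.arange(14)                                # two weeks of daily data
X = fourier_features(t, period=7.0, order=3)     # N = 3, so 6 parameters
beta = np.array([0.5, 1.0, 0.0, 0.2, 0.0, 0.0])  # made-up (a_1, b_1, a_2, b_2, a_3, b_3)
seasonal = X @ beta                              # repeats every 7 days
```

Whatever inference method is used, it only has to estimate the vector `beta`.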
Prophet is a Bayesian GAM (Generalized Additive Model)
Sales(t) = linear trend with changepoints + seasonal component + holiday-specific component

Bayesian AR models
THE PROS:
- Uncertainty estimation
- Bayesian changepoint detection
- User-in-the-loop paradigm (Prophet)
- Black-box variational inference is revolutionizing Bayesian inference
THE CONS:
- Bayesian inference takes time (the “scale” issue)
- One model for each time series
- No information sharing among series (unless you specify a hierarchical Bayesian model with shared parameters, but still...)
- Historical data are needed for prediction!
- Performance is often on par * with autoregressive models
Tech stack and packages
- Python/R clients for Prophet **
- R package for structural Bayesian time series models: bsts
* This may open endless discussions. Bottom line: depends on your data :)
** Taylor et al., Forecasting at scale

Interlude: uncertainty estimation with deep learning
- Uncertainty estimation is a prerogative of Bayesian methods.
- Black box variational inference (ADVI) has sparked renewed interest in Bayesian neural networks, but we are not there yet in terms of performance
- A DeepMind paper from NIPS 2017 introduces a simple yet effective way to estimate
predictive uncertainty using Deep Ensembles
For a TensorFlow implementation of this paper: https://arrigonialberto86.github.io/funtime/deep_ensembles.html
“Engineering Uncertainty Estimation in Neural Networks for Time Series Prediction at Uber”: https://eng.uber.com/neural-networks-uncertainty-estimation/

Interlude: Deep Ensembles
Train a deep learning model using a custom
final layer which parametrizes a Gaussian
distribution
Sample x from the Gaussian
distribution using fitted
parameters
Calculate loss to backpropagate the
error (using Gaussian likelihood)
(1)
(3)
(2)
Network output
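The loss in this setup is the Gaussian negative log-likelihood of the observed target under the predicted mean and standard deviation. A minimal numpy sketch (the function name `gaussian_nll` is chosen for this example, and no particular deep learning framework is assumed):

```python
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma^2)."""
    y, mu, sigma = (np.asarray(a, dtype=float) for a in (y, mu, sigma))
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma**2)
                         + (y - mu) ** 2 / (2.0 * sigma**2)))

# Same error of 1.0, but an overconfident sigma is penalized much more heavily
confident = gaussian_nll([1.0], [0.0], [0.1])
cautious = gaussian_nll([1.0], [0.0], [1.0])
print(confident, cautious)
```

Because the penalty depends on sigma, the network is pushed to report a larger variance where its errors are larger.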

What the network is learning: different
regions of the x space have different
variances
Generate a synthetic
dataset with different
variances
PREDICTION ON
TRAINING DATASET
SYNTHETIC TRAINING
DATASET
Use the network from previous
slide to predict on the training
set to see if it actually detects
variance reduction

Ensemble networks (improve generalization power)
The authors suggest training different NNs on the same data (the whole training set) with random initialization
Uniformly weighted mixture model
Predictions for regions outside of the training dataset show increasing variance (due to ensembling)
In addition to ‘distribution’ modeling and ensembling, the authors suggest using the fast gradient sign method * to produce adversarial training examples (not shown here)
* Goodfellow et al., 2014
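A uniformly weighted mixture of Gaussians can be summarized by a single mean and variance, which is how the ensemble members are combined in the Deep Ensembles paper. The sketch below assumes each member returns a mean and a standard deviation per input point; `combine_ensemble` is a name made up for this example.

```python
import numpy as np

def combine_ensemble(mus, sigmas):
    """Mean and std of a uniformly weighted mixture of Gaussians.
    mus, sigmas: arrays of shape (n_members, n_points)."""
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    mean = mus.mean(axis=0)
    var = (sigmas**2 + mus**2).mean(axis=0) - mean**2
    return mean, np.sqrt(var)

# Two members agree on the first point and disagree on the second
mus = [[0.0, 1.0], [0.0, 3.0]]
sigmas = [[1.0, 1.0], [1.0, 1.0]]
mean, std = combine_ensemble(mus, sigmas)
print(mean, std)
```

Disagreement between members adds to the variance, which is why predictions far from the training data become more uncertain.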

Custom GaussianLayer
Let’s just do some extra work and define a
custom layer
For a TensorFlow implementation of this paper: https://arrigonialberto86.github.io/funtime/deep_ensembles.html

Custom layer returns both
mu and sigma
Build 2 weight matrices + 2 bias terms
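The idea of such a layer can be shown without any framework. The following is a plain numpy sketch of the forward pass only, not the Keras layer from the linked post: it assumes a softplus transform (plus a small constant) to keep sigma positive, and the class name, initialization scale and shapes are illustrative.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

class GaussianLayer:
    """Final layer with two weight matrices and two bias terms,
    one pair for mu and one pair for sigma."""
    def __init__(self, n_in, rng):
        self.w_mu = rng.normal(scale=0.1, size=(n_in, 1))
        self.b_mu = np.zeros(1)
        self.w_sigma = rng.normal(scale=0.1, size=(n_in, 1))
        self.b_sigma = np.zeros(1)

    def __call__(self, h):
        mu = h @ self.w_mu + self.b_mu
        sigma = softplus(h @ self.w_sigma + self.b_sigma) + 1e-6  # strictly positive
        return mu, sigma

rng = np.random.default_rng(0)
layer = GaussianLayer(n_in=4, rng=rng)
mu, sigma = layer(rng.normal(size=(5, 4)))   # 5 hidden vectors of size 4
```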

DeepAR (Amazon)
Instead of fitting separate models for each time series we create a global model from related time
series to handle widely-varying scales through rescaling and velocity-based sampling.
[Figure labels: different scales, ~1000 time series, covariates, past, future, probabilities]
Flunkert et al., 2017

DeepAR (Amazon)
[Figure: unrolled LSTM with hidden states h_{t-1}, h_t, h_{t+1}]
- Use an LSTM to model interactions over time in the series
- As seen with the Deep Ensemble
architecture, we can predict parameters of
distributions at each time point (theta
vector)
- Time series need to be scaled for the
network to learn time-varying dynamics
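For the scaling step, the DeepAR paper divides each series by a factor based on its average value over the conditioning range (one plus the mean). A minimal sketch of that idea, assuming numpy and non-negative data; the function name `deepar_scale` and the toy sales figures are made up for illustration.

```python
import numpy as np

def deepar_scale(z_context):
    """Scale each series by nu_i = 1 + mean(z_i) over the conditioning range."""
    z = np.asarray(z_context, dtype=float)
    nu = 1.0 + z.mean(axis=-1, keepdims=True)
    return z / nu, nu

# Two products whose sales differ by two orders of magnitude
sales = np.array([[10.0, 12.0, 8.0, 10.0],
                  [1000.0, 1200.0, 800.0, 1000.0]])
scaled, nu = deepar_scale(sales)
print(nu.ravel())   # one scale factor per series
print(scaled)       # both rows now live on a comparable scale
```

The factor `nu` is kept so that the predicted distribution parameters can be mapped back to the original scale.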

DeepAR (Amazon)
[Figure panels: Training, Prediction]
* Likelihood/loss is customizable: Gaussian, or negative binomial for count data with overdispersion

For a commentary + code review: https://arrigonialberto86.github.io/funtime/deepar.html
DeepAR (Amazon)
The mandatory ‘AirPassengers’ prediction example (results shown on training set)
Granted, this is not the use case Amazon had in mind...

DeepAR (Amazon)
- Long-term relationships are handled by
design using LSTMs
- One model is fitted for all the time series
- The hierarchical ts structure and
inter-dependencies are captured by
using covariates (even holidays,
recurrent events etc...)
- The model can be used for cold-start
predictions (using categorical covariates
with ‘descriptive’ product information)
- Hassle-free uncertainty estimation
DeepAR and the AWS ecosystem
AWS SageMaker

Deep State Space (NIPS 2018)*
A state space model or SSM is just like a Hidden Markov Model, except the hidden states are continuous
- Observation (z_t) update
- Latent state (l_t) update
In normal settings we would need to fit these parameters for each time series
[Figure: chain of observations z_{t-1}, z_t, z_{t+1}; annotated ???]
* Rangapuram et al, 2018, Deep State Space Models for Time Series Forecasting

Deep State Space (NIPS 2018)
Training: compute the negative log-likelihood and learn the time-varying SS parameters using backpropagation
Prediction: use Kalman filtering to estimate l_t, then recursively apply the transition equation and the observation model to generate prediction samples
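The prediction step can be illustrated on the simplest state space model, a scalar local level. This is a sketch under strong assumptions: the parameters `g` and `sigma` are fixed constants here (in Deep State Space they would come from the network and vary over time), the observation is taken from the current level in the textbook way, and all function names are illustrative.

```python
import numpy as np

def kalman_filter_local_level(z, g, sigma, m0=0.0, p0=1e6):
    """Filter a local-level model: l_t = l_{t-1} + g*eps, z_t = l_t + sigma*eta.
    Returns the posterior mean and variance of the last latent state."""
    m, p = m0, p0
    for obs in z:
        p = p + g**2               # predict: the level is a random walk
        k = p / (p + sigma**2)     # Kalman gain
        m = m + k * (obs - m)      # update the mean with the new observation
        p = (1.0 - k) * p          # update the variance
    return m, p

def sample_forecasts(m, p, g, sigma, horizon, n_samples, rng):
    """Draw forecast paths by recursively applying transition and observation models."""
    level = rng.normal(m, np.sqrt(p), size=n_samples)
    paths = np.empty((n_samples, horizon))
    for h in range(horizon):
        level = level + g * rng.normal(size=n_samples)             # transition equation
        paths[:, h] = level + sigma * rng.normal(size=n_samples)   # observation model
    return paths

rng = np.random.default_rng(0)
z = 5.0 + rng.normal(scale=0.5, size=200)   # noisy observations around a level of 5
m, p = kalman_filter_local_level(z, g=0.05, sigma=0.5)
paths = sample_forecasts(m, p, g=0.05, sigma=0.5, horizon=3, n_samples=1000, rng=rng)
```

Quantiles of `paths` along the sample axis give prediction intervals for each forecast step.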

Deep State Space (NIPS 2018)
- Long-term relationships are handled by design using LSTMs
- One model is fitted for all the time series
- The hierarchical ts structure and inter-dependencies are captured by ad-hoc design and components of the SS model (even holidays, recurrent events etc...)
- The model can be used for cold-start predictions (using categorical covariates with ‘descriptive’ product information)

Going forward: Deep factors with GPs *
* Maddix et al., “Deep Factors with Gaussian Processes for Forecasting”, NIPS 2018
The combination of probabilistic graphical models with deep neural networks has been an active
research area recently
Global DNN backbone and local Gaussian Process (GP). The main idea is to represent each
time series as a combination of a global time series and a corresponding local model.
[Figure: an RNN (+ covariates) produces the global factors g_t, which feed each series z_{i,t}]
Backpropagation to find the RNN parameters that produce the global factors (g_t) + the GP hyperparameters

M4 forecasting competition winner algo (Uber, 2018)
The winning idea is often the simplest!
Hybrid Exponential Smoothing-Recurrent Neural Networks (ES-RNN) method. It
mixes hand-coded parts like ES formulas with a black-box recurrent neural network
(RNN) forecasting engine.
[Figure: RNN unrolled over y_{t-1}, y_t, y_{t+1}]
Deseasonalized and normalized vector of covariates + previous state
RNN results are now part of a parametric model

Summary of performance
Approaches compared: classical autoregressive models, Bayesian models (GAM/structural), classical machine learning, deep learning approaches
Criteria: scalability, info sharing across ts, cold-start predictions, uncertainty estimation, unevenly spaced time series *
[Table: per-approach ratings shown graphically; cell annotations: DeepAR, Deep Factors]
* Chen et al., Neural ordinary differential equations, 2018 / Futoma et al., 2017, Multitask GP + RNN

Deep State Space (Amazon)
Level-trend model parametrization:
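A sketch of a level-trend state space model, assuming the standard form with a two-dimensional latent state (level and slope): transition matrix F = [[1, 1], [0, 1]], emission vector a = [1, 1] and innovation strengths g = [alpha, beta]. The function name, the use of independent noise draws for the observation and the transition, and the constant (not time-varying) parameters are assumptions of this example.

```python
import numpy as np

F = np.array([[1.0, 1.0],    # new level = old level + old slope
              [0.0, 1.0]])   # new slope = old slope
a = np.array([1.0, 1.0])     # observation = level + slope of the previous state

def simulate_level_trend(l0, alpha, beta, sigma, b, horizon, rng):
    """Simulate z_t = a^T l_{t-1} + b + sigma*noise, then l_t = F l_{t-1} + g*noise."""
    g = np.array([alpha, beta])
    l = np.asarray(l0, dtype=float)
    z = np.empty(horizon)
    for t in range(horizon):
        z[t] = a @ l + b + sigma * rng.normal()   # observation model
        l = F @ l + g * rng.normal()              # transition equation
    return z

rng = np.random.default_rng(0)
# With all noise switched off the model reduces to a straight line
z = simulate_level_trend([10.0, 2.0], alpha=0.0, beta=0.0, sigma=0.0,
                         b=0.0, horizon=3, rng=rng)
print(z)
```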

DeepAR (Amazon)
Training procedure:
- Step 1: predict parameters (e.g. mu, sigma)
- Step 2: compute likelihood of the prediction (can be Gaussian as we have seen with Deep Ensembles) *
- Step 3: sample next point
* Likelihood/loss is customizable: Gaussian, or negative binomial for count data with overdispersion
[Figure panels: Training; Prediction (~ Monte Carlo)]
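The Monte Carlo prediction loop can be sketched with a stand-in for the trained network. Everything model-specific here is hypothetical: `toy_step` replaces the recurrent cell and simply smooths its input, the likelihood is Gaussian with a fixed sigma, and `sample_paths` is a name chosen for this example.

```python
import numpy as np

def sample_paths(step, h0, z_last, horizon, n_samples, rng):
    """Ancestral sampling: predict distribution parameters, sample the next
    point, and feed the sample back in as the next input."""
    paths = np.empty((n_samples, horizon))
    for s in range(n_samples):
        h, z = h0, z_last
        for t in range(horizon):
            h, mu, sigma = step(h, z)    # predict parameters from state and last value
            z = rng.normal(mu, sigma)    # sample the next point
            paths[s, t] = z
    return paths

def toy_step(h, z_prev):
    """Hypothetical stand-in for a trained recurrent cell."""
    h = 0.5 * h + 0.5 * z_prev
    return h, h, 0.1

rng = np.random.default_rng(0)
paths = sample_paths(toy_step, h0=1.0, z_last=1.0, horizon=5, n_samples=200, rng=rng)
q10, q50, q90 = np.quantile(paths, [0.1, 0.5, 0.9], axis=0)   # per-step quantiles
```

The spread between `q10` and `q90` at each step is the uncertainty estimate that comes out of the sampling procedure.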
