In this talk we give an overview of Sequence-to-Sequence (S2S) modeling and explore its early use cases. We walk the audience through how to leverage S2S modeling for several applications, particularly real-time anomaly detection and forecasting.
Arun Kejariwal, Statistical Learning Principal at Machine Zone, Inc.
17. REAL-TIME RECURRENT LEARNING #*
# A Learning Algorithm for Continually Running Fully Recurrent Neural Networks [Williams and Zipser, 1989]
* A Method for Improving the Real-Time Recurrent Learning Algorithm [Catfolis, 1993]
18. APPROXIMATE RTRL
UORO [Unbiased Online Recurrent Optimization]
Works in a streaming fashion
Online, Memoryless
Avoids backtracking through past activations and inputs
Low-rank approximation to forward-mode automatic differentiation
Reduced computation and storage
KF-RTRL [Kronecker Factored RTRL]
Kronecker product decomposition to approximate the gradients
Reduces noise in the approximation
Asymptotically, the noise is smaller by a factor of n
Memory requirement equivalent to UORO
Higher computation than UORO
Not applicable to arbitrary architectures
# Unbiased Online Recurrent Optimization [Tallec and Ollivier, 2017]
* Approximating Real-Time Recurrent Learning with Random Kronecker Factors [Mujika et al. 2018]
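Below is a minimal NumPy sketch of the UORO-style rank-1 approximation applied to a toy tanh RNN: instead of carrying the full RTRL sensitivity matrix, it maintains a single pair (s_tilde, theta_tilde) whose outer product is an unbiased estimate of dh/dW. The cell, dimensions, step size, and quadratic loss are illustrative assumptions, not the reference implementation of Tallec and Ollivier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 4                                      # hidden and input sizes (toy values)
W = rng.normal(scale=0.1, size=(n, n + m + 1))   # single parameter block [W_h | W_x | b]

def step(h, x, W):
    """One step of a vanilla tanh RNN: h' = tanh(W @ [h; x; 1])."""
    z = np.concatenate([h, x, [1.0]])
    return np.tanh(W @ z), z

# UORO keeps a rank-1 estimate: s_tilde (n,) and theta_tilde (n, n+m+1)
# such that E[outer(s_tilde, .) with theta_tilde] approximates dh/dW.
h = np.zeros(n)
s_tilde = np.zeros(n)
theta_tilde = np.zeros_like(W)
lr, eps = 0.01, 1e-8

for t in range(100):
    x = rng.normal(size=m)
    target = np.zeros(n)                         # dummy target for illustration
    h_new, z = step(h, x, W)

    # Jacobians of the transition for this simple cell.
    d = 1.0 - h_new ** 2                         # tanh' at the pre-activation
    J_h = d[:, None] * W[:, :n]                  # dh_new/dh, shape (n, n)
    nu = rng.choice([-1.0, 1.0], size=n)         # random signs
    dFdW_nu = (nu * d)[:, None] * z[None, :]     # nu^T (dh_new/dW), shape of W

    # Variance-reducing scaling factors, as in UORO.
    rho0 = np.sqrt((np.linalg.norm(theta_tilde) + eps) /
                   (np.linalg.norm(J_h @ s_tilde) + eps))
    rho1 = np.sqrt((np.linalg.norm(dFdW_nu) + eps) / (np.linalg.norm(nu) + eps))

    s_tilde = rho0 * (J_h @ s_tilde) + rho1 * nu
    theta_tilde = theta_tilde / rho0 + dFdW_nu / rho1

    # Online gradient estimate: dL/dW ~ (dL/dh . s_tilde) * theta_tilde
    dLdh = h_new - target                        # gradient of 0.5 * ||h - target||^2
    W -= lr * float(dLdh @ s_tilde) * theta_tilde
    h = h_new
```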
20. MEMORY-BASED RNN ARCHITECTURES
BRNN: Bi-directional RNN
[Schuster and Paliwal, 1997]
GLU: Gated Linear Unit
[Dauphin et al. 2016]
Long Short-Term Memory: LSTM
[Hochreiter and Schmidhuber, 1997]
Gated Recurrent Unit: GRU
[Cho et al. 2014]
Recurrent Highway Network: RHN
[Zilly et al. 2017]
21. LSTM [Neural Computation, 1997]
* Figure borrowed from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
(a) Forget gate (b) Input gate (c) Output gate
St: hidden state
“The LSTM’s main idea is that, instead of computing St from St-1 directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔSt, which is then added to St-1 to obtain St.” [Jozefowicz et al. 2015]
Resistant to the vanishing gradient problem
Achieves better results when dropout is used
Adding a bias of 1 to the LSTM’s forget gate improves performance
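A minimal NumPy sketch of one LSTM step, making the three gates and the additive state update explicit; the weight layout, toy sizes, and initialization are illustrative assumptions, with the forget-gate bias set to 1 as suggested above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4n, n+m), b: (4n,). Gate order: forget, input, output, candidate."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])            # (a) forget gate
    i = sigmoid(z[n:2 * n])        # (b) input gate
    o = sigmoid(z[2 * n:3 * n])    # (c) output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate update
    c = f * c_prev + i * g         # additive update: the delta is added to the old state
    h = o * np.tanh(c)
    return h, c

# Toy usage: n hidden units, m inputs, forget-gate bias initialized to 1.
n, m = 8, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n, n + m))
b = np.zeros(4 * n)
b[0:n] = 1.0
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=m), h, c, W, b)
```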
22. LONG CREDIT ASSIGNMENT PATHS
Stacking d RNNs; recurrence depth d
Incorporates Highway layers inside the recurrent transition
Highway layers in RHNs perform adaptive computation
Transform, Carry
H, T, C: non-linear transforms
Regularization: variational inference based dropout
* Figure borrowed from Zilly et al. 2017
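A minimal sketch of one recurrent highway transition of depth d in the H/T/C notation above; the coupled carry gate C = 1 - T, the parameter layout, and the toy sizes are simplifying assumptions rather than the exact Zilly et al. formulation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rhn_step(x, s_prev, Wx, R, b, depth):
    """One RHN time step with `depth` stacked highway layers in the recurrent
    transition. Wx: input projections (used only at layer 0), R: recurrent
    matrices per layer, b: biases per layer."""
    s = s_prev
    for l in range(depth):
        inp_h = Wx['H'] @ x if l == 0 else 0.0          # input enters only the first layer
        inp_t = Wx['T'] @ x if l == 0 else 0.0
        h = np.tanh(inp_h + R[l]['H'] @ s + b[l]['H'])  # non-linear transform H
        t = sigmoid(inp_t + R[l]['T'] @ s + b[l]['T'])  # transform gate T
        c = 1.0 - t                                     # coupled carry gate C
        s = h * t + s * c                               # adaptive mix of new and old state
    return s

# Toy usage: hidden size n, input size m, recurrence depth d.
n, m, d = 8, 3, 3
rng = np.random.default_rng(0)
Wx = {k: rng.normal(scale=0.1, size=(n, m)) for k in ('H', 'T')}
R = [{k: rng.normal(scale=0.1, size=(n, n)) for k in ('H', 'T')} for _ in range(d)]
b = [{'H': np.zeros(n), 'T': np.zeros(n)} for _ in range(d)]
s = rhn_step(rng.normal(size=m), np.zeros(n), Wx, R, b, d)
```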
31. ATTENTION FAMILY
Self: relates different positions of a single sequence in order to compute a representation of the same sequence; also referred to as intra-attention.
Global vs. Local: Global: alignment weights a_t are inferred from the current target state and all the source states. Local: alignment weights a_t are inferred from the current target state and those source states within a window.
Soft vs. Hard: Soft: alignment weights are learned and placed “softly” over all patches in the source image. Hard: only one patch of the image is selected to attend to at a time.
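A minimal NumPy sketch of soft, global self-attention (scaled dot-product) over a single sequence, i.e., every position attends to every position of the same sequence; the projection sizes and toy inputs are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Soft, global self-attention: every position of X attends to every
    other position of the same sequence (intra-attention)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # alignment scores, shape (T, T)
    A = softmax(scores, axis=-1)               # soft alignment weights over all positions
    return A @ V, A                            # new representation of the same sequence

# Toy usage: sequence of T=5 vectors of dimension d=16, projected to dk=8.
rng = np.random.default_rng(0)
T, d, dk = 5, 16, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, dk)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```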
34. SPARSE ATTENTIVE BACKTRACKING: TEMPORAL CREDIT ASSIGNMENT THROUGH REMINDING
✦ Inspired by the cognitive analogy of reminding
๏ Designed to retrieve one or very few past states
✦ Incorporates a differentiable, sparse (hard) attention mechanism to select from past states
# Figure borrowed from Ke et al. 2018.
35. HEALTH CARE
MULTI-VARIATE: sensor measurements, test results; irregular sampling, missing values and measurement errors; heterogeneous, presence of long-range dependencies
Multi-head attention, with additional masking to enable causality
Temporal ordering: positional encoding & dense interpolation embedding
Inference: diagnoses, length of stay, future illness, mortality
# Figure borrowed from Song et al. 2018.
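A minimal sketch of two ingredients highlighted on this slide, sinusoidal positional encoding for temporal ordering and an additive causal mask so that attention cannot look at future measurements; the names and shapes are illustrative assumptions, not the implementation of Song et al. 2018.

```python
import numpy as np

def positional_encoding(T, d):
    """Standard sinusoidal positional encoding (sin on even, cos on odd dims)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def causal_mask(T):
    """Additive mask: -inf above the diagonal so position t attends only to <= t."""
    return np.triu(np.full((T, T), -np.inf), k=1)

# Toy usage: add the encoding to embedded measurements, then mask the attention scores.
T, d = 6, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d)) + positional_encoding(T, d)
scores = (X @ X.T) / np.sqrt(d) + causal_mask(T)   # masked scores then feed a softmax
```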
37. ANODOT MISSION: MAKING BI AUTONOMOUS
Auto ML: Trend, Anomaly, Root Cause, Forecast, What If, Optimization
Real-time, No Code, No Data Scientist
Business Monitoring, Business Forecast
39. FORECAST USE CASES
DEMAND FORECAST: Anticipate demand for inventory, products, service calls and much more.
GROWTH FORECAST: Anticipate revenue growth, expenses, cash flow and other KPIs.
TRANSPORTATION (BUSINESS OPERATIONS / DATA SCIENCE DEPARTMENT): How many drivers will I need tomorrow?
FINTECH (TREASURY DEPARTMENT): How many funds do I need to allocate per currency?
ALL INDUSTRIES (FINANCE DEPARTMENT): Will we hit our targets next quarter?
42. CONSIDERATIONS FOR AN ACCURATE FORECAST
1. Discovering influencing metrics and events
2. Ensemble of models
3. Identify and account for data anomalies
4. Identify and account for different time series behaviors
43. HOW TO DISCOVER INFLUENCING METRICS/EVENTS?
INPUT:
• Target time series + forecast horizon
• Millions of measures/events that can be used as features
PROCEDURE (a minimal sketch follows below):
STEP 1: Compute the correlation between the target and each measure/event (shifted by the forecast horizon)
STEP 2: Choose the X most correlated measures
STEP 3: Train the forecast model
CHALLENGES:
• Step 1 is computationally expensive for long sequences: use LSH for speed
• Which correlation function to use?
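A minimal pandas sketch of the procedure above: shift each candidate measure by the forecast horizon, correlate it with the target, and keep the X most correlated ones (the LSH speed-up mentioned on the slide is omitted). The choice of Pearson correlation and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd

def top_correlated_features(target: pd.Series, candidates: pd.DataFrame,
                            horizon: int, top_x: int) -> pd.Index:
    """Step 1: correlate the target with each candidate shifted by the horizon.
    Step 2: keep the X candidates with the largest absolute correlation."""
    shifted = candidates.shift(horizon).iloc[horizon:]   # candidate value `horizon` steps earlier
    corr = shifted.corrwith(target.iloc[horizon:]).abs() # Pearson correlation per column
    return corr.nlargest(top_x).index

# Toy usage with synthetic data; in practice `candidates` would hold millions of measures/events.
rng = np.random.default_rng(0)
n = 500
target = pd.Series(np.sin(np.arange(n) / 20) + 0.1 * rng.normal(size=n))
candidates = pd.DataFrame({
    'leading_copy': target.shift(-3).fillna(0) + 0.05 * rng.normal(size=n),
    'noise_1': rng.normal(size=n),
    'noise_2': rng.normal(size=n),
})
features = top_correlated_features(target, candidates, horizon=3, top_x=1)
# Step 3 (not shown): train the forecast model on the selected features.
```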
45. IDENTIFYING AND ACCOUNTING FOR DATA ANOMALIES
ANOMALIES DEGRADE FORECASTING ACCURACY
How to remedy the situation? Discover anomalies and use the information to create new features:
Case 1: Anomalies can be explained by external factors – enhance the anomalies
Case 2: Anomalies can’t be explained by external factors – down-weight the anomalies
46. IDENTIFYING AND ACCOUNTING FOR DATA ANOMALIES
SOLUTION (sketched below): Discover anomalies and use the information to create new features:
Case 1: Anomalies can be explained by external factors – enhance the anomalies
Case 2: Anomalies can’t be explained by external factors – down-weight the anomalies
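A minimal sketch of the solution above, using a simple rolling z-score as a stand-in anomaly detector: explainable anomalies become an indicator feature (Case 1), unexplained ones are down-weighted via sample weights (Case 2). The detector, window, threshold, and weight value are placeholder assumptions, not Anodot's actual anomaly detection.

```python
import numpy as np
import pandas as pd

def rolling_zscore_anomalies(y: pd.Series, window: int = 48, thresh: float = 3.0) -> pd.Series:
    """Boolean mask of points far from the rolling mean (a stand-in detector)."""
    mu = y.rolling(window, min_periods=window // 2).mean()
    sd = y.rolling(window, min_periods=window // 2).std()
    return ((y - mu).abs() / sd) > thresh

def anomaly_features(y: pd.Series, explainable: bool):
    anomalies = rolling_zscore_anomalies(y)
    if explainable:
        # Case 1: anomalies are explained by external factors -> expose them as a feature.
        return pd.DataFrame({'y': y, 'is_anomaly': anomalies.astype(float)}), None
    # Case 2: unexplained anomalies -> down-weight them during training.
    sample_weight = np.where(anomalies, 0.1, 1.0)
    return pd.DataFrame({'y': y}), sample_weight
```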
49. A SINGLE MODEL FOR THEM ALL?
POTENTIAL ADVANTAGES
● Train one model for many time series
● Less data required per time series
OPEN QUESTIONS
● Will a single model be more accurate than individual ones?
● Which types of differing behaviors adversely impact the ability to train a single model, and which do not?
50. A SINGLE MODEL FOR THEM ALL? TESTING THE IMPACT OF EACH BEHAVIOR TYPE
[Pipeline diagram: for each dataset, compute the strength of the behavior for each TS and split into high-strength and low-strength TS; train one LSTM per TS and one LSTM for all TS in each group, plus a horizontal-line benchmark; forecasts are scored as model loss / benchmark loss (absolute error), yielding Scores 1 to 5.]
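A minimal sketch of the scoring used in this experiment: each model's absolute forecast error divided by the error of a horizontal-line benchmark. The benchmark level is assumed here to be the training mean, a detail the slide does not specify.

```python
import numpy as np

def benchmark_score(y_train: np.ndarray, y_test: np.ndarray, model_forecast: np.ndarray) -> float:
    """Score = model loss / benchmark loss (absolute error).
    Benchmark: a horizontal line, assumed here to be the training mean."""
    benchmark_forecast = np.full_like(y_test, y_train.mean(), dtype=float)
    model_loss = np.mean(np.abs(y_test - model_forecast))
    benchmark_loss = np.mean(np.abs(y_test - benchmark_forecast))
    return model_loss / benchmark_loss   # < 1 means the model beats the naive benchmark
```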
51. A SINGLE MODEL FOR THEM ALL? TESTING THE IMPACT OF EACH BEHAVIOR TYPE
[Chart (by feature): impact of the behavior on mixed training vs. impact of the behavior on the ability to forecast; Scores 1 and 2: low, Score 3: low/high, Scores 4 and 5: high.]
52. [Scatter plot of time series behavior features (e.g., seasonal_strength, curvature, x_pacf5, linearity, hurst, x_acf1, x_acf10, e_acf10, diff1_acf10, diff2_acf10, entropy, max_level_shift, time_level_shift, max_var_shift, max_kl_shift, time_kl_shift, unitroot_kpss, unitroot_pp, seasonal frequency, arch_acf, garch_acf, seas_pacf, trough, peak, stability, lumpiness) plotted by impact on accuracy for joint training vs. impact on accuracy for the variability of the behavior; highlighted clusters: 1 Seasonality, 2 Homoscedasticity.]
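A minimal sketch of two of the behavior features named in the figure above, seasonal strength (via an STL decomposition) and lumpiness (variance of windowed variances), following the common tsfeatures-style definitions; the assumption that these are the slide's exact definitions, plus the period and window sizes, are illustrative.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def seasonal_strength(y: pd.Series, period: int) -> float:
    """Strength of seasonality: max(0, 1 - Var(remainder) / Var(seasonal + remainder))."""
    res = STL(y, period=period).fit()
    detrended = res.seasonal + res.resid
    return max(0.0, 1.0 - np.var(res.resid) / np.var(detrended))

def lumpiness(y: pd.Series, window: int) -> float:
    """Variance of the variances of non-overlapping windows."""
    chunks = [y[i:i + window] for i in range(0, len(y) - window + 1, window)]
    return float(np.var([np.var(c) for c in chunks]))
```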
53. MAIN CONCLUSIONS / SOLUTIONS
[Inset chart: impact on accuracy for joint training vs. impact on accuracy for the variability of the behavior]
● Two main factors prevent simple training of single models
● Seasonality: the frequency is the important factor, not the shape
● Homoscedasticity (same variance): prevents mixing, but its strength impacts overall accuracy
● Other behaviors have a lower mixing impact
SOLUTIONS
● Separate TS for training based on behavior
● Embed behavior-related features for single-model training
54. KEY TAKEAWAYS
1. Discovering influencing metrics and events: requires efficient feature selection
2. Identify and account for data anomalies: preprocessing before training boosts forecast accuracy
3. Identify and account for different time series behaviors: seasonality and homoscedasticity are the key behaviors impacting the ability to train joint models
56. READINGS
[Rosenblatt] Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms
[Eds. Anderson and Rosenfeld] Neurocomputing: Foundations of Research
[Eds. Rumelhart and McClelland] Parallel Distributed Processing
[Werbos] The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting
[Eds. Chauvin and Rumelhart] Backpropagation: Theory, Architectures, and Applications
[Rojas] Neural Networks: A Systematic Introduction
[BOOKS]
57. READINGS
Perceptrons [Minsky and Papert, 1969]
Une procédure d'apprentissage pour réseau à seuil asymétrique (A learning procedure for an asymmetric threshold network) [Le Cun, 1985]
The problem of serial order in behavior [Lashley, 1951]
Beyond regression: New tools for prediction and analysis in the behavioral sciences [Werbos, 1974]
Connectionist models and their properties [Feldman and Ballard, 1982]
Learning-logic [Parker, 1985]
[EARLY WORKS]
58. READINGS
Learning internal representations by error propagation [Rumelhart, Hinton, and Williams, Chapter 8 in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing, Vol. 1, 1986] (Generalized Delta Rule)
Generalization of backpropagation with application to a recurrent gas market model [Werbos, 1988]
Generalization of backpropagation to recurrent and higher order networks [Pineda, 1987]
Backpropagation in perceptrons with feedback [Almeida, 1987]
Second-order backpropagation: Implementing an optimal O(n) approximation to Newton's method in an artificial neural network [Parker, 1987]
Learning phonetic features using connectionist networks: an experiment in speech recognition [Watrous and Shastri, 1987] (Time-delay NN)
[BACKPROPAGATION]
59. READINGS
Backpropagation: Past and future [Werbos, 1988]
Adaptive state representation and estimation using recurrent connectionist networks [Williams, 1990]
Generalization of back propagation to recurrent and higher order neural networks [Pineda, 1988]
Learning state space trajectories in recurrent neural networks [Pearlmutter 1989]
Parallelism, hierarchy, scaling in time-delay neural networks for spotting Japanese phonemes/CV-syllables [Sawai et al. 1989]
The role of time in natural intelligence: implications for neural network and artificial intelligence research [Klopf and Morgan, 1990]
[BACKPROPAGATION]
60. READINGS
Recurrent Neural Network Regularization [Zaremba et al. 2014]
Regularizing RNNs by Stabilizing Activations [Krueger and Memisevic, 2016]
Sampling-based Gradient Regularization for Capturing Long-Term Dependencies in Recurrent Neural Networks [Chernodub and Nowicki 2016]
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks [Gal and Ghahramani, 2016]
Noisin: Unbiased Regularization for Recurrent Neural Networks [Dieng et al. 2018]
State-Regularized Recurrent Neural Networks [Wang and Niepert, 2019]
[REGULARIZATION of RNNs]
61. READINGS
A Decomposable Attention Model for Natural Language Inference [Parikh et al. 2016]
Hybrid Computing Using A Neural Network With Dynamic External Memory [Graves et al. 2017]
Image Transformer [Parmar et al. 2018]
Universal Transformers [Dehghani et al. 2019]
The Evolved Transformer [So et al. 2019]
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [Dai et al. 2019]
[ATTENTION & TRANSFORMERS]
62. READINGS
Financial Time Series Prediction using hybrids of Chaos Theory, Multi-layer Perceptron and Multi-objective Evolutionary Algorithms [Ravi et al. 2017]
Model-free Prediction of Noisy Chaotic Time Series by Deep Learning [Yeo, 2017]
DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks [Salinas et al. 2017]
Real-Valued (Medical) Time Series Generation With Recurrent Conditional GANs [Hyland et al. 2017]
R2N2: Residual Recurrent Neural Networks for Multivariate Time Series Forecasting [Goel et al. 2017]
Temporal Pattern Attention for Multivariate Time Series Forecasting [Shih et al. 2018]
[TIME SERIES PREDICTION]
63. READINGS
Unbiased Online Recurrent Optimization [Tallec and Ollivier, 2017]
Approximating real-time recurrent learning with random Kronecker factors [Mujika et al. 2018]
Theory and Algorithms for Forecasting Time Series [Kuznetsov and Mohri, 2018]
Foundations of Sequence-to-Sequence Modeling for Time Series [Kuznetsov and Mariet, 2018]
On the Variance of Unbiased Online Recurrent Optimization [Cooijmans and Martens, 2019]
Backpropagation through time and the brain [Lillicrap and Santoro, 2019]
[POTPOURRI]
64. RESOURCES
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
A review of Dropout as applied to RNNs
https://medium.com/@bingobee01/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b
https://distill.pub/2016/augmented-rnns/
https://distill.pub/2019/memorization-in-rnns/
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Using the latest advancements in deep learning to predict stock price movements
https://towardsdatascience.com/aifortrading-2edd6fac689d
How to Use Weight Regularization with LSTM Networks for Time Series Forecasting
https://machinelearningmastery.com/use-weight-regularization-lstm-networks-time-series-forecasting/