In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the two topics are intertwined, and detail the challenges one may encounter with production data. We also showcase how deep learning can be leveraged to learn nonlinear correlations, which in turn can be used to further contain the false positive rate of an anomaly detection system. Finally, we provide an overview of how correlation can be leveraged for common representation learning.
Recurrent Neural Networks
Long History!
- RTRL [Robinson and Fallside 1987]
- TDNN [Waibel 1987]
- BPTT [Werbos 1988]
- NARX [Lin et al. 1996]
S_t: hidden state
"The LSTM's main idea is that, instead of computing S_t from S_{t-1} directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔS_t, which is then added to S_{t-1} to obtain S_t." [Jozefowicz et al. 2015]
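To make the additive update concrete, here is a minimal sketch of a single LSTM step (NumPy assumed; the stacked-parameter layout and names are illustrative, not from the talk). The gated delta is added to the previous cell state rather than replacing it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the four gates' parameters stacked row-wise."""
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget / input / output gates
    c = f * c_prev + i * np.tanh(g)               # additive state update: delta added to c_prev
    h = o * np.tanh(c)                            # new hidden state
    return h, c
```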
RNN: Long Short-Term Memory
Over 20 yrs! (Neural Computation, 1997)
- Resistant to the vanishing gradient problem
- Achieves better results when dropout is used
- Adding a bias of 1 to the LSTM's forget gate
[Figure: LSTM cell with (a) forget gate, (b) input gate, (c) output gate; borrowed from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
RNN: Long Short-Term Memory
Application to Anomaly Detection
- Prediction error: finding pattern anomalies (see the sketch below)
- No need for a fixed-size window for model estimation
- Works with non-stationary time series with irregular structure
- LSTM Encoder-Decoder
- Explore other extensions of LSTMs such as GRU, Memory Networks, Convolutional LSTM, Quasi-RNN, Dilated RNN, Skip RNN, HF-RNN, Bi-RNN, Zoneout (regularizing RNNs), TAGM
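A minimal sketch of prediction-error-based detection (PyTorch assumed; the window size, hidden size, and thresholding rule are illustrative choices, not from the talk): train an LSTM to forecast the next point, then flag points whose forecast error is unusually large.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):             # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the next value

def anomaly_scores(model, series, window=50):
    """Prediction error at each step of a 1-D tensor; large errors suggest anomalies."""
    model.eval()
    errors = []
    with torch.no_grad():
        for t in range(window, len(series) - 1):
            x = series[t - window:t].view(1, window, 1)
            pred = model(x).item()
            errors.append(abs(pred - series[t].item()))
    return errors

# After training, flag points whose error exceeds, e.g., mean + 3 * std of the errors.
```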
Alternatives to RNNs
TCN*: Temporal Convolutional Network = 1D fully-convolutional network + causal convolutions (see the sketch at the end of this slide)
Assumptions and optimizations:
- "Stability"
- Dilation
- Residual connections
Advantages:
- Inference speed
- Parallelizability
- Trainability
Feed-forward models#: Gated-Convolutional Language Model, Universal Transformer, WaveNet
* "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", by Bai et al. 2018. (https://arxiv.org/pdf/1803.01271.pdf)
# “When Recurrent Models Don't Need to be Recurrent”, by Miller and Hardt, 2018. (https://arxiv.org/pdf/1805.10369.pdf)
“… the “infinite memory” advantage of RNNs is largely absent in practice.”
“The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history.”
[Bai et al. 2018]
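A minimal sketch of the causal-convolution building block (PyTorch assumed; channel counts, kernel size, and the block layout are illustrative, not the exact architecture of Bai et al.): left-padding ensures the output at time t depends only on inputs at times ≤ t, while dilation and a residual connection mirror the optimizations listed above.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # amount of left padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad the past only, never the future
        return self.conv(x)

class TCNBlock(nn.Module):
    """Two causal convolutions + ReLU, wrapped in a residual connection."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(ch, ch, dilation=dilation), nn.ReLU(),
            CausalConv1d(ch, ch, dilation=dilation), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection
```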
Challenges: Going Beyond
- Large number of time series: up to hundreds of millions
- Dashboarding: impractical
- Sea of anomalies: fatigue
- Root cause analysis: long TTD (time to detect)
Multi-variate Analysis
- Curse of dimensionality
- Computationally expensive: real-time constraints
- Dimensionality reduction: PCA, SVD (see the sketch below)
- Recency: decision making
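A minimal sketch of SVD-based dimensionality reduction (NumPy assumed; the data shape and k are illustrative): project a large set of series onto their top principal components as a cheap preprocessing step before downstream correlation or anomaly analysis.

```python
import numpy as np

def reduce_dim(X, k=10):
    """X: (n_series, n_timesteps). Returns k-dimensional scores per series."""
    Xc = X - X.mean(axis=0, keepdims=True)          # center each timestep
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # project onto top-k principal directions

X = np.random.randn(1000, 500)  # 1,000 series, 500 points each (toy data)
Z = reduce_dim(X, k=10)         # shape (1000, 10)
```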
Correlation Analysis: Surfacing Actionable Insights
- Correlation matrix: bird's-eye view (see the sketch below)
- Not scalable: O(n²) with millions of time series/data streams
- Meaningless correlations: lack of context
- Systems domain: exploiting topology
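A minimal sketch of a vectorized Pearson correlation matrix (NumPy assumed; row normalization is an illustrative implementation choice). The n² pairwise entries are exactly what makes this impractical at millions of series, motivating topology-based pruning.

```python
import numpy as np

def correlation_matrix(X):
    """X: (n_series, n_timesteps) -> (n_series, n_series) Pearson matrix."""
    Z = X - X.mean(axis=1, keepdims=True)           # center each series
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize each series
    return Z @ Z.T                                  # n^2 pairwise dot products
```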
Correlation Coefficients: Different Flavors
- Pearson [Pearson 1900]
- Goodman and Kruskal 𝛾 [Goodman and Kruskal '54]
- Kendall 𝜏 [Kendall '38]
- Spearman 𝜌 [Spearman 1904, 1906]
- Somers' D [Somers '62]
- Cohen's 𝜅 [Cohen '60]
- Cramér's V [Cramér '46]
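A minimal sketch (SciPy assumed; the synthetic data are illustrative) contrasting three of these coefficients on the same pair: Pearson measures linear association, while Spearman 𝜌 and Kendall 𝜏 measure monotonic association, so they diverge on a nonlinear but monotonic relationship.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.linspace(0, 5, 100)
y = np.exp(x) + np.random.normal(scale=5.0, size=x.size)  # monotonic, nonlinear

print("Pearson r:   ", stats.pearsonr(x, y)[0])    # well below 1: relation is nonlinear
print("Spearman rho:", stats.spearmanr(x, y)[0])   # near 1: relation is monotonic
print("Kendall tau: ", stats.kendalltau(x, y)[0])  # near 1: relation is monotonic
```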
Pearson Correlation: A Deep Dive
- Robustness: sensitive to outliers
- Amenable to incremental computation (see the sketch below)
- Linear correlation: susceptible to curvature and to the magnitude of the residuals
- Not rotation invariant
- Susceptible to heteroscedasticity
- Trade-off: speed vs. accuracy
* Figure borrowed from "Robust Correlation: Theory and Applications", by Shevlyakov and Oja.
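To illustrate why Pearson lends itself to incremental computation, here is a minimal sketch (pure Python; the class and attribute names are my own) that maintains the correlation over a stream with Welford-style running co-moments, so each update is O(1):

```python
class StreamingPearson:
    def __init__(self):
        self.n = 0
        self.mx = self.my = 0.0               # running means
        self.sxx = self.syy = self.sxy = 0.0  # running (co)moments

    def update(self, x, y):
        self.n += 1
        dx = x - self.mx                 # deviation from the old mean of x
        self.mx += dx / self.n
        dy = y - self.my
        self.my += dy / self.n
        self.sxx += dx * (x - self.mx)   # Welford update: old dev * new dev
        self.syy += dy * (y - self.my)
        self.sxy += dx * (y - self.my)

    @property
    def r(self):
        denom = (self.sxx * self.syy) ** 0.5
        return self.sxy / denom if denom > 0 else float("nan")

sp = StreamingPearson()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    sp.update(x, y)
print(sp.r)  # 1.0 for perfectly linear data
```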
Modalities*
- Time series
- Text: sentiment analysis
- Image: animated GIFs
- Audio: digital assistants
- Video: live streams
- Haptic: robotics
* "Multimodal Machine Learning: A Survey and Taxonomy", by Baltrušaitis et al., 2017.
LEVERAGING CORRELATION: Deep Learning Context
- Feature extraction: Correlation Embedding Analysis [Fu et al. 2008]
- Feature extraction: Correlational PCA [Fu et al. 2008]
- Common representation learning: Correlational NN [Chandar et al. 2017]
- Face detection: Correlation Loss [Deng et al. 2017]
- Object tracking: CFNet [Valmadre et al. 2017]
Loss Functions: Different Flavors
- Class separability of features (minimize interclass correlation): Softmax Loss; Triplet Loss [Schroff et al. 2015]; Improved Triplet Loss [Cheng et al. 2016]; Center Loss [Wen et al. 2016]; Center Invariant Loss [Wu et al. 2017]
- Larger inter-class variation and a smaller intra-class variation: Quadruplet Loss [Chen et al. 2017]
- Separability and discriminatory ability of features (maximize intraclass correlation): Correlation Loss [Deng et al. 2017]
Correlation Loss: Deep Dive
- Normalization: non-linearly changes the distribution; yields a non-Gaussian distribution
- Uncentered Pearson correlation: angle similarity; insensitive to anomalies
- Enlarges margins amongst different classes (a sketch follows below)
[Figure: distribution of deeply learned features under softmax loss vs. correlation loss; borrowed from [Deng et al. 2017].]
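A minimal sketch of the angle-similarity idea (PyTorch assumed; this is my simplified rendering, not the exact formulation of [Deng et al. 2017]): normalizing features and class centers turns the uncentered Pearson correlation into a cosine, and the loss pulls each feature toward its class center in angle, maximizing intraclass correlation.

```python
import torch
import torch.nn.functional as F

def correlation_loss(features, labels, centers):
    """features: (B, d); labels: (B,) class indices; centers: (num_classes, d)."""
    f = F.normalize(features, dim=1)         # unit-normalized features:
    c = F.normalize(centers[labels], dim=1)  # uncentered Pearson == cosine of the angle
    return (1.0 - (f * c).sum(dim=1)).mean() # 1 - cos(feature, class center), averaged
```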
CANONICAL CORRELATION ANALYSIS: Common Representation Learning
- Deep Generalized [Benton et al. 2017]
- Deep Discriminative [Dorfer and Widmer 2016]
- Deep Variational [Wang et al. 2016]
- Soft Decorrelation [Chang et al. 2017]
- Maximize correlation of the views when projected to a common subspace (see the sketch below)
- Minimize self- and cross-reconstruction error and maximize correlation
- Leverage CRL for transfer learning: high commercial potential
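For reference, a minimal sketch of classical linear CCA (NumPy assumed; the regularization constant is an illustrative choice): it finds per-view projections that maximize correlation in a common subspace, which is what the deep variants above generalize with nonlinear maps.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """X: (n, dx), Y: (n, dy) -> projections (dx, k), (dy, k) and canonical correlations."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # regularized view covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n                             # cross-covariance
    # Whiten each view; the SVD of the whitened cross-covariance gives
    # the canonical directions and correlations.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    Lx_inv, Ly_inv = np.linalg.inv(Lx), np.linalg.inv(Ly)
    U, S, Vt = np.linalg.svd(Lx_inv @ Cxy @ Ly_inv.T)
    A = Lx_inv.T @ U[:, :k]   # projection for view X
    B = Ly_inv.T @ Vt[:k].T   # projection for view Y
    return A, B, S[:k]        # S[:k]: canonical correlations
```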
READINGS: Surveys & Books
- "Parallel and Distributed Processing", Rumelhart and McClelland (Eds.), 1986.
- "Neurocomputing: Foundations of Research", Anderson and Rosenfeld, 1988.
- "The Roots of Backpropagation", Werbos, 1994.
- "Neural Networks: A Systematic Introduction", Rojas, 1996.
READINGS: Surveys & Books (continued)
- "Deep Learning", LeCun et al., 2015.
- "Deep Learning in Neural Networks: An Overview", Schmidhuber, 2015.
- "Deep Learning", Goodfellow et al., 2016.
- "Neuro-Dynamic Programming", Bertsekas, 1996.
- [Werbos '74] "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences"
- [Parker '82] "Learning Logic"
- [Rumelhart, Hinton and Williams '86] "Learning Internal Representations by Error Propagation"
- [Lippmann '87] "An Introduction to Computing with Neural Networks"
- [Widrow and Lehr '90] "30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation"
- [Wang and Raj '17] "On the Origin of Deep Learning"
- [Arulkumaran et al. '17] "A Brief Survey of Deep Reinforcement Learning"
- [Alom et al. '18] "The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches"
- [Higham and Higham '18] "Deep Learning: An Introduction for Applied Mathematicians"
- [Marcus '18] "Deep Learning: A Critical Appraisal"
READINGS: Anomaly Detection
- "Understanding Anomaly Detection": safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/
- https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron
- "Variational Inference for On-Line Anomaly Detection in High-Dimensional Time Series", by Sölch et al., 2016. https://arxiv.org/pdf/1602.07109.pdf
- https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265
- "Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction", by Sakurada and Yairi, 2014. https://dl.acm.org/citation.cfm?id=2689747
- "On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data", by Choudhary et al., 2017. https://arxiv.org/abs/1710.04735
READINGS: RNNs
- "Learning to Forget: Continual Prediction with LSTM", by Gers et al., 2000.
- "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling", by Chung et al., 2014.
- "An Empirical Exploration of Recurrent Network Architectures", by Jozefowicz et al., 2015.
- "On the Properties of Neural Machine Translation: Encoder–Decoder Approaches", by Cho et al., 2014.
- "Visualizing and Understanding Recurrent Networks", by Karpathy et al., 2015.
- "LSTM: A Search Space Odyssey", by Greff et al., 2017.
READINGS: Deep Learning based Multi-View Learning
- "Deep Multimodal Autoencoders", by Ngiam et al., 2011.
- "Extending Long Short-Term Memory for Multi-View Structured Learning", by Rajagopalan et al., 2016.
- "Compressing Recurrent Neural Network with Tensor Train", by Tjandra et al., 2017.
- "Deep Canonically Correlated Autoencoders", by Wang et al., 2015.
- "Multimodal Tensor Fusion Network", by Zadeh et al., 2017.
- "Memory Fusion Network for Multi-View Sequential Learning", by Zadeh et al., 2018.
RESOURCES: Transfer Learning
- "Learning To Learn", by Thrun and Pratt (Eds.), 1998. https://www.springer.com/us/book/9780792380474
- "Transfer Learning", by Torrey and Shavlik, 2009. http://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf
- http://ruder.io/transfer-learning/
- "A Survey on Transfer Learning", by Pan and Yang, 2009. https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
- http://people.idsia.ch/~juergen/metalearner.html
- "Learning to Remember Rare Events", by Kaiser et al., 2017. https://arxiv.org/abs/1703.03129
RESOURCES: Potpourri
- https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c
- "Are Loss Functions All the Same?", by Rosasco et al., 2004. https://www.mitpressjournals.org/doi/10.1162/089976604773135104
- "Some Thoughts About the Design of Loss Functions", by Hennig and Kutlukaya, 2007. https://www.ine.pt/revstat/pdf/rs070102.pdf
- http://christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html
- "On Loss Functions for Deep Neural Networks in Classification", by Janocha and Czarnecki, 2017. https://arxiv.org/pdf/1702.05659.pdf
- "A More General Robust Loss Function", by Barron, 2018. https://arxiv.org/pdf/1701.03077.pdf