In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the two topics are intertwined, and detail the challenges one may encounter with production data. We also showcase how deep learning can be leveraged to learn nonlinear correlations, which in turn can be used to further contain the false positive rate of an anomaly detection system. Finally, we provide an overview of how correlation can be leveraged for common representation learning.
Recurrent Neural Networks
Long History!
- RTRL [Robinson and Fallside 1987]
- TDNN [Waibel 1987]
- BPTT [Werbos 1988]
- NARX [Lin et al. 1996]
S_t: hidden state
"The LSTM's main idea is that, instead of computing S_t from S_{t-1} directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔS_t, which is then added to S_{t-1} to obtain S_t." [Jozefowicz et al. 2015]
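To make the additive update concrete, here is a minimal sketch of a single LSTM step (NumPy assumed; the stacked-parameter layout and names are illustrative, not from the talk). The gated delta is added to the previous cell state rather than replacing it:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the four gates' parameters stacked row-wise."""
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget / input / output gates
    c = f * c_prev + i * np.tanh(g)               # additive state update: delta added to c_prev
    h = o * np.tanh(c)                            # new hidden state
    return h, c
```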
RNN: Long Short-Term Memory
Over 20 yrs! (Neural Computation, 1997)
- Resistant to the vanishing gradient problem
- Achieves better results when dropout is used
- Adding a bias of 1 to the LSTM's forget gate
[Figure: LSTM cell with (a) forget gate, (b) input gate, (c) output gate; borrowed from http://colah.github.io/posts/2015-08-Understanding-LSTMs/]
RNN: Long Short-Term Memory
Application to Anomaly Detection
- Prediction error: finding pattern anomalies (see the sketch below)
- No need for a fixed-size window for model estimation
- Works with non-stationary time series with irregular structure
- LSTM Encoder-Decoder
- Explore other extensions of LSTMs such as GRU, Memory Networks, Convolutional LSTM, Quasi-RNN, Dilated RNN, Skip RNN, HF-RNN, Bi-RNN, Zoneout (regularizing RNNs), TAGM
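A minimal sketch of prediction-error-based detection (PyTorch assumed; the window size, hidden size, and thresholding rule are illustrative choices, not from the talk): train an LSTM to forecast the next point, then flag points whose forecast error is unusually large.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):             # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict the next value

def anomaly_scores(model, series, window=50):
    """Prediction error at each step of a 1-D tensor; large errors suggest anomalies."""
    model.eval()
    errors = []
    with torch.no_grad():
        for t in range(window, len(series) - 1):
            x = series[t - window:t].view(1, window, 1)
            pred = model(x).item()
            errors.append(abs(pred - series[t].item()))
    return errors

# After training, flag points whose error exceeds, e.g., mean + 3 * std of the errors.
```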
Alternatives to RNNs
TCN*: Temporal Convolutional Network = 1D fully-convolutional network + causal convolutions (see the sketch at the end of this slide)
Assumptions and optimizations:
- "Stability"
- Dilation
- Residual connections
Advantages:
- Inference speed
- Parallelizability
- Trainability
Feed-forward models#: Gated-Convolutional Language Model, Universal Transformer, WaveNet
* "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling", by Bai et al. 2018. (https://arxiv.org/pdf/1803.01271.pdf)
# “When Recurrent Models Don't Need to be Recurrent”, by Miller and Hardt, 2018. (https://arxiv.org/pdf/1805.10369.pdf)
“… the “infinite memory” advantage of RNNs is largely absent in practice.”
“The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history.”
[Bai et al. 2018]
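A minimal sketch of the causal-convolution building block (PyTorch assumed; channel counts, kernel size, and the block layout are illustrative, not the exact architecture of Bai et al.): left-padding ensures the output at time t depends only on inputs at times ≤ t, while dilation and a residual connection mirror the optimizations listed above.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # amount of left padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad the past only, never the future
        return self.conv(x)

class TCNBlock(nn.Module):
    """Two causal convolutions + ReLU, wrapped in a residual connection."""
    def __init__(self, ch, dilation):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(ch, ch, dilation=dilation), nn.ReLU(),
            CausalConv1d(ch, ch, dilation=dilation), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection
```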
Challenges: Going Beyond
- Large number of time series: up to hundreds of millions
- Dashboarding: impractical
- Sea of anomalies: fatigue
- Root cause analysis: long TTD (time to detect)
Multi-variate Analysis
- Curse of dimensionality
- Computationally expensive: real-time constraints
- Dimensionality reduction: PCA, SVD (see the sketch below)
- Recency: decision making
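A minimal sketch of SVD-based dimensionality reduction (NumPy assumed; the data shape and k are illustrative): project a large set of series onto their top principal components as a cheap preprocessing step before downstream correlation or anomaly analysis.

```python
import numpy as np

def reduce_dim(X, k=10):
    """X: (n_series, n_timesteps). Returns k-dimensional scores per series."""
    Xc = X - X.mean(axis=0, keepdims=True)          # center each timestep
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                            # project onto top-k principal directions

X = np.random.randn(1000, 500)  # 1,000 series, 500 points each (toy data)
Z = reduce_dim(X, k=10)         # shape (1000, 10)
```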
Correlation Analysis: Surfacing Actionable Insights
- Correlation matrix: bird's-eye view (see the sketch below)
- Not scalable: O(n²) with millions of time series/data streams
- Meaningless correlations: lack of context
- Systems domain: exploiting topology
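A minimal sketch of a vectorized Pearson correlation matrix (NumPy assumed; row normalization is an illustrative implementation choice). The n² pairwise entries are exactly what makes this impractical at millions of series, motivating topology-based pruning.

```python
import numpy as np

def correlation_matrix(X):
    """X: (n_series, n_timesteps) -> (n_series, n_series) Pearson matrix."""
    Z = X - X.mean(axis=1, keepdims=True)           # center each series
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize each series
    return Z @ Z.T                                  # n^2 pairwise dot products
```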
Correlation Coefficients: Different Flavors
- Pearson [Pearson 1900]
- Goodman and Kruskal 𝛾 [Goodman and Kruskal '54]
- Kendall 𝜏 [Kendall '38]
- Spearman 𝜌 [Spearman 1904, 1906]
- Somers' D [Somers '62]
- Cohen's 𝜅 [Cohen '60]
- Cramér's V [Cramér '46]
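A minimal sketch (SciPy assumed; the synthetic data are illustrative) contrasting three of these coefficients on the same pair: Pearson measures linear association, while Spearman 𝜌 and Kendall 𝜏 measure monotonic association, so they diverge on a nonlinear but monotonic relationship.

```python
import numpy as np
from scipy import stats

np.random.seed(0)
x = np.linspace(0, 5, 100)
y = np.exp(x) + np.random.normal(scale=5.0, size=x.size)  # monotonic, nonlinear

print("Pearson r:   ", stats.pearsonr(x, y)[0])    # well below 1: relation is nonlinear
print("Spearman rho:", stats.spearmanr(x, y)[0])   # near 1: relation is monotonic
print("Kendall tau: ", stats.kendalltau(x, y)[0])  # near 1: relation is monotonic
```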
Pearson Correlation: A Deep Dive
- Robustness: sensitive to outliers
- Amenable to incremental computation (see the sketch below)
- Linear correlation: susceptible to curvature and to the magnitude of the residuals
- Not rotation invariant
- Susceptible to heteroscedasticity
- Trade-off: speed vs. accuracy
* Figure borrowed from "Robust Correlation: Theory and Applications", by Shevlyakov and Oja.
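To illustrate why Pearson lends itself to incremental computation, here is a minimal sketch (pure Python; the class and attribute names are my own) that maintains the correlation over a stream with Welford-style running co-moments, so each update is O(1):

```python
class StreamingPearson:
    def __init__(self):
        self.n = 0
        self.mx = self.my = 0.0               # running means
        self.sxx = self.syy = self.sxy = 0.0  # running (co)moments

    def update(self, x, y):
        self.n += 1
        dx = x - self.mx                 # deviation from the old mean of x
        self.mx += dx / self.n
        dy = y - self.my
        self.my += dy / self.n
        self.sxx += dx * (x - self.mx)   # Welford update: old dev * new dev
        self.syy += dy * (y - self.my)
        self.sxy += dx * (y - self.my)

    @property
    def r(self):
        denom = (self.sxx * self.syy) ** 0.5
        return self.sxy / denom if denom > 0 else float("nan")

sp = StreamingPearson()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    sp.update(x, y)
print(sp.r)  # 1.0 for perfectly linear data
```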
Modalities*
- Time series
- Text: sentiment analysis
- Image: animated GIFs
- Audio: digital assistants
- Video: live streams
- Haptic: robotics
* "Multimodal Machine Learning: A Survey and Taxonomy", by Baltrušaitis et al., 2017.
LEVERAGING CORRELATION: Deep Learning Context
- Feature extraction: Correlation Embedding Analysis [Fu et al. 2008]
- Feature extraction: Correlational PCA [Fu et al. 2008]
- Common representation learning: Correlational NN [Chandar et al. 2017]
- Face detection: Correlation Loss [Deng et al. 2017]
- Object tracking: CFNet [Valmadre et al. 2017]
Loss Functions: Different Flavors
- Class separability of features (minimize interclass correlation): Softmax Loss; Triplet Loss [Schroff et al. 2015]; Improved Triplet Loss [Cheng et al. 2016]; Center Loss [Wen et al. 2016]; Center Invariant Loss [Wu et al. 2017]
- Larger inter-class variation and a smaller intra-class variation: Quadruplet Loss [Chen et al. 2017]
- Separability and discriminatory ability of features (maximize intraclass correlation): Correlation Loss [Deng et al. 2017]
Correlation Loss: Deep Dive
- Normalization: non-linearly changes the distribution; yields a non-Gaussian distribution
- Uncentered Pearson correlation: angle similarity; insensitive to anomalies
- Enlarges margins amongst different classes (a sketch follows below)
[Figure: distribution of deeply learned features under softmax loss vs. correlation loss; borrowed from [Deng et al. 2017].]
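A minimal sketch of the angle-similarity idea (PyTorch assumed; this is my simplified rendering, not the exact formulation of [Deng et al. 2017]): normalizing features and class centers turns the uncentered Pearson correlation into a cosine, and the loss pulls each feature toward its class center in angle, maximizing intraclass correlation.

```python
import torch
import torch.nn.functional as F

def correlation_loss(features, labels, centers):
    """features: (B, d); labels: (B,) class indices; centers: (num_classes, d)."""
    f = F.normalize(features, dim=1)         # unit-normalized features:
    c = F.normalize(centers[labels], dim=1)  # uncentered Pearson == cosine of the angle
    return (1.0 - (f * c).sum(dim=1)).mean() # 1 - cos(feature, class center), averaged
```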
CANONICAL CORRELATION ANALYSIS: Common Representation Learning
- Deep Generalized [Benton et al. 2017]
- Deep Discriminative [Dorfer and Widmer 2016]
- Deep Variational [Wang et al. 2016]
- Soft Decorrelation [Chang et al. 2017]
- Maximize correlation of the views when projected to a common subspace (see the sketch below)
- Minimize self- and cross-reconstruction error and maximize correlation
- Leverage CRL for transfer learning: high commercial potential
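For reference, a minimal sketch of classical linear CCA (NumPy assumed; the regularization constant is an illustrative choice): it finds per-view projections that maximize correlation in a common subspace, which is what the deep variants above generalize with nonlinear maps.

```python
import numpy as np

def cca(X, Y, k=2, reg=1e-6):
    """X: (n, dx), Y: (n, dy) -> projections (dx, k), (dy, k) and canonical correlations."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])  # regularized view covariances
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n                             # cross-covariance
    # Whiten each view; the SVD of the whitened cross-covariance gives
    # the canonical directions and correlations.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    Lx_inv, Ly_inv = np.linalg.inv(Lx), np.linalg.inv(Ly)
    U, S, Vt = np.linalg.svd(Lx_inv @ Cxy @ Ly_inv.T)
    A = Lx_inv.T @ U[:, :k]   # projection for view X
    B = Ly_inv.T @ Vt[:k].T   # projection for view Y
    return A, B, S[:k]        # S[:k]: canonical correlations
```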
READINGS: Surveys & Books
- "Parallel and Distributed Processing", Rumelhart and McClelland (Eds.), 1986.
- "Neurocomputing: Foundations of Research", Anderson and Rosenfeld, 1988.
- "The Roots of Backpropagation", Werbos, 1994.
- "Neural Networks: A Systematic Introduction", Rojas, 1996.
READINGS: Surveys & Books (continued)
- "Deep Learning", LeCun et al., 2015.
- "Deep Learning in Neural Networks: An Overview", Schmidhuber, 2015.
- "Deep Learning", Goodfellow et al., 2016.
- "Neuro-Dynamic Programming", Bertsekas, 1996.
- [Werbos '74] "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences"
- [Parker '82] "Learning Logic"
- [Rumelhart, Hinton and Williams '86] "Learning Internal Representations by Error Propagation"
- [Lippmann '87] "An Introduction to Computing with Neural Networks"
- [Widrow and Lehr '90] "30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation"
- [Wang and Raj '17] "On the Origin of Deep Learning"
- [Arulkumaran et al. '17] "A Brief Survey of Deep Reinforcement Learning"
- [Alom et al. '18] "The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches"
- [Higham and Higham '18] "Deep Learning: An Introduction for Applied Mathematicians"
- [Marcus '18] "Deep Learning: A Critical Appraisal"
READINGS: Anomaly Detection
- "Understanding Anomaly Detection": safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/
- https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron
- "Variational Inference for On-Line Anomaly Detection in High-Dimensional Time Series", by Sölch et al., 2016. https://arxiv.org/pdf/1602.07109.pdf
- https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265
- "Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction", by Sakurada and Yairi, 2014. https://dl.acm.org/citation.cfm?id=2689747
- "On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data", by Choudhary et al., 2017. https://arxiv.org/abs/1710.04735
READINGS: RNNs
- "Learning to Forget: Continual Prediction with LSTM", by Gers et al., 2000.
- "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling", by Chung et al., 2014.
- "An Empirical Exploration of Recurrent Network Architectures", by Jozefowicz et al., 2015.
- "On the Properties of Neural Machine Translation: Encoder–Decoder Approaches", by Cho et al., 2014.
- "Visualizing and Understanding Recurrent Networks", by Karpathy et al., 2015.
- "LSTM: A Search Space Odyssey", by Greff et al., 2017.
READINGS: Deep Learning based Multi-View Learning
- "Deep Multimodal Autoencoders", by Ngiam et al., 2011.
- "Extending Long Short-Term Memory for Multi-View Structured Learning", by Rajagopalan et al., 2016.
- "Compressing Recurrent Neural Network with Tensor Train", by Tjandra et al., 2017.
- "Deep Canonically Correlated Autoencoders", by Wang et al., 2015.
- "Multimodal Tensor Fusion Network", by Zadeh et al., 2017.
- "Memory Fusion Network for Multi-View Sequential Learning", by Zadeh et al., 2018.
RESOURCES: Transfer Learning
- "Learning To Learn", by Thrun and Pratt (Eds.), 1998. https://www.springer.com/us/book/9780792380474
- "Transfer Learning", by Torrey and Shavlik, 2009. http://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf
- http://ruder.io/transfer-learning/
- "A Survey on Transfer Learning", by Pan and Yang, 2009. https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf
- http://people.idsia.ch/~juergen/metalearner.html
- "Learning to Remember Rare Events", by Kaiser et al., 2017. https://arxiv.org/abs/1703.03129
RESOURCES: Potpourri
- https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c
- "Are Loss Functions All the Same?", by Rosasco et al., 2004. https://www.mitpressjournals.org/doi/10.1162/089976604773135104
- "Some Thoughts About the Design of Loss Functions", by Hennig and Kutlukaya, 2007. https://www.ine.pt/revstat/pdf/rs070102.pdf
- http://christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html
- "On Loss Functions for Deep Neural Networks in Classification", by Janocha and Czarnecki, 2017. https://arxiv.org/pdf/1702.05659.pdf
- "A More General Robust Loss Function", by Barron, 2018. https://arxiv.org/pdf/1701.03077.pdf