Sep. 11, 2018


In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the two topics are intertwined, and detail the challenges one may encounter with production data. We also showcase how deep learning can be leveraged to learn nonlinear correlations, which in turn can be used to further reduce the false positive rate of an anomaly detection system. Finally, we provide an overview of how correlation can be leveraged for common representation learning.

Arun Kejariwal

- 1. Deep Learning for Time Series Data ARUN KEJARIWAL @arun_kejariwal TheAIconf.com in San Francisco September 2018
- 2. 2 About Me: Product focus; Building and Scaling Teams; Advancing the state-of-the-art; Scalability; Performance
- 3. 3 Applications of Time Series: Media (Fake News), Security (Threat Detection), Finance, Control Systems (Malfunction Detection), Operations (Availability, Performance, Forecasting)
- 4. 4 Anomaly Detection: required in every application domain. Rule-based: μ ± 3*σ; Median, MAD. Statistical tests (underlying assumptions): Generalized ESD. Forecasting (seasonality, trend): ARIMA, SARIMA, Robust Kalman Filter. Time series analysis: Clustering. Other techniques: PCA, One-class SVM, Isolation Forests. (Un)supervised deep learning: Autoencoder, Variational Autoencoder, LSTM, GRU, Clockwork RNN, Depth-Gated RNN
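The rule-based techniques on this slide are simple to sketch. Below is a minimal illustration (not from the talk) of the μ ± 3*σ rule and a Median/MAD detector; the series and the 3.5 threshold are made-up example values:

```python
import numpy as np

def three_sigma_anomalies(x):
    """Flag points more than 3 standard deviations from the mean (mu +/- 3*sigma)."""
    mu, sigma = np.mean(x), np.std(x)
    return np.abs(x - mu) > 3 * sigma

def mad_anomalies(x, threshold=3.5):
    """Robust variant: modified z-score based on the median and the
    median absolute deviation (MAD)."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros(len(x), dtype=bool)
    # 0.6745 makes the score comparable to a z-score under normality
    modified_z = 0.6745 * (x - med) / mad
    return np.abs(modified_z) > threshold

x = np.array([10.1, 9.9, 10.0, 10.2, 9.8, 55.0, 10.1])
print(mad_anomalies(x))          # flags the spike at index 5
print(three_sigma_anomalies(x))  # the spike inflates sigma and masks itself
```

On this toy series the spike inflates both the mean and the standard deviation enough to mask itself from the 3-sigma rule, while the MAD rule catches it; this is one reason the slide lists robust (median-based) variants alongside μ ± 3*σ.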
- 5. 5 HISTORY Neural Network Research
- 6. 6 Recurrent Neural Networks: Long History! RTRL [Robinson and Fallside 1987], TDNN [Waibel 1987], BPTT [Werbos 1988], NARX [Lin et al. 1996]
- 7. 7 RNN: Long Short-Term Memory. Over 20 yrs! (Neural Computation, 1997.) St: hidden state. “The LSTM’s main idea is that, instead of computing St from St-1 directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔSt, which is then added to St-1 to obtain St.” [Jozefowicz et al. 2015] Gates: (a) Forget gate, (b) Input gate, (c) Output gate. Resistant to the vanishing gradient problem; achieves better results when dropout is used and when a bias of 1 is added to the LSTM’s forget gate. * Figure borrowed from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
- 8. 8 RNN: Long Short-Term Memory, Application to Anomaly Detection. Prediction error: finding pattern anomalies; no need for a fixed-size window for model estimation; works with non-stationary time series with irregular structure. LSTM Encoder-Decoder. Explore other extensions of LSTMs such as GRU, Memory Networks, Convolutional LSTM, Quasi-RNN, Dilated RNN, Skip RNN, HF-RNN, Bi-RNN, Zoneout (regularizing RNNs), TAGM
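The prediction-error idea on this slide is model-agnostic: fit a forecaster (an LSTM here, but any model works), characterize its residuals on anomaly-free data, and flag points whose residuals are unlikely. A minimal sketch of just the scoring step, with made-up residuals and a hypothetical threshold of 4.0 (the forecaster itself is abstracted away):

```python
import numpy as np

def fit_error_model(val_errors):
    """Fit a Gaussian to prediction errors from anomaly-free validation data."""
    return np.mean(val_errors), np.std(val_errors)

def anomaly_scores(actual, predicted, mu, sigma):
    """Score each point by the z-score of its prediction residual."""
    errors = np.asarray(actual, float) - np.asarray(predicted, float)
    return np.abs(errors - mu) / sigma

# Hypothetical residuals from a forecaster on clean validation data
val_errors = np.array([0.10, -0.05, 0.08, -0.12, 0.02])
mu, sigma = fit_error_model(val_errors)

actual    = np.array([1.00, 1.10, 0.90, 5.00, 1.00])
predicted = np.array([1.02, 1.08, 0.95, 1.00, 0.98])
flags = anomaly_scores(actual, predicted, mu, sigma) > 4.0  # flags index 3
```

Because only residuals are thresholded, there is no fixed-size window requirement: the forecaster consumes the sequence, and scoring is pointwise.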
- 9. 9 Sequence Modeling: Financial Forecasting, Machine Translation, Speech Synthesis, NLP Modeling. [1] “A Critical Review of Recurrent Neural Networks for Sequence Learning”, by Lipton et al., 2015. (https://arxiv.org/abs/1506.00019) [2] “Sequence to Sequence Learning with Neural Networks”, by Sutskever et al., 2014. (https://arxiv.org/abs/1409.3215)
- 10. 10 Alternatives To RNNs. TCN*: Temporal Convolutional Network = 1D Fully-Convolutional Network + causal convolutions. Assumptions and optimizations: “stability”, dilation, residual connections. Advantages: inference speed, parallelizability, trainability. Feed-forward models#: Gated-Convolutional Language Model, Universal Transformer, WaveNet. “… the ‘infinite memory’ advantage of RNNs is largely absent in practice.” “The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history.” [Bai et al. 2018] * “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”, by Bai et al., 2018. (https://arxiv.org/pdf/1803.01271.pdf) # “When Recurrent Models Don't Need to be Recurrent”, by Miller and Hardt, 2018. (https://arxiv.org/pdf/1805.10369.pdf)
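The causal convolutions at the heart of a TCN are easy to state: the output at time t may depend only on inputs at times <= t, and dilation spaces the filter taps out so the receptive field grows exponentially with depth. A minimal single-filter sketch (an illustration, not an implementation from the talk or the paper):

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation=1):
    """1D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    The input is left-padded with zeros so output length equals input length."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, float)])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# kernel [1, 1] with dilation 2 computes x[t] + x[t-2]
out = causal_dilated_conv([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], dilation=2)
# out == [1, 2, 4, 6]
```

Stacking such layers with dilations 1, 2, 4, ... is what gives a TCN a large receptive field with few layers, and, because every output depends on a fixed window rather than a recurrent state, computation parallelizes across time steps.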
- 11. 11 Challenges: Going Beyond. Large number of time series (up to hundreds of millions): dashboarding impractical. Sea of anomalies: fatigue. Root cause analysis: long TTD
- 12. 12 Multi-variate Analysis. Curse of dimensionality: computationally expensive, real-time constraints. Dimensionality reduction: PCA, SVD. Recency: decision making
- 13. 13
- 14. 14 Correlation Analysis An Example
- 15. 15 Correlation Analysis Another Example
- 16. 16 Correlation Analysis Surfacing Actionable Insights
- 17. 17 Correlation Analysis: Surfacing Actionable Insights. Correlation matrix: bird’s eye view, but not scalable (O(n²)) for millions of time series/data streams. Meaningless correlations: lack of context. Systems domain: exploiting topology
- 18. 18 Correlation Coefficients: Different Flavors. Pearson [Pearson 1900], Spearman 𝜌 [Spearman 1904, 1906], Kendall 𝜏 [Kendall ’38], Goodman and Kruskal 𝛾 [Goodman and Kruskal ’54], Somers’ D [Somers ’62], Cohen’s 𝜅 [Cohen ’60], Cramér’s V [Cramér ’46]
- 19. 19 Pearson Correlation: A Deep Dive. Robustness: sensitive to outliers. Amenable to incremental computation. Measures linear correlation: susceptible to curvature and to the magnitude of the residuals; not rotation invariant; susceptible to heteroscedasticity. Trade-off: speed vs. accuracy. * Figure borrowed from “Robust Correlation: Theory and Applications”, by Shevlyakov and Oja.
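The slide's point that Pearson correlation is amenable to incremental computation can be made concrete with running sums: each new pair costs O(1), so the coefficient can be maintained over a stream. A minimal sketch (note that this sum-based form, like Pearson itself, is sensitive to outliers, and can lose precision on very long streams):

```python
import math

class OnlinePearson:
    """Streaming Pearson correlation maintained from six running sums."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y

    def corr(self):
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        return cov / math.sqrt(var_x * var_y)

op = OnlinePearson()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    op.update(x, y)
# op.corr() == 1.0 for perfectly linear data
```

This is what makes Pearson attractive at the scale discussed earlier: unlike rank-based coefficients (Spearman, Kendall), no history of observations needs to be stored or re-sorted per update.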
- 20. 20 Modalities*. Time Series; Text (Sentiment Analysis); Image (Animated GIFs); Audio (Digital Assistants); Video (Live Streams); Haptic (Robotics). * “Multimodal Machine Learning: A Survey and Taxonomy”, by Baltrušaitis et al., 2017.
- 21. 21 Other Modalities Research Opportunity Smell Taste
- 22. 22 Leveraging Correlation: Deep Learning Context. Correlation Embedding Analysis, feature extraction [Fu et al. 2008]; Correlational PCA, feature extraction [Fu et al. 2008]; Correlational NN, common representation learning [Chandar et al. 2017]; Correlation Loss, face detection [Deng et al. 2017]; CFNet, object tracking [Valmadre et al. 2017]
- 23. 23 Loss Functions: Different Flavors. Softmax Loss; Triplet Loss [Schroff et al. 2015]; Improved Triplet Loss [Cheng et al. 2016]; Quadruplet Loss [Chen et al. 2017]: larger inter-class variation and smaller intra-class variation; Center Loss [Wen et al. 2016]; Center Invariant Loss [Wu et al. 2017]; Correlation Loss [Deng et al. 2017]: class separability of features (minimize inter-class correlation), separability and discriminatory ability of features (maximize intra-class correlation)
- 24. 24 Correlation Loss: Deep Dive. Normalization non-linearly changes the distribution, yielding a non-Gaussian distribution. Uncentered Pearson correlation: angle similarity, insensitive to anomalies, enlarges margins amongst different classes. Distribution of deeply learned features*: softmax loss vs. correlation loss. * Figure borrowed from [Deng et al. 2017].
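The uncentered Pearson correlation referenced on this slide is just the cosine of the angle between two feature vectors; dropping the mean-centering is what makes it a pure angle similarity that ignores magnitude. A one-function sketch (the full correlation loss in [Deng et al. 2017] builds on this similarity; see the paper for the exact formulation):

```python
import numpy as np

def uncentered_pearson(u, v):
    """Uncentered Pearson correlation: cosine of the angle between u and v."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Scaling a vector leaves the similarity unchanged: only the angle matters
print(uncentered_pearson([1, 2], [2, 4]))   # 1.0 (same direction)
print(uncentered_pearson([1, 0], [0, 1]))   # 0.0 (orthogonal)
```

Magnitude-invariance is also why the slide calls it insensitive to anomalies: an outlier feature vector that is merely longer, but points in the class direction, does not change the score.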
- 25. 25 Canonical Correlation Analysis: Common Representation Learning. Deep Generalized [Benton et al. 2017]; Deep Discriminative [Dorfer and Widmer 2016]; Deep Variational [Wang et al. 2016]; Soft Decorrelation [Chang et al. 2017]. Maximize correlation of the views when projected to a common subspace; minimize self- and cross-reconstruction error and maximize correlation. Leverage CRL for transfer learning - high commercial potential
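Classical (linear) CCA, which the deep variants above generalize, can be sketched directly: whiten each view, then take the singular values of the cross-covariance; those singular values are the canonical correlations. A minimal numpy version, with a small ridge term added for numerical stability (an assumption of this sketch, not of any cited paper):

```python
import numpy as np

def linear_cca(X, Y, k=1, reg=1e-6):
    """Top-k canonical correlations between views X (n, dx) and Y (n, dy)."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each view via Cholesky factors, then SVD of the cross-covariance
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    return np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)[:k]

# Two views of the same underlying signal correlate near 1 in the shared subspace
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = X @ rng.normal(size=(3, 3)) + 0.01 * rng.normal(size=(200, 3))
```

Deep CCA replaces the linear projections here with learned networks and maximizes the same objective; common representation learning then reuses the shared subspace, e.g. for transfer.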
- 26. 26
- 27. 27 Spurious Correlation Long History
- 28. 28 Spurious Correlation Lack Of Context * http://www.tylervigen.com/spurious-correlations
- 29. 29 Nonsense Correlation Long History
- 32. 32 READINGS: Surveys & Books. Parallel and Distributed Processing, Eds. Rumelhart and McClelland, ’86. Neurocomputing: Foundations of Research, Anderson and Rosenfeld, 1988. The Roots of Backpropagation, Werbos, 1994. Neural Networks: A Systematic Introduction, Rojas, 1996
- 33. 33 READINGS: Surveys & Books. Deep Learning, LeCun et al., 2015. Deep Learning in Neural Networks: An Overview, Schmidhuber, 2015. Deep Learning, Goodfellow et al., 2016. Neuro-Dynamic Programming, Bertsekas, 1996
- 34. 34 [Werbos ’74] Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences [Parker ’82] Learning Logic [Rumelhart, Hinton and Williams ‘86] Learning internal representations by error propagation [Lippmann ’87] An introduction to computing with neural networks
- 35. 35 [Widrow and Lehr ’90] 30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation [Wang and Raj ’17] On the Origin of Deep Learning [Arulkumaran et al. ’17] A Brief Survey of Deep Reinforcement Learning [Alom et al. ’18] The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches
- 36. 36 [Higham and Higham ’18] Deep Learning: An Introduction for Applied Mathematicians [Marcus ’18] Deep Learning: A Critical Appraisal
- 37. 37 READINGS: Anomaly Detection. Understanding Anomaly Detection: safaribooksonline.com/library/view/understanding-anomaly-detection/9781491983676/ https://www.slideshare.net/arunkejariwal/anomaly-detection-in-realtime-data-streams-using-heron “Variational Inference For On-Line Anomaly Detection In High-Dimensional Time Series”, by Sölch et al., 2016. https://arxiv.org/pdf/1602.07109.pdf https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265 “Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction”, by Sakurada and Yairi, 2014. https://dl.acm.org/citation.cfm?id=2689747 “On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”, by Choudhary et al., 2017. https://arxiv.org/abs/1710.04735
- 38. 38 READINGS: RNNs. “Learning to Forget: Continual Prediction with LSTM”, by Gers et al., 2000. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”, by Chung et al., 2014. “An Empirical Exploration Of Recurrent Network Architectures”, by Jozefowicz et al., 2015. “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”, by Cho et al., 2014. “Visualizing and Understanding Recurrent Networks”, by Karpathy et al., 2015. “LSTM: A Search Space Odyssey”, by Greff et al., 2017.
- 39. 39 READINGS: Deep Learning based Multi-View Learning. “Deep Multimodal Autoencoders”, by Ngiam et al., 2011. “Extending Long Short-Term Memory for Multi-View Structured Learning”, by Rajagopalan et al., 2016. “Compressing Recurrent Neural Network With Tensor Train”, by Tjandra et al., 2017. “Deep Canonically Correlated Autoencoders”, by Wang et al., 2015. “Multimodal Tensor Fusion Network”, by Zadeh et al., 2017. “Memory Fusion Network for Multi-View Sequential Learning”, by Zadeh et al., 2018.
- 40. 40 RESOURCES: History of Neural Networks. http://www.andreykurenkov.com/writing/ai/a-brief-history-of-neural-nets-and-deep-learning/ http://people.idsia.ch/~juergen/firstdeeplearner.html https://www.import.io/post/history-of-deep-learning/ https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html http://www.psych.utoronto.ca/users/reingold/courses/ai/cache/neural4.html https://devblogs.nvidia.com/deep-learning-nutshell-history-training/
- 41. 41 RESOURCES: Transfer Learning. “Learning To Learn”, by Thrun and Pratt (Eds), 1998. https://www.springer.com/us/book/9780792380474 “Transfer Learning”, by Torrey and Shavlik, 2009. http://ftp.cs.wisc.edu/machine-learning/shavlik-group/torrey.handbook09.pdf http://ruder.io/transfer-learning/ “A Survey on Transfer Learning”, by Pan and Yang, 2009. https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf http://people.idsia.ch/~juergen/metalearner.html “Learning to Remember Rare Events”, by Kaiser et al., 2017. https://arxiv.org/abs/1703.03129
- 42. 42 RESOURCES: Potpourri. https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c “Are Loss Functions All the Same?”, by Rosasco et al., 2004. https://www.mitpressjournals.org/doi/10.1162/089976604773135104 “Some Thoughts About The Design Of Loss Functions”, by Hennig and Kutlukaya, 2007. https://www.ine.pt/revstat/pdf/rs070102.pdf http://christopher5106.github.io/deep/learning/2016/09/16/about-loss-functions-multinomial-logistic-logarithm-cross-entropy-square-errors-euclidian-absolute-frobenius-hinge.html “On Loss Functions for Deep Neural Networks in Classification”, by Janocha and Czarnecki, 2017. https://arxiv.org/pdf/1702.05659.pdf “A More General Robust Loss Function”, by Barron, 2018. https://arxiv.org/pdf/1701.03077.pdf