Deep Learning for Time Series Data


In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the topics are intertwined, and detail the challenges one may encounter based on production data. We also showcase how deep learning can be leveraged to learn nonlinear correlation, which in turn can be used to further contain the false positive rate of an anomaly detection system. Further, we provide an overview of how correlation can be leveraged for common representation learning.

Published in: Technology

  1. Deep Learning for Time Series Data. Arun Kejariwal (@arun_kejariwal), San Francisco, September 2018
  2. About Me: Product focus; Building and Scaling Teams; Advancing the state-of-the-art; Scalability; Performance
  3. Time Series Applications: Media (Fake News); Security (Threat Detection); Finance; Control Systems (Malfunction Detection); Operations (Availability, Performance); Forecasting
  4. Anomaly Detection (required in every application domain). Rule-based: μ ± 3σ; median, MAD. Tests: Generalized ESD; underlying assumptions. Statistical: forecasting (seasonality, trend); techniques: ARIMA, SARIMA, robust Kalman filter. Time series analysis; clustering; other techniques: PCA, One-Class SVM, Isolation Forests; (un)supervised. Deep learning: Autoencoder, Variational Autoencoder, LSTM, GRU, Clockwork RNN, Depth Gated RNN.
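
The two rule-based detectors named on this slide (μ ± 3σ and median/MAD) can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function names, the threshold of 3 for the MAD rule, and the toy series are assumptions made for the example.

```python
import statistics


def three_sigma_outliers(values):
    """Return indices of points outside mean ± 3 * (population) standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]


def mad_outliers(values, threshold=3.0):
    """Return indices of points whose robust z-score exceeds `threshold`.

    The constant 1.4826 rescales the MAD so that it estimates the standard
    deviation when the data are normally distributed.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so there is nothing to score against
    return [i for i, v in enumerate(values)
            if abs(v - med) / (1.4826 * mad) > threshold]


# Toy series (assumed for illustration) with one obvious spike at index 7.
series = [10, 11, 9, 10, 12, 10, 11, 100]
print(three_sigma_outliers(series))
print(mad_outliers(series))
```

On this toy series the 3σ rule flags nothing, because the spike inflates both the mean and the standard deviation, while the median/MAD rule flags index 7. That difference is one reason the underlying assumptions of each rule matter.
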
  5. History: Neural Network Research
  6. Recurrent Neural Networks: Long History! RTRL [Robinson and Fallside 1987]; TDNN [Waibel 1987]; BPTT [Werbos 1988]; NARX [Lin et al. 1996]
  7. RNN: Long Short-Term Memory. Over 20 yrs! Neural Computation, 1997. S_t: hidden state. “The LSTM’s main idea is that, instead of computing S_t from S_{t-1} directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔS_t, which is then added to S_{t-1} to obtain S_t.” [Jozefowicz et al. 2015] Resistant to vanishing gradient problem; achieves better results when dropout is used; adding bias of 1 to LSTM’s forget gate. Figure panels: (a) Forget gate, (b) Input gate, (c) Output gate. * Figure borrowed from
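
The additive state update described in the quote on this slide can be illustrated with a single-unit LSTM step. This is a sketch under assumptions: it uses the standard forget/input/output gate formulation with scalar weights, and the parameter layout and values are hypothetical, chosen only to show the forget-gate bias of 1 mentioned on the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM with scalar input, hidden and cell state.

    `p` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate update
    c = f * c_prev + i * g            # additive update of the cell state
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters; the forget-gate bias of 1 keeps the gate mostly open early on.
params = {
    "forget": (0.0, 0.0, 1.0),
    "input": (0.5, 0.0, 0.0),
    "output": (0.5, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated by adding `i * g` to a gated copy of the previous state, a forget gate near 1 lets information (and gradients) pass through many steps largely unchanged.
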
  8. RNN: Long Short-Term Memory. Application to Anomaly Detection: prediction error; finding pattern anomalies; no need for a fixed-size window for model estimation; works with non-stationary time series with irregular structure; LSTM Encoder-Decoder. Explore other extensions of LSTMs such as GRU, Memory Networks, Convolutional LSTM, Quasi-RNN, Dilated RNN, Skip RNN, HF-RNN, Bi-RNN, Zoneout (regularizing RNN), TAGM.
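
Prediction-error-based detection, as listed on this slide, can be sketched without a neural network. In the example below a trailing-median forecaster stands in for the LSTM (an explicit substitution made to keep the sketch self-contained); the function names, the calibration scheme, and the threshold `k` are assumptions.

```python
import statistics


def median_forecast(history, window=3):
    """Stand-in one-step-ahead predictor (NOT an LSTM): median of the last `window` points."""
    return statistics.median(history[-window:])


def prediction_error_anomalies(series, window=3, calibration=10, k=4.0):
    """Flag indices whose one-step prediction error is unusually large.

    The first `calibration` errors are assumed anomaly-free and set the
    error scale; later points are flagged when |error| > k * that scale.
    """
    steps = range(window, len(series))
    errors = [series[t] - median_forecast(series[:t], window) for t in steps]
    scale = statistics.pstdev(errors[:calibration])
    return [t for t, e in zip(steps, errors)
            if t >= window + calibration and abs(e) > k * scale]


# Toy periodic series (assumed for illustration) with a spike injected at index 16.
series = [10, 12, 11, 10, 12, 11, 10, 12, 11, 10,
          12, 11, 10, 12, 11, 10, 30, 11, 10, 12]
print(prediction_error_anomalies(series))
```

On this toy input only index 16 is flagged. Swapping `median_forecast` for a trained sequence model keeps the rest of the logic unchanged: the detector only needs a one-step-ahead prediction and a scale for the errors.
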
  9. Sequence Modeling: Financial Forecasting; Machine Translation; Speech Synthesis; NLP Modeling. [1] “A Critical Review of Recurrent Neural Networks for Sequence Learning”, by Lipton et al., 2015. [2] “Sequence to Sequence Learning with Neural Networks”, by Sutskever et al., 2014.
  10. Alternatives to RNNs. TCN* (Temporal Convolutional Network) = 1D Fully-Convolutional Network + Causal convolutions. Assumptions and Optimizations; “Stability”; Dilation; Residual Connections. Advantages: Inference Speed; Parallelizability; Trainability. Feed-Forward Models#: Gated-Convolutional Language Model; Universal Transformer; WaveNet. “… the “infinite memory” advantage of RNNs is largely absent in practice.” “The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history.” [Bai et al. 2018] * “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”, by Bai et al., 2018. # “When Recurrent Models Don't Need to be Recurrent”, by Miller and Hardt, 2018.
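
The two TCN ingredients named on this slide, causal convolution and dilation, can be sketched in a few lines. This is an illustrative pure-Python version with assumed function names, not the reference TCN code; it treats positions before the start of the sequence as zeros.

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """Causal 1D convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zero (left
    padding), so the output has the same length as the input and y[t]
    never depends on x[t + 1:].
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        y.append(total)
    return y


def receptive_field(kernel_size, dilations):
    """Inputs seen by one output in a stack with one convolution per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


print(causal_dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))
print(receptive_field(2, [1, 2, 4, 8]))
```

With kernel size 2 and dilations 1, 2, 4, 8, four layers already cover 16 time steps, which is how dilation lets a feed-forward model reach far into the past. Every output position can also be computed independently, which is where the parallelizability advantage comes from.
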
  11. Challenges: Going Beyond. Large Number of Time Series: up to hundreds of millions; dashboarding impractical. Sea of Anomalies: fatigue; root cause analysis; long TTD.
  12. Multi-variate Analysis: Curse of Dimensionality; Computationally Expensive; Real-time constraints; Dimensionality Reduction (PCA, SVD); Recency; Decision making.
  14. Correlation Analysis: An Example
  15. Correlation Analysis: Another Example
  16. Correlation Analysis: Surfacing Actionable Insights
  17. Correlation Analysis: Surfacing Actionable Insights. Correlation matrix: bird’s eye view; not scalable (O(n²)) for millions of time series/data streams; meaningless correlations; lack of context. Systems domain: exploiting topology.
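
The quadratic growth of the correlation matrix, and the idea of using topology to prune it, can be made concrete with a small sketch. The series names and the `dependency_edges` structure below are hypothetical; the slide does not say how topology is represented.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct pairs among n series: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def candidate_pairs(series_names, dependency_edges=None):
    """Pairs of series to correlate.

    Without topology, every pair is a candidate. With `dependency_edges`
    (a hypothetical list of (name, name) links between components), only
    directly connected series are compared.
    """
    if dependency_edges is None:
        return list(combinations(sorted(series_names), 2))
    known = set(series_names)
    return sorted({tuple(sorted(edge)) for edge in dependency_edges
                   if edge[0] in known and edge[1] in known})


names = ["api.latency", "db.cpu", "db.qps", "web.errors"]
edges = [("web.errors", "api.latency"), ("api.latency", "db.qps")]
print(pair_count(1_000_000))
print(len(candidate_pairs(names)))
print(candidate_pairs(names, edges))
```

One million series already imply roughly 5 × 10¹¹ pairs, whereas restricting comparisons to connected components keeps the number of pairs proportional to the number of edges and gives each surviving correlation some context.
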
  18. Correlation Coefficients: Different Flavors. Pearson [Pearson 1900]; Goodman and Kruskal 𝛾 [Goodman and Kruskal ’54]; Kendall 𝜏 [Kendall ’38]; Spearman 𝜌 [Spearman 1904, 1906]; Somers’ D [Somers ’62]; Cohen’s 𝜅 [Cohen ’60]; Cramér’s V [Cramér ’46].
  19. Pearson Correlation: A Deep Dive. Robustness: sensitive to outliers. Amenable to incremental computation. Linear correlation: susceptible to curvature; magnitude of the residuals; not rotation invariant; susceptible to heteroscedasticity. Trade-off: speed vs. accuracy. * Figure borrowed from “Robust Correlation: Theory and Applications”, by Shevlyakov and Oja.
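
Two properties from this slide, incremental computation and sensitivity to outliers, can be seen in a short streaming sketch. The class below uses a Welford-style running update; the class name and the toy data are assumptions made for illustration.

```python
import math


class StreamingPearson:
    """Pearson correlation maintained incrementally, one (x, y) pair at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.m2_x = 0.0   # running sum of squared deviations of x
        self.m2_y = 0.0   # running sum of squared deviations of y
        self.cov = 0.0    # running sum of co-deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x          # deviation from the OLD mean of x
        dy = y - self.mean_y          # deviation from the OLD mean of y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)
        self.cov += dx * (y - self.mean_y)

    def value(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.cov / denom if denom > 0 else float("nan")


r = StreamingPearson()
for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    r.update(x, y)
clean_value = r.value()     # perfectly linear data
r.update(5, -50)            # a single outlier
print(clean_value, r.value())
```

Each update costs constant time and memory, which suits streaming data. The same example shows the robustness problem: one outlier moves the coefficient from 1.0 to about -0.64.
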
  20. Modalities*: Time Series; Text (T): Sentiment Analysis; Image (I): Animated GIFs; Audio (A): Digital Assistants; Video (V): Live Streams; Haptic (H): Robotics. * “Multimodal Machine Learning: A Survey and Taxonomy”, by Baltrušaitis et al., 2017.
  21. Other Modalities (Research Opportunity): Smell; Taste.
  22. Leveraging Correlation: Deep Learning Context. Correlation Embedding Analysis: feature extraction [Fu et al. 2008]; Correlational PCA: feature extraction [Fu et al. 2008]; Correlational NN: common representation learning [Chandar et al. 2017]; Correlation Loss: face detection [Deng et al. 2017]; CFNet: object tracking [Valmadre et al. 2017].
  23. Loss Functions: Different Flavors. Class separability of features (minimize interclass correlation); Softmax Loss; Improved Triplet Loss [Cheng et al. 2016]; Triplet Loss [Schroff et al. 2015]; Center Invariant Loss [Wu et al. 2017]; Center Loss [Wen et al. 2016]; larger inter-class variation and a smaller intra-class variation; Quadruplet Loss [Chen et al. 2017]; separability and discriminatory ability of features (maximize intraclass correlation); Correlation Loss [Deng et al. 2017].
  24. Correlation Loss: Deep Dive. Normalization: non-linearly changes the distribution; yields non-Gaussian distribution. Uncentered Pearson correlation: angle similarity; insensitive to anomalies; enlarges margins amongst different classes. Distribution of deeply learned features*: softmax loss vs. correlation loss. * Figure borrowed from [Deng et al. 2017].
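
The uncentered Pearson correlation named on this slide skips the mean subtraction, which makes it the cosine of the angle between two vectors (hence "angle similarity"). A minimal sketch, with an assumed function name and toy vectors:

```python
import math


def uncentered_pearson(x, y):
    """Correlation without mean subtraction: the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


print(uncentered_pearson([1, 2, 3], [2, 4, 6]))     # same direction
print(uncentered_pearson([1, 0], [0, 1]))           # orthogonal
print(uncentered_pearson([1, 2, 3], [-1, -2, -3]))  # opposite direction
```

The value depends only on direction, not on vector length, so rescaling a feature vector leaves its similarity to a class direction unchanged. This sketch shows the similarity measure only; it is not the full loss from [Deng et al. 2017].
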
  25. Canonical Correlation Analysis: Common Representation Learning. Deep Generalized [Benton et al. 2017]; Deep Discriminative [Dorfer and Widmer 2016]; Deep Variational [Wang et al. 2016]; Soft Decorrelation [Chang et al. 2017]. Maximize correlation of the views when projected to a common subspace. Minimize self- and cross-reconstruction error and maximize correlation. Leverage CRL for transfer learning: high commercial potential.
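
For reference, the objective "maximize correlation of the views when projected to a common subspace" is, in classical linear CCA, usually written as below. The notation is an assumption, since the slide gives no formula: x and y are the two views, Σ_xx and Σ_yy their covariance matrices, Σ_xy their cross-covariance, and a, b the projection vectors. The deep variants listed on the slide replace the linear projections with learned networks.

```latex
(a^{*}, b^{*}) = \arg\max_{a,\, b}\;
  \frac{a^{\top} \Sigma_{xy}\, b}
       {\sqrt{a^{\top} \Sigma_{xx}\, a}\;\sqrt{b^{\top} \Sigma_{yy}\, b}}
```
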
  27. Spurious Correlation: Long History
  28. Spurious Correlation: Lack of Context
  29. Nonsense Correlation: Long History
  30. Readings: Surveys & Books. “Parallel Distributed Processing”, Rumelhart and McClelland (Eds.), 1986. “Neurocomputing: Foundations of Research”, Anderson and Rosenfeld, 1988. “The Roots of Backpropagation”, Werbos, 1994. “Neural Networks: A Systematic Introduction”, Rojas, 1996.
  31. Readings: Surveys & Books. “Deep Learning”, LeCun et al., 2015. “Deep Learning in Neural Networks: An Overview”, Schmidhuber, 2015. “Deep Learning”, Goodfellow et al., 2016. “Neuro-Dynamic Programming”, Bertsekas, 1996.
  32. [Werbos ’74] Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. [Parker ’82] Learning Logic. [Rumelhart, Hinton and Williams ’86] Learning internal representations by error propagation. [Lippmann ’87] An introduction to computing with neural networks.
  33. [Widrow and Lehr ’90] 30 Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation. [Wang and Raj ’17] On the Origin of Deep Learning. [Arulkumaran et al. ’17] A Brief Survey of Deep Reinforcement Learning. [Alom et al. ’18] The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches.
  34. [Higham and Higham ’18] Deep Learning: An Introduction for Applied Mathematicians. [Marcus ’18] Deep Learning: A Critical Appraisal.
  35. Readings: Anomaly Detection. Understanding anomaly detection. “Variational Inference For On-Line Anomaly Detection In High-Dimensional Time Series”, by Sölch et al., 2016. “Anomaly Detection Using Autoencoders with Nonlinear Dimensionality Reduction”, by Sakurada and Yairi, 2014. “On the Runtime-Efficacy Trade-off of Anomaly Detection Techniques for Real-Time Streaming Data”, by Choudhary et al., 2017.
  36. Readings: RNNs. “Learning to Forget: Continual Prediction with LSTM”, by Gers et al., 2000. “Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling”, by Chung et al., 2014. “An Empirical Exploration Of Recurrent Network Architectures”, by Jozefowicz et al., 2015. “On the Properties of Neural Machine Translation: Encoder–Decoder Approaches”, by Cho et al., 2014. “Visualizing and Understanding Recurrent Networks”, by Karpathy et al., 2015. “LSTM: A Search Space Odyssey”, by Greff et al., 2017.
  37. Readings: Deep Learning based Multi-View Learning. “Deep Multimodal Autoencoders”, by Ngiam et al., 2011. “Extending Long Short-Term Memory for Multi-View Structured Learning”, by Rajagopalan et al., 2016. “Compressing Recurrent Neural Network With Tensor Train”, by Tjandra et al., 2017. “Deep Canonically Correlated Autoencoders”, by Wang et al., 2015. “Multimodal Tensor Fusion Network”, by Zadeh et al., 2017. “Memory Fusion Network for Multi-View Sequential Learning”, by Zadeh et al., 2018.
  38. Resources: History of Neural Networks
  39. Resources: Transfer Learning. “Learning To Learn”, by Thrun and Pratt (Eds.), 1998. “Transfer Learning”, by Torrey and Shavlik, 2009. “A Survey on Transfer Learning”, by Pan and Yang, 2009. “Learning to Remember Rare Events”, by Kaiser et al., 2017.
  40. Resources: Potpourri. “Are Loss Functions All the Same?”, by Rosasco et al., 2004. “Some Thoughts About The Design Of Loss Functions”, by Hennig and Kutlukaya, 2007. “On Loss Functions for Deep Neural Networks in Classification”, by Janocha and Czarnecki, 2017. “A More General Robust Loss Function”, by Barron, 2018.