In this talk we overview Sequence-2-Sequence (S2S) and explore its early use cases. We walk the audience through how to leverage S2S modeling for several use cases, particularly with regard to real-time anomaly detection and forecasting.


- 1. Sequence-2-Sequence Learning for Time Series Forecasting. ARUN KEJARIWAL, IRA COHEN
- 2. ABOUT US
- 3. TIME SERIES FORECASTING: Meteorology, Machine Translation, Operations, Transportation, Econometrics, Marketing, Sales, Finance, Speech Synthesis
- 4. AN EXAMPLE (Figure borrowed from Brockwell and Davis.)
- 5. STRUCTURAL CHARACTERISTICS: Heteroscedasticity; Changepoint; Anomalies, Extreme Values; Trend + Seasonality (Figure borrowed from Hyndman et al. 2015.)
- 6. FLAVORS OF TIME SERIES FORECASTING (Figure borrowed from Tao et al. 2018.)
- 7. Long History: Research, Books. [Faulkner, Comstock, Fossum] [Craw] [Brockwell, Davis] [Chatfield] [Bowerman, O’Connell, Koehler] [Granger, Newbold]
- 8. [Gilchrist] [Hyndman, Athanasopoulos] [Box et al.] [Wilson, Keating] [Makridakis et al.] [Mallios] [Montgomery et al.] [Pankratz]
- 10. PROPERTIES. Seasonality: multiple levels (weekly, monthly, yearly) or non-seasonal (aperiodic). Stationarity: time-varying mean and variance (heteroskedasticity), exogenous shocks. Structural: unevenly spaced, missing data, anomalies, changepoints, small sample size, skewness, kurtosis, chaos, noise. Trend: growth, virality (network effects), non-linearity.
- 15. BACKPROPAGATION THROUGH TIME (Figure borrowed from Lillicrap and Santoro, 2019.)
- 17. REAL-TIME RECURRENT LEARNING: A Learning Algorithm for Continually Running Fully Recurrent Neural Networks [Williams and Zipser, 1989]; A Method for Improving the Real-Time Recurrent Learning Algorithm [Catfolis, 1993]
- 18. APPROXIMATE RTRL. UORO [Unbiased Online Recurrent Optimization]: works in a streaming fashion (online, memoryless); avoids backtracking through past activations and inputs; low-rank approximation to forward-mode automatic differentiation; reduced computation and storage. KF-RTRL [Kronecker Factored RTRL]: Kronecker product decomposition to approximate the gradients; reduces noise in the approximation (asymptotically, smaller by a factor of n); memory requirement equivalent to UORO; higher computation than UORO; not applicable to arbitrary architectures. References: Unbiased Online Recurrent Optimization [Tallec and Ollivier, 2017]; Approximating Real-Time Recurrent Learning with Random Kronecker Factors [Mujika et al. 2018]
- 20. MEMORY-BASED RNN ARCHITECTURES. BRNN: Bi-directional RNN [Schuster and Paliwal, 1997]; GLU: Gated Linear Unit [Dauphin et al. 2016]; LSTM: Long Short-Term Memory [Hochreiter and Schmidhuber, 1996]; GRU: Gated Recurrent Unit [Cho et al. 2014]; GHN: Gated Highway Network [Zilly et al. 2017]
- 21. Neural Computation, 1997. Gates: (a) Forget gate, (b) Input gate, (c) Output gate; St: hidden state. “The LSTM’s main idea is that, instead of computing St from St-1 directly with a matrix-vector product followed by a nonlinearity, the LSTM directly computes ΔSt, which is then added to St-1 to obtain St.” [Jozefowicz et al. 2015] Resistant to the vanishing gradient problem; achieves better results when dropout is used; adding a bias of 1 to the LSTM’s forget gate. (Figure borrowed from http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
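The additive update and the forget-gate bias on this slide can be illustrated with a minimal scalar sketch. This is not the speakers' code: the function `lstm_step`, the weight dictionary layout, and the zero-valued weights are hypothetical choices made only to keep the example small.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w, forget_bias=1.0):
    """One step of a scalar LSTM cell (hypothetical, illustrative layout).

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget") + forget_bias)  # share of old cell state kept
    i = sigmoid(pre("input"))                 # share of candidate written
    o = sigmoid(pre("output"))                # share of cell state exposed
    g = math.tanh(pre("candidate"))           # candidate update
    c = f * c_prev + i * g                    # additive cell-state update
    h = o * math.tanh(c)                      # new hidden state
    return h, c


# Assumption: all-zero weights, so only the forget bias shapes the gates.
zero = {name: (0.0, 0.0, 0.0)
        for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=0.5, h_prev=0.0, c_prev=1.0, w=zero)
```

With the extra bias of 1, the forget gate starts out above 0.5, so more of the previous cell state is carried forward at the start of training.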
- 22. LONG CREDIT ASSIGNMENT PATHS. Stacking d RNNs; recurrence depth d. Incorporates Highway layers inside the recurrent transition; Highway layers in RHNs perform adaptive computation (Transform, Carry; H, T, C: non-linear transforms). Regularization: variational inference based dropout. (Figure borrowed from Zilly et al. 2017.)
- 23. NEW FLAVORS OF RNNs (Figure borrowed from https://distill.pub/2016/augmented-rnns/)
- 24. What caught your eye at first glance?
- 25. And this one? (Figure borrowed from Golub et al. 2012.)
- 26. Psychology, Neuroscience, Cognitive Sciences [1959] [1974] [1956] Span of absolute judgement
- 28. ATTENTION MECHANISM (Figure borrowed from https://distill.pub/2016/augmented-rnns/)
- 29. ATTENTION MECHANISM (Figure borrowed from Lillicrap and Santoro, 2019.)
- 31. ATTENTION FAMILY. Self: relates different positions of a single sequence in order to compute a representation of the same sequence; also referred to as intra-attention. Global vs. Local: Global: alignment weights a_t are inferred from the current target state and all the source states; Local: alignment weights a_t are inferred from the current target state and those source states in the window. Soft vs. Hard: Soft: alignment weights are learned and placed “softly” over all patches in the source image; Hard: only selects one patch of the image to attend to at a time.
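The global/local and soft/hard distinctions can be made concrete with a small sketch. It assumes dot-product scoring and a fixed window given by `center` and `half_width`; those names, the `attend` function, and the toy vectors are hypothetical. Taking the arg-max weight as the "hard" choice is a simplification, since hard attention is usually described as a stochastic selection.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(target, sources, center=None, half_width=None):
    """Soft attention with dot-product scores (an assumed scoring function).

    Global: every source state receives a weight.
    Local: only states within half_width of center receive a weight.
    """
    n = len(sources)
    if center is None:
        idx = list(range(n))
    else:
        idx = [j for j in range(n) if abs(j - center) <= half_width]
    window_weights = softmax([dot(target, sources[j]) for j in idx])
    weights = [0.0] * n
    for j, w in zip(idx, window_weights):
        weights[j] = w
    dim = len(sources[0])
    # Context vector: weighted sum of the source states.
    context = [sum(weights[j] * sources[j][d] for j in range(n))
               for d in range(dim)]
    return weights, context


sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
target = [1.0, 0.0]
global_w, global_ctx = attend(target, sources)
local_w, local_ctx = attend(target, sources, center=2, half_width=1)
# Simplified "hard" choice: keep only the single highest-weighted state.
hard_index = max(range(len(sources)), key=lambda j: global_w[j])
```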
- 32. ATTENTION-BASED MODELS [A SNAPSHOT]: Sparse Attentive Backtracking [Ke et al. 2018]; Hierarchical Attention-Based RHN [Tao et al. 2018]; Long Short-Term Memory-Networks [Cheng et al. 2016]; Self-Attention GAN [Zhang et al. 2018]
- 33. HIERARCHICAL ATTENTION-BASED RECURRENT HIGHWAY NETWORK (Figure borrowed from Tao et al. 2018.)
- 34. SPARSE ATTENTIVE BACKTRACKING: TCA THROUGH REMINDING. Inspired by the cognitive analogy of reminding (designed to retrieve one or very few past states); incorporates a differentiable, sparse (hard) attention mechanism to select from past states. (Figure borrowed from Ke et al. 2018.)
- 35. HEALTH CARE. Multi-variate: sensor measurements, test results; irregular sampling, missing values and measurement errors; heterogeneous, presence of long-range dependencies. Inference: diagnoses, length of stay, future illness, mortality. Multi-head attention: additional masking to enable causality. Temporal ordering: positional encoding & dense interpolation embedding. (Figure borrowed from Song et al. 2018.)
- 36. TIME SERIES FORECASTING: ON THE ROLE OF PRE-PROCESSING TO GET IT RIGHT
- 37. ANODOT MISSION: MAKING BI AUTONOMOUS. Auto ML; Trend; Anomaly; Root Cause; Forecast; What If; Optimization; Real-time; No Code; Business Monitoring; Business Forecast; No Data Scientist
- 38. SOME OF OUR CUSTOMERS: GAMING, ECOMMERCE, AD TECH, TELCOM, ENTERPRISE, INTERNET, IOT, FINTECH, BIG SOCIAL NETWORK
- 39. FORECAST USE CASES. DEMAND FORECAST: anticipate demand for inventory, products, service calls and much more. GROWTH FORECAST: anticipate revenue growth, expenses, cash flow and other KPIs. Example questions: How many drivers will I need tomorrow? How much funding do I need to allocate per currency? Will we hit our targets next quarter? Departments shown: Fintech / Treasury; Transportation / Data Science; Transportation / Business Operations; All Industries / Finance.
- 40. AI-POWERED FORECASTING IN A TURN-KEY EXPERIENCE
- 41. Correlate with Public Data PRODUCT COMPONENTS
- 42. CONSIDERATIONS FOR AN ACCURATE FORECAST: 1. Discovering influencing metrics and events; 2. Ensemble of models; 3. Identify and account for data anomalies; 4. Identify and account for different time series behaviors
- 43. HOW TO DISCOVER INFLUENCING METRICS/EVENTS? INPUT: target time series + forecast horizon; millions of measures/events that can be used as features. PROCEDURE: STEP 1: compute the correlation between the target and each measure/event (shifted by the horizons); STEP 2: choose the X most correlated measures; STEP 3: train the forecast model. CHALLENGES: Step 1 is computationally expensive for long sequences (use LSH for speed); which correlation function to use?
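Steps 1 and 2 of this procedure can be sketched as a brute-force search. The sketch assumes Pearson correlation as the correlation function (the slide leaves that choice open) and omits the LSH speed-up; `top_lagged_features` and the toy series are hypothetical.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # a constant series carries no linear signal
    return cov / (sx * sy)


def top_lagged_features(target, candidates, horizon, top_x):
    """Rank candidates by |corr(candidate[t], target[t + horizon])|.

    Assumes every candidate has the same length as the target.
    """
    future_target = target[horizon:]
    scored = []
    for name, series in candidates.items():
        past = series[:len(series) - horizon]  # align with the shifted target
        scored.append((abs(pearson(past, future_target)), name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_x]]


# Toy data: "leading" is the target seen 5 steps early.
target = [math.sin(0.3 * t) for t in range(200)]
candidates = {
    "leading": [math.sin(0.3 * (t + 5)) for t in range(200)],
    "constant": [1.0] * 200,
    "unrelated": [((t * 37) % 11) / 11.0 for t in range(200)],
}
selected = top_lagged_features(target, candidates, horizon=5, top_x=1)
```

The selected names would then feed Step 3 as extra input features for the forecast model.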
- 44. THE EFFECT ON ACCURACY
- 45. IDENTIFYING AND ACCOUNTING FOR DATA ANOMALIES ANOMALIES DEGRADE FORECASTING ACCURACY How to remedy the situation? Discover anomalies and use the information to create new features: Case 1: Anomalies can be explained by external factors – enhance the anomalies Case 2: Anomalies can’t be explained by external factors – weight down the anomalies
- 46. IDENTIFYING AND ACCOUNTING FOR DATA ANOMALIES. SOLUTION: Discover anomalies and use the information to create new features: Case 1: Anomalies can be explained by external factors – enhance the anomalies; Case 2: Anomalies can’t be explained by external factors – weight down the anomalies
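One way to read the two cases is as an indicator feature (Case 1) and a reduced training weight (Case 2). The sketch below takes that reading as an assumption. The detector (a median/MAD robust z-score with a 3.5 cutoff), the `down_weight` value, and the `explained` set of time indices are all hypothetical stand-ins, not the method used in the talk.

```python
import statistics


def anomaly_flags(series, threshold=3.5):
    """Flag points whose robust z-score (median/MAD) exceeds threshold."""
    med = statistics.median(series)
    mad = statistics.median([abs(v - med) for v in series])
    if mad == 0:
        return [False] * len(series)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in series]


def anomaly_features(series, explained, down_weight=0.1):
    """Build an indicator feature and per-point training weights.

    explained: indices of anomalies attributed to a known external factor.
    """
    flags = anomaly_flags(series)
    # Case 1: explained anomalies become an explicit input feature.
    indicator = [1.0 if f and t in explained else 0.0
                 for t, f in enumerate(flags)]
    # Case 2: unexplained anomalies count less in the training loss.
    weights = [down_weight if f and t not in explained else 1.0
               for t, f in enumerate(flags)]
    return indicator, weights


series = [10.0, 11.0, 9.0, 10.0, 50.0, 10.0, 11.0, 9.0, 45.0, 10.0]
indicator, weights = anomaly_features(series, explained={4})
```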
- 47. IDENTIFYING AND ACCOUNTING FOR DATA ANOMALIES: RESULT 1-15% accuracy improvement
- 48. Handling Variations In Time Series Behaviors. Varying behaviors: seasonality (length/strength), stationarity, trends, sparseness, spikiness, … Features (Rob Hyndman, tsfeatures, https://pkg.robjhyndman.com/tsfeatures): length, trend, seasonal strength, linearity, curvature, e_acf1, e_acf10, peak, trough, stability, entropy, x_acf1, x_acf10, diff1_acf1, diff1_acf10, diff2_acf1, diff2_acf10, seas_acf1, arch_acf, garch_acf, arch_r2, garch_r2, hurst, lumpiness, spike, max_level_shift, time_level_shift, max_var_shift, time_var_shift, max_kl_shift, time_kl_shift, unitroot_kpss, unitroot_pp, x_pacf5, diff1x_pacf5, diff2x_pacf5, seas_pacf, crossing_points, flat_spots
- 49. A SINGLE MODEL FOR THEM ALL? POTENTIAL ADVANTAGES: train one model for many time series; less data required per time series. OPEN QUESTIONS: Will a single model be more accurate than individual ones? Which types of differing behaviors impact the ability to train a single model adversely, and which do not?
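To give a feel for what a few of the autocorrelation features on the tsfeatures slide measure, here is a plain-Python sketch. It is not the tsfeatures implementation (an R package); the `acf`, `diff` and `behavior_features` helpers are hypothetical, and `x_acf10` follows the package's documented definition as the sum of squares of the first ten autocorrelation coefficients.

```python
def acf(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0.0:
        return 0.0  # constant series: treat autocorrelation as zero
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


def diff(x):
    """First difference of the series."""
    return [b - a for a, b in zip(x, x[1:])]


def behavior_features(x):
    """A small subset of autocorrelation-based behavior features."""
    d1 = diff(x)
    return {
        "x_acf1": acf(x, 1),
        "x_acf10": sum(acf(x, k) ** 2 for k in range(1, 11)),
        "diff1_acf1": acf(d1, 1),
    }


trend = [float(t) for t in range(50)]   # strong trend: x_acf1 close to 1
alternating = [1.0, -1.0] * 10          # flips every step: x_acf1 negative
trend_features = behavior_features(trend)
alternating_features = behavior_features(alternating)
```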
- 50. A SINGLE MODEL FOR THEM ALL? TESTING THE IMPACT OF EACH BEHAVIOR TYPE. Dataset: compute the strength of the behavior for each TS, then split into high-strength TS and low-strength TS. Train: an LSTM for each TS, and one LSTM for all TS. Benchmark: horizontal-line forecast, with benchmark loss measured as absolute error. Score = model loss / benchmark loss, reported as Score 1 through Score 5.
- 51. A SINGLE MODEL FOR THEM ALL? TESTING THE IMPACT OF EACH BEHAVIOR TYPE (by feature). Score 1: low; Score 2: low; Score 3: low/high; Score 4: high; Score 5: high. Comparisons: impact of the behavior on mixed training; impact of the behavior on ability to forecast.
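The score on slide 50 is a ratio of model loss to benchmark loss, which can be sketched as follows. The slide does not say where the horizontal line sits, so placing it at the last training value is an assumption, as are the function names and the toy numbers.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def benchmark_score(train, test, model_forecast):
    """Model loss divided by the loss of a horizontal-line benchmark."""
    # Assumption: the horizontal line is drawn at the last training value.
    flat = [train[-1]] * len(test)
    benchmark_loss = mean_absolute_error(test, flat)
    model_loss = mean_absolute_error(test, model_forecast)
    # Undefined when the benchmark is perfect (benchmark_loss == 0).
    return model_loss / benchmark_loss


train = [1.0, 2.0, 3.0, 4.0]
test = [5.0, 6.0, 7.0]
score = benchmark_score(train, test, model_forecast=[5.5, 6.0, 6.5])
```

A score below 1 means the model beats the flat benchmark. Dividing by the benchmark loss makes scores comparable across series of different scale.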
- 52. Figure axes: impact on accuracy for joint training; impact on accuracy for variability of the behavior. Features plotted: seasonal_strength, curvature, x_pacf5, linearity, hurst, x_acf1, entropy, max_level_shift, time_level_shift, max_var_shift, time_kl_shift, unitroot_kpss, unitroot_pp, seasonal frequency, arch_acf, garch_acf, seas_pacf, trough, peak, stability, lumpiness, diff2_acf10, e_acf10, diff1_acf10, x_acf10, max_kl_shift. Highlighted groups: Seasonality, Homoscedasticity.
- 53. MAIN CONCLUSIONS / SOLUTIONS. Two main factors prevent simple training of single models: Seasonality: the frequency is the important factor, not the shape; Homoscedasticity (same variance): prevents mixing, but its strength impacts accuracy overall. Other behaviors have lower mixing impact. SOLUTIONS: separate TS for training based on behavior; embed behavior-related features for single-model training. (Figure axes: impact on accuracy for joint training; impact on accuracy for variability of the behavior.)
- 54. KEY TAKEAWAYS: 1. Discovering influencing metrics and events: requires efficient feature selection. 2. Identify and account for data anomalies: preprocessing before training boosts forecast accuracy. 3. Identify and account for different time series behaviors: seasonality and homoscedasticity are the key behaviors impacting the ability to train joint models.
- 55. Thank you
- 56. READINGS [BOOKS]: [Rosenblatt] Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; [Eds. Anderson and Rosenfeld] Neurocomputing: Foundations of Research; [Eds. Rumelhart and McClelland] Parallel Distributed Processing; [Werbos] The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting; [Eds. Chauvin and Rumelhart] Backpropagation: Theory, Architectures and Applications; [Rojas] Neural Networks: A Systematic Introduction
- 57. READINGS [EARLY WORKS]: Perceptrons [Minsky and Papert, 1969]; Une procedure d'apprentissage pour reseau a seuil assymetrique [Le Cun, 1985]; The problem of serial order in behavior [Lashley, 1951]; Beyond regression: New tools for prediction and analysis in the behavioral sciences [Werbos, 1974]; Connectionist models and their properties [Feldman and Ballard, 1982]; Learning-logic [Parker, 1985]
- 58. READINGS [BACKPROPAGATION]: Learning internal representations by error propagation [Rumelhart, Hinton, and Williams, Chapter 8 in D. Rumelhart and J. McClelland, Eds., Parallel Distributed Processing, Vol. 1, 1986] (Generalized Delta Rule); Generalization of backpropagation with application to a recurrent gas market model [Werbos, 1988]; Generalization of backpropagation to recurrent and higher order networks [Pineda, 1987]; Backpropagation in perceptrons with feedback [Almeida, 1987]; Second-order backpropagation: Implementing an optimal O(n) approximation to Newton's method in an artificial neural network [Parker, 1987]; Learning phonetic features using connectionist networks: an experiment in speech recognition [Watrous and Shastri, 1987] (Time-delay NN)
- 59. READINGS [BACKPROPAGATION]: Backpropagation: Past and future [Werbos, 1988]; Adaptive state representation and estimation using recurrent connectionist networks [Williams, 1990]; Generalization of back propagation to recurrent and higher order neural networks [Pineda, 1988]; Learning state space trajectories in recurrent neural networks [Pearlmutter 1989]; Parallelism, hierarchy, scaling in time-delay neural networks for spotting Japanese phonemes/CV-syllables [Sawai et al. 1989]; The role of time in natural intelligence: implications for neural network and artificial intelligence research [Klopf and Morgan, 1990]
- 60. READINGS [REGULARIZATION of RNNs]: Recurrent Neural Network Regularization [Zaremba et al. 2014]; Regularizing RNNs by Stabilizing Activations [Krueger and Memisevic, 2016]; Sampling-based Gradient Regularization for Capturing Long-Term Dependencies in Recurrent Neural Networks [Chernodub and Nowicki 2016]; A Theoretically Grounded Application of Dropout in Recurrent Neural Networks [Gal and Ghahramani, 2016]; Noisin: Unbiased Regularization for Recurrent Neural Networks [Dieng et al. 2018]; State-Regularized Recurrent Neural Networks [Wang and Niepert, 2019]
- 61. READINGS [ATTENTION & TRANSFORMERS]: A Decomposable Attention Model for Natural Language Inference [Parikh et al. 2016]; Hybrid Computing Using A Neural Network With Dynamic External Memory [Graves et al. 2017]; Image Transformer [Parmar et al. 2018]; Universal Transformers [Dehghani et al. 2019]; The Evolved Transformer [So et al. 2019]; Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [Dai et al. 2019]
- 62. READINGS [TIME SERIES PREDICTION]: Financial Time Series Prediction using hybrids of Chaos Theory, Multi-layer Perceptron and Multi-objective Evolutionary Algorithms [Ravi et al. 2017]; Model-free Prediction of Noisy Chaotic Time Series by Deep Learning [Yeo, 2017]; DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks [Salinas et al. 2017]; Real-Valued (Medical) Time Series Generation With Recurrent Conditional GANs [Hyland et al. 2017]; R2N2: Residual Recurrent Neural Networks for Multivariate Time Series Forecasting [Goel et al. 2017]; Temporal Pattern Attention for Multivariate Time Series Forecasting [Shih et al. 2018]
- 63. READINGS [POTPOURRI]: Unbiased Online Recurrent Optimization [Tallec and Ollivier, 2017]; Approximating real-time recurrent learning with random Kronecker factors [Mujika et al. 2018]; Theory and Algorithms for Forecasting Time Series [Kuznetsov and Mohri, 2018]; Foundations of Sequence-to-Sequence Modeling for Time Series [Kuznetsov and Mariet, 2018]; On the Variance of Unbiased Online Recurrent Optimization [Cooijmans and Martens, 2019]; Backpropagation through time and the brain [Lillicrap and Santoro, 2019]
- 64. RESOURCES: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ ; http://karpathy.github.io/2015/05/21/rnn-effectiveness/ ; A review of Dropout as applied to RNNs: https://medium.com/@bingobee01/a-review-of-dropout-as-applied-to-rnns-72e79ecd5b7b ; https://distill.pub/2016/augmented-rnns/ ; https://distill.pub/2019/memorization-in-rnns/ ; https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html ; Using the latest advancements in deep learning to predict stock price movements: https://towardsdatascience.com/aifortrading-2edd6fac689d ; How to Use Weight Regularization with LSTM Networks for Time Series Forecasting: https://machinelearningmastery.com/use-weight-regularization-lstm-networks-time-series-forecasting/