Report

Share

Follow

•24 likes•5,122 views

•24 likes•5,122 views

Report

Share

Statistical Learning Based Anomaly Detection @ Twitter

Follow

- 1. Sta$s$cal Learning Based Anomaly Detec$on @ Twi9er Arun Kejariwal (@arun_kejariwal) Joint work with Jordan Hochenbaum and Owen Vallis November 2014
- 2. Internet trends • Real-time [1] h9p://techcrunch.com/2014/05/05/amazon-‐extends-‐its-‐shopping-‐cart-‐to-‐twi9er/ AK 2 [1]
- 3. Twi9er: Global Town Square AK 3
- 4. Data Fidelity • Data-driven decision making q Evolving product landscape • Data partners q Nielsen q Dataminr • Operational q Performance and Availability AK 4
- 5. Data Fidelity: Challenges • Anomalies q Exogenic factors § User behavior § Events § Data center q Endogenic factors § Agile development o Fail fast § Data collection • Millions of time series [1,2] q Scalability AK 5 [1] h9p://strata.oreilly.com/2013/09/how-‐twi9er-‐monitors-‐millions-‐of-‐$me-‐series.html [2] h9p://strataconf.com/strata2014/public/schedule/detail/32431
- 6. Anomaly Detec$on: Why Bother? • Analyze User Engagement q Events § Super Bowl, Japanese New Year q Year over year analysis (input to forecasting) • Identify Attacks q DoS q Malware attacks • Identify Bots q Separating actual users from spam AK 6
- 7. Anomaly Detec$on • Visual q Prone to errors q Not scalable § Machine generated data 11% of the digital universe in 2005 to > 40% by 2020 [1] § Cloud Infrastructure 2013-2017 CAGR ~50% [2] • Algorithmic approach q Automate! [1] h9p://www.emc.com/about/news/press/2012/20121211-‐01.htm AK 7 [2] h9p://www.forbes.com/sites/gilpress/2013/12/12/16-‐1-‐billion-‐big-‐data-‐market-‐2014-‐predic$ons-‐from-‐idc-‐and-‐iia/
- 8. Anomaly Detec$on: Background • Over 50 years of research [1] q Statistics § Extreme Value Theory § Robust Statistics, Grubb’s Test, ESD q Econometrics q Finance § Value at Risk (VaR) q Signal Processing q Music Information Retrieval q Networking q E- Commerce q Performance Regression [1] “Anomaly Detec$on” by Chandola et al. ACM Compu$ng Surveys, 2009. AK 8 Jon from Etsy Toufic from Metafor
- 9. Anomaly Detec$on: Overview • Definition q “An anomaly is an observation that deviates so much from other observations so as to arouse suspicions that it is was generated by a different mechanism” [1,2] [1] “Iden$fica$on of outliers” by Hawkins, Douglas M. London: Chapman and Hall, 1980. AK 9 [2] “Outlier Analysis” by Charu C. Aggarwal. Springer, 2013.
- 10. Anomaly Detec$on • Characterization q Magnitude q Width q Frequency q Direction AK 10
- 11. Anomaly Detec$on (contd.) • Two flavors q Global § Max Value q Local § Intra-day AK 11 Global Local
- 12. Anomaly Detec$on (contd.) • Traditional Approaches q Metrics § Mean μ § Variance σ q Rule of thumb § μ + 3*σ q Which time series? § Raw § Moving Averages o SMA, EWMA, PEWMA AK 12 3 * σ
- 13. Anomaly Detec$on (contd.) • Impact of multi-modal distribution q μ Shift ~ 0.2% q Inflates σ by 4.5% § Miss quite a few anomalies q What do multiple modes correspond to? § Seasonality AK 13
- 14. • Robust Statistics q MAD § Robust Breakdown point o Median 50% vs. Mean 0% q σMAD § K = 1.4826 for normally distributed data AK 14 Anomaly Detec$on (contd.)
- 15. • Limitations of using MAD AK 15 Anomaly Detec$on (contd.)
- 16. • Grubb’s Test q Critical value is derived from data using a statistical confidence (α) • Limitations q Assumes data distribution is normal q Good for detecting ONLY 1 outlier q Seasonality unaware AK 16 Anomaly Detec$on (contd.)
- 17. • ESD (Generalized Extreme Studentized Deviate) [1] q Critical value (λi) re-calculated every iteration q Largest i such that Ri > λi determines # of anomalies q An upper-bound on the number of anomalies is an input parameter • Limitations q Generalized ESD assumes a “normal” distribution q Seasonality unaware AK 17 Anomaly Detec$on (contd.) [1] Rosner, Bernard. “Percentage Points for a Generalized ESD Many-‐outlier Procedure.” Technometrics 25, no. 2 (1983): 165–172.
- 18. Our Approach
- 19. • Addressing Seasonality q Key Idea § Time Series Decomposition AK 19 Anomaly Detec$on (contd.)
- 20. • Determining seasonal component q Regression on sub-cycle plots [1] AK 20 Anomaly Detec$on (contd.) [1] “STL: A seasonal-‐trend decomposi$on procedure based on loess” by Cleveland, et al. Journal of Official Sta$s$cs, Vol. 6, Issue 1, 1990.
- 21. • Impact of removal of seasonal and trend q Transforms our multi-modal data into unimodal data. § Amenable to ESD/MAD! AK 21 Anomaly Detec$on (contd.) The decomposed Residual becomes "Uni-modal". This significantly shrinks the value of sigma. The original "Multi-Modal" Raw Data has a much wider value for sigma, leading ESD to miss a lot of the outliers.
- 22. Trend Smoothing Distortion Creates “Phantom” Anomalies • Challenges remain! AK 22 Anomaly Detec$on (contd.)
- 23. • Marrying Robust Statistics with Seasonal Decomposition AK 23 Anomaly Detec$on (contd.) Median is Free from Distortion
- 24. • Applying ESD on the Residual AK 24 Anomaly Detec$on (contd.) Decomposition Exposes Anomalies
- 25. • Recap q Extract the seasonal component using STL § Filters out periodic spikes q Residual = Raw - Seasonalraw- Medianraw q Run ESD on residual (using median and MAD) AK 25 Anomaly Detec$on (contd.)
- 26. • Illustrative example AK 26 Anomaly Detec$on (contd.)
- 27. • Applications q Three perspectives § Capacity o CPU utilization o Garbage collection o Network activity § User behavior o Events • Impressions • Link clicks o Spam § Forecasting AK 27 Anomaly Detec$on (contd.)
- 28. • Deployed in production q Used by large number of services at Twitter q Automatic e-mail notification § Only sent if anomalies are present § Anomalies annotated § CSV with anomaly locations attached AK 28 Anomaly Detec$on (contd.)
- 29. • Skyline from Etsy q https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py • Coming soon! q R package AK 29 Open Sourcing
- 30. Join the Flock Like problem solving? Like challenges? Be at cukng Edge Make an impact • We are hiring!! q https://twitter.com/JoinTheFlock q https://twitter.com/jobs q Contact us: @arun_kejariwal AK 30