4. Data Fidelity
• Data-driven decision making
q Evolving product landscape
• Data partners
q Nielsen
q Dataminr
• Operational
q Performance and Availability
AK
5. Data Fidelity: Challenges
• Anomalies
q Exogenic factors
§ User behavior
§ Events
§ Data center
q Endogenic factors
§ Agile development
o Fail fast
§ Data collection
• Millions of time series [1,2]
q Scalability
[1] http://strata.oreilly.com/2013/09/how-twitter-monitors-millions-of-time-series.html
[2] http://strataconf.com/strata2014/public/schedule/detail/32431
6. Anomaly Detection: Why Bother?
• Analyze User Engagement
q Events
§ Super Bowl, Japanese New Year
q Year over year analysis (input to forecasting)
• Identify Attacks
q DoS
q Malware attacks
• Identify Bots
q Separating actual users from spam
7. Anomaly Detection
• Visual
q Prone to errors
q Not scalable
§ Machine-generated data: 11% of the digital universe in 2005, growing to > 40% by 2020 [1]
§ Cloud Infrastructure 2013-2017 CAGR ~50% [2]
• Algorithmic approach
q Automate!
[1] http://www.emc.com/about/news/press/2012/20121211-01.htm
[2] http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/
8. Anomaly Detection: Background
• Over 50 years of research [1]
q Statistics
§ Extreme Value Theory
§ Robust Statistics, Grubbs' Test, ESD
q Econometrics
q Finance
§ Value at Risk (VaR)
q Signal Processing
q Music Information Retrieval
q Networking
q E-Commerce
q Performance Regression
[1] “Anomaly Detection” by Chandola et al. ACM Computing Surveys, 2009.
Jon from Etsy, Toufic from Metafor
9. Anomaly Detection: Overview
• Definition
q “An anomaly is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [1,2]
[1] “Identification of Outliers” by Douglas M. Hawkins. London: Chapman and Hall, 1980.
[2] “Outlier Analysis” by Charu C. Aggarwal. Springer, 2013.
10. Anomaly Detection
• Characterization
q Magnitude
q Width
q Frequency
q Direction
11. Anomaly Detection (contd.)
• Two flavors
q Global
§ Max Value
q Local
§ Intra-day
12. Anomaly Detection (contd.)
• Traditional Approaches
q Metrics
§ Mean μ
§ Standard deviation σ
q Rule of thumb
§ μ + 3*σ
q Which time series?
§ Raw
§ Moving Averages
o SMA, EWMA, PEWMA
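As a minimal sketch of the rule of thumb above, the Python below flags points that lie more than three standard deviations from the mean and also shows one way to compute an EWMA of the series. The function names, the smoothing factor `alpha`, and the sample data are illustrative assumptions, not taken from the talk.

```python
import statistics

def three_sigma_anomalies(values, k=3.0):
    """Return indices of points farther than k standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return []  # constant series: nothing can deviate
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average; alpha is an assumed smoothing factor."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical metric with a single spike at index 14.
series = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 11, 10, 60, 10, 11, 9, 10, 11]

raw_hits = three_sigma_anomalies(series)   # rule applied to the raw series
smoothed = ewma(series)                    # alternative input: the smoothed series
```

The same `three_sigma_anomalies` call can be applied to `smoothed`, or to the difference between `series` and `smoothed`, depending on which time series is chosen as the input.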
13. Anomaly Detection (contd.)
• Impact of multi-modal distribution
q Shifts μ by ~0.2%
q Inflates σ by 4.5%
§ Misses quite a few anomalies
q What do multiple modes correspond to?
§ Seasonality
14. Anomaly Detection (contd.)
• Robust Statistics
q MAD (Median Absolute Deviation)
§ Robust Breakdown point
o Median 50% vs. Mean 0%
q σMAD = K · MAD
§ K = 1.4826 for normally distributed data
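The robust estimates above can be sketched in a few lines of Python. The constant 1.4826 comes from the slide; the function names, the cutoff `k=3.0`, and the sample data are illustrative assumptions.

```python
import statistics

K = 1.4826  # consistency constant for normally distributed data

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def sigma_mad(values):
    """Robust estimate of the standard deviation, K * MAD."""
    return K * mad(values)

def robust_anomalies(values, k=3.0):
    """Indices of points farther than k * sigma_MAD from the median."""
    med = statistics.median(values)
    scale = sigma_mad(values)
    if scale == 0:
        return []  # more than half the points are identical; no usable scale
    return [i for i, v in enumerate(values) if abs(v - med) > k * scale]

# Hypothetical data: ten ordinary points plus two gross outliers.
contaminated = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 500, 600]

robust_hits = robust_anomalies(contaminated)

# For comparison, the classic mean/standard-deviation rule on the same data.
mu = statistics.mean(contaminated)
sigma = statistics.pstdev(contaminated)
classic_hits = [i for i, v in enumerate(contaminated) if abs(v - mu) > 3 * sigma]
```

On this data the two outliers pull the mean to about 100 and the standard deviation to about 200, so the classic rule flags nothing, while the median and MAD stay close to the bulk of the data.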
16. Anomaly Detection (contd.)
• Grubbs' Test
q Critical value is derived from the data at a chosen significance level (α)
• Limitations
q Assumes the data is normally distributed
q Detects ONLY one outlier
q Seasonality unaware
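For reference, the two-sided Grubbs' test in its standard textbook form can be written as follows, for n observations with sample mean x̄ and sample standard deviation s. The notation is the conventional one and is not taken from the slide.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
G_{\mathrm{crit}} = \frac{n-1}{\sqrt{n}}
\sqrt{\frac{t^{2}_{\alpha/(2n),\,n-2}}{\,n-2+t^{2}_{\alpha/(2n),\,n-2}\,}}
```

Here t_{α/(2n), n-2} is the upper critical value of the t-distribution with n - 2 degrees of freedom at level α/(2n). The hypothesis of no outliers is rejected when G > G_crit, which is why the critical value depends on both the sample size and α.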
17. Anomaly Detection (contd.)
• ESD (Generalized Extreme Studentized Deviate) [1]
q Critical value (λi) re-calculated every iteration
q Largest i such that Ri > λi determines # of anomalies
q An upper-bound on the number of anomalies is an input parameter
• Limitations
q Generalized ESD assumes a “normal” distribution
q Seasonality unaware
[1] Rosner, Bernard. “Percentage Points for a Generalized ESD Many-outlier Procedure.” Technometrics 25, no. 2 (1983): 165–172.
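The quantities Ri and λi above can be written out as follows, using the standard formulation of the generalized ESD test. Here n is the number of observations, k is the upper bound on the number of anomalies, and α is the significance level; the symbols p and t are conventional notation rather than the slide's.

```latex
R_i = \frac{\max_{j} \lvert x_j - \bar{x} \rvert}{s}
\quad \text{(computed on the } n-i+1 \text{ observations remaining at step } i\text{)},
\qquad i = 1, \dots, k

\lambda_i = \frac{(n-i)\, t_{p,\,n-i-1}}
{\sqrt{\left(n-i-1+t^{2}_{p,\,n-i-1}\right)\left(n-i+1\right)}},
\qquad
p = 1 - \frac{\alpha}{2\,(n-i+1)}
```

After each step the observation that maximizes |x_j - x̄| is removed, and x̄ and s are recomputed on what remains. The number of anomalies is the largest i for which R_i > λ_i.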
19. Anomaly Detection (contd.)
• Addressing Seasonality
q Key Idea
§ Time Series Decomposition
20. Anomaly Detection (contd.)
• Determining seasonal component
q Regression on sub-cycle plots [1]
[1] “STL: A Seasonal-Trend Decomposition Procedure Based on Loess” by Cleveland et al. Journal of Official Statistics, Vol. 6, Issue 1, 1990.
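To make the idea of sub-cycle series concrete, the sketch below groups observations by their position within the seasonal period and fits an ordinary least-squares line to each group. STL itself smooths each sub-cycle series with loess [1]; the straight-line fit is a deliberately simplified stand-in, and the smoothed values here still include the overall level of the series. The `period` parameter, the function names, and the data are illustrative assumptions.

```python
def cycle_subseries(values, period):
    """Group values by phase: subseries p holds values[p], values[p + period], ..."""
    return [values[p::period] for p in range(period)]

def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns the fitted values."""
    n = len(ys)
    if n < 2:
        return list(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return [y_mean + slope * (x - x_mean) for x in range(n)]

def smoothed_subcycles(values, period):
    """Smooth each sub-cycle series, then interleave the fits back into time order."""
    fitted = [fit_line(sub) for sub in cycle_subseries(values, period)]
    return [fitted[i % period][i // period] for i in range(len(values))]

# Three cycles of a period-4 pattern with one spike at index 6.
series = [1, 5, 9, 5, 1, 5, 30, 5, 1, 5, 9, 5]

subseries = cycle_subseries(series, period=4)
seasonal = smoothed_subcycles(series, period=4)
remainder = [v - s for v, s in zip(series, seasonal)]
largest = max(range(len(remainder)), key=lambda i: remainder[i])
```

In this toy series the spike dominates the remainder, but it also drags the fit for its own phase upward, so the two neighbouring points in that phase get negative remainders.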
21. Anomaly Detection (contd.)
• Impact of removing the seasonal and trend components
q Transforms our multi-modal data into unimodal data
§ Amenable to ESD/MAD!
The original multi-modal raw data has a much larger sigma, leading ESD to miss many of the outliers.
The decomposed residual becomes unimodal, which significantly shrinks sigma.
23. Anomaly Detection (contd.)
• Marrying Robust Statistics with Seasonal Decomposition
Median is Free from Distortion
24. Anomaly Detection (contd.)
• Applying ESD on the Residual
Decomposition Exposes Anomalies
25. Anomaly Detection (contd.)
• Recap
q Extract the seasonal component using STL
§ Filters out periodic spikes
q Residual = Raw - Seasonal_raw - Median_raw
q Run ESD on residual (using median and MAD)
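The recap can be sketched end to end in plain Python under some loudly stated simplifications. This is not the production implementation: the seasonal component is estimated with per-phase medians as a stand-in for STL, and the t-distribution quantile uses a Cornish-Fisher approximation so that only the standard library is needed. All function names, the `period` parameter, and the synthetic data are illustrative assumptions.

```python
from statistics import NormalDist, median

MAD_SCALE = 1.4826  # K for normally distributed data

def t_quantile(p, df):
    """Approximate Student-t quantile (Cornish-Fisher expansion around the
    normal quantile). Assumption: adequate for moderate df; an exact routine
    such as scipy.stats.t.ppf is preferable in practice."""
    z = NormalDist().inv_cdf(p)
    g1 = (z ** 3 + z) / 4
    g2 = (5 * z ** 5 + 16 * z ** 3 + 3 * z) / 96
    return z + g1 / df + g2 / df ** 2

def seasonal_component(values, period):
    """Stand-in for STL: per-phase median, centered on the overall median."""
    overall = median(values)
    phase_medians = [median(values[p::period]) for p in range(period)]
    return [phase_medians[i % period] - overall for i in range(len(values))]

def esd_robust(residual, max_anomalies, alpha=0.05):
    """Generalized ESD using median and MAD instead of mean and standard deviation.
    Returns the sorted original indices of the detected anomalies."""
    n = len(residual)
    max_anomalies = min(max_anomalies, n - 2)  # keep degrees of freedom >= 1
    remaining = list(enumerate(residual))
    removed, n_anoms = [], 0
    for i in range(1, max_anomalies + 1):
        vals = [v for _, v in remaining]
        med = median(vals)
        scale = MAD_SCALE * median([abs(v - med) for v in vals])
        if scale == 0:
            break
        # Most extreme remaining point and its test statistic R_i.
        pos, (idx, val) = max(enumerate(remaining), key=lambda e: abs(e[1][1] - med))
        r_i = abs(val - med) / scale
        remaining.pop(pos)
        removed.append(idx)
        # Critical value lambda_i, recalculated every iteration.
        df = n - i - 1
        t = t_quantile(1 - alpha / (2 * (n - i + 1)), df)
        lam = (n - i) * t / ((df + t * t) * (n - i + 1)) ** 0.5
        if r_i > lam:
            n_anoms = i  # largest i with R_i > lambda_i
    return sorted(removed[:n_anoms])

# Synthetic series: period-4 pattern, small deterministic noise, two injected anomalies.
pattern = [10, 20, 30, 20]
series = [pattern[i % 4] + ((i * 7) % 11 - 5) / 10 for i in range(24)]
series[9] += 50
series[18] -= 40

seasonal = seasonal_component(series, period=4)
overall_median = median(series)
residual = [v - s - overall_median for v, s in zip(series, seasonal)]
anomalies = esd_robust(residual, max_anomalies=2)
```

Because the median replaces the trend and the MAD replaces the standard deviation, the two injected points barely affect the centre and scale against which they are tested.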
27. Anomaly Detection (contd.)
• Applications
q Three perspectives
§ Capacity
o CPU utilization
o Garbage collection
o Network activity
§ User behavior
o Events
• Impressions
• Link clicks
o Spam
§ Forecasting
28. Anomaly Detection (contd.)
• Deployed in production
q Used by a large number of services at Twitter
q Automatic e-mail notification
§ Only sent if anomalies are present
§ Anomalies annotated
§ CSV with anomaly locations attached
29. Open Sourcing
• Skyline from Etsy
q https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py
• Coming soon!
q R package
30. Join the Flock
Like problem solving? Like challenges?
Be at the cutting edge
Make an impact
• We are hiring!!
q https://twitter.com/JoinTheFlock
q https://twitter.com/jobs
q Contact us: @arun_kejariwal