This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams keep an eye on only a small fraction of the metrics they collect, because extracting signal from the noise in this haystack of data is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian-based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez - Metafor Software - LISA 2014
1. Five Things I Learned While Building
Anomaly Detection Tools
(Or: 5 things that bit me in the …)
Toufic Boubez, Ph.D.
Founder, CTO
Metafor Software
toufic@metaforsoftware.com
2. 2
Preamble
• IANA Data Scientist! I’m just an engineer that needed to get stuff done!
• I learned (!) many more things, but they cannot be mentioned!
– Because lawyers
– But ask me later
• I usually beat up on parametric, Gaussian, supervised techniques
– This talk is not an exception,
– But more of a “lessons learned” message
• Note: all data real
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
3. 3
Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped
• CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation,
WS-Federation, WS-Policy
• Building large scale software systems for >20
years (I’m older than I look, I know!)
4. 4
Why Anomaly Detection?
• Watching screens on the “Wall of Charts”
cannot scale!
– Leads to alert fatigue
• Need to automate detection of anomalous
behaviors
• Anomaly detection is the search for items or
events which do not conform to an expected
pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly
detection: A survey". ACM Computing Surveys 41 (3): 1]
9. 9
Normal distributions are really useful
• I can make powerful predictions because of
the statistical properties of the data
• I can easily compare different metrics since
they have similar statistical properties
• There is a HUGE body of statistical work on
parametric techniques for normally
distributed data
10. Normally distributed vs Not
Normal distributions
• Most naturally occurring
processes
• Population height, IQ
distributions (present
company excepted of
course)
• Widget sizes, weights in
manufacturing
• …
Not
• Your metrics!
11. 11
Why is that important?
• Most analytics tools are based on two
assumptions:
1. Parametric techniques: Data is normally
distributed with a useful and usable mean
and standard deviation
2. Supervised Learning techniques: Data is
probabilistically “stationary”
12. 12
Example: Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything
else is considered an outlier
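A minimal sketch of the three-sigma rule as stated above, using only the Python standard library. The function name `three_sigma_outliers` and the sample values are illustrative assumptions, not taken from the talk, and the rule is only meaningful if the data really is close to normally distributed.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant series has no spread, so nothing can be an outlier.
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mean) > k * std]

# Illustrative values only (not data from the talk).
samples = [10.0] * 10 + [11.0] * 10 + [50.0]
outliers = three_sigma_outliers(samples)
print(outliers)
```

Note that the outlier itself inflates both the mean and the standard deviation used to judge it, which is one reason this rule behaves poorly on small or skewed samples.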
13. 13
Aaahhhh
• The mysterious red lines explained
[Chart: metric plotted with red lines marking the mean and ±3σ]
21. Thing 2: Yesterday’s anomaly is today’s normal
22. 22
Why is that important?
• Most analytics tools are based on two
assumptions:
1. Parametric techniques: Data is normally
distributed with a useful and usable mean
and standard deviation
2. Supervised Learning techniques: Data is
probabilistically “stationary”
26. 26
Meanwhile, in our real world
• Stationarity is not a realistic assumption in the
large complex systems with which we’re
dealing
• “Concept Drift” is very common
– http://en.wikipedia.org/wiki/Concept_drift
“ … the statistical properties of the target variable, which
the model is trying to predict, change over time in
unforeseen ways. This causes problems because the
predictions become less accurate as time passes.”
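To make the effect of concept drift concrete, here is a rough sketch comparing a baseline learned once with a baseline recomputed over a sliding window. The function names, window sizes, and the synthetic series are assumptions for illustration only; this is not the method used in the talk.

```python
import statistics
from collections import deque

def fixed_baseline_flags(values, train_size, k=3.0):
    """Flag points against a mean/std learned once from the first train_size points."""
    train = values[:train_size]
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    return [abs(v - mean) > k * std for v in values[train_size:]]

def rolling_baseline_flags(values, window, k=3.0):
    """Flag points against a mean/std recomputed over the previous `window` points."""
    recent = deque(values[:window], maxlen=window)
    flags = []
    for v in values[window:]:
        mean = statistics.fmean(recent)
        std = statistics.pstdev(recent)
        flags.append(abs(v - mean) > k * std)
        recent.append(v)  # the baseline keeps moving, anomalous point or not
    return flags

# Synthetic series: a noisy level around 10 that shifts to around 20.
before = [10.0, 10.5, 9.5, 10.2, 9.8] * 4   # 20 points
after = [20.0, 20.5, 19.5, 20.2, 19.8] * 4  # 20 points
series = before + after

fixed = fixed_baseline_flags(series, train_size=20)
rolling = rolling_baseline_flags(series, window=10)
print(sum(fixed), "of", len(fixed), "points flagged with the fixed baseline")
print(sum(rolling), "of", len(rolling), "points flagged with the rolling baseline")
```

With the fixed baseline every point after the shift is flagged, while the rolling baseline flags only the start of the shift and then treats the new level as normal. The trade-off is that a rolling baseline will also absorb a genuine problem if it persists long enough.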
28. 28
Supervised learning
• In ML, Supervised Learning is the general set of
techniques for inferring a model from a set of
observations:
– Observations in a Training Set are labelled with the
desired outcomes (e.g. “normal vs. anomalous”,
“normal vs. fraudulent”, “red/green/yellow”, etc)
– As observations are fed into the learning system, it
learns to differentiate by inferring a model based on
these labels
– Once sufficiently “trained”, the system is used in
production on “real” unlabelled data and can label the
new data based on the inferred model
30. This is your new normal: all red all the time
31. 31
Mean Shift and Breakout Detection
• https://blog.twitter.com/2014/breakout-detection-in-the-wild
32. Thing 3: Saying Kolmogorov-Smirnov is a great way to impress everyone
33. 33
Why is that important?
• Seriously!?
• Ok, actually non-parametric techniques that
make no assumptions about normality or any
other probability distribution are crucial in
your effort to understand what’s going on in
your systems
34. 34
The Kolmogorov-Smirnov test
• Non-parametric test
– Compare two probability
distributions
– Makes no assumptions (e.g.
Gaussian) about the
distributions of the samples
– Measures maximum
distance between
cumulative distributions
– Can be used to compare
periodic/seasonal metric
periods (e.g. day-to-day or
week-to-week)
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
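The "maximum distance between cumulative distributions" can be sketched in a few lines of standard-library Python. The function name `ks_statistic` and the sample windows are assumptions for illustration; this computes only the two-sample statistic, not a p-value (SciPy's `scipy.stats.ks_2samp` provides both if SciPy is available).

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # always occurs at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical windows of the same metric from two different days.
day_1 = [1, 2, 3, 4]
day_2 = [3, 4, 5, 6]
print(ks_statistic(day_1, day_1))  # identical windows
print(ks_statistic(day_1, day_2))  # overlapping windows
```

A value near 0 means the two windows look alike; a value near 1 means they barely overlap. What counts as "too different" for alerting is a threshold you would have to choose for your own data.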
49. 49
Use domain knowledge!
• Domain knowledge is NOT a bad thing!
– There is no algorithm that will work on everything
– Know your data and its general patterns
• Periodicity/Seasonality
• Known events (maintenance, backups, etc)
– Apply the appropriate algorithms, taking into
account enough scope for any inherent periodicity
to appear
– Customize your alerts to take into account known
events
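One simple way to take known events into account is to suppress alerts that fall inside a known window. The sketch below assumes a hypothetical nightly backup between 02:00 and 03:30; the names `KNOWN_EVENT_WINDOWS`, `in_known_event`, and `should_alert` are illustrative and not from the talk.

```python
from datetime import datetime, time

# Hypothetical known events: a nightly backup window (start, end), local time.
KNOWN_EVENT_WINDOWS = [(time(2, 0), time(3, 30))]

def in_known_event(ts, windows=KNOWN_EVENT_WINDOWS):
    """True if the timestamp falls inside any known event window."""
    t = ts.time()
    return any(start <= t < end for start, end in windows)

def should_alert(ts, is_anomalous, windows=KNOWN_EVENT_WINDOWS):
    """Alert only on anomalies that are not explained by a known event."""
    return is_anomalous and not in_known_event(ts, windows)

print(should_alert(datetime(2014, 11, 14, 2, 15), True))   # during the backup
print(should_alert(datetime(2014, 11, 14, 14, 0), True))   # mid-afternoon
```

This version does not handle windows that cross midnight; that case would need a separate check.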
51. 51
Why is that important?
• Some data channels are inherently non-chatty:
– We don’t have the luxury of always generating
non-zero values
– There is a lot of useful information in the fact that
nothing is happening on a particular channel
• A lot of time series analytics techniques fail on
time series with too few values (e.g. RF,
adjusted box plot, etc)
56. 56
Don’t be an analytics snob
• Sparse data is VERY hard to analyze using
typical analytics techniques
• Sparse data conveys VERY important
information
• Sometimes the simplest rules, thresholds,
lookup tables will work
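As an example of how simple such a rule can be, the sketch below flags a sparse channel that has been quiet for longer than expected. The function name `silent_too_long`, the timestamps, and the threshold are assumptions chosen for illustration.

```python
def silent_too_long(event_times, now, max_silence):
    """True if the channel has had no event within max_silence of `now`.

    All arguments use the same time unit (seconds here).
    """
    if not event_times:
        # No events at all counts as silence.
        return True
    return now - max(event_times) > max_silence

# Hypothetical event timestamps (seconds) on a sparse channel.
events = [100, 160, 220]
print(silent_too_long(events, now=300, max_silence=120))  # last event 80s ago
print(silent_too_long(events, now=400, max_silence=120))  # last event 180s ago
```

The inverse rule (alert when a normally silent channel produces any event at all) is just as simple and often just as useful.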
57. 57
Recap
1. Your data is NOT Gaussian
2. Yesterday’s anomaly is today’s normal
3. Kolmogorov-Smirnov is really cool
4. Scope and Context are important
5. No data != No information
58. 58
Questions?
• Shout out to the Metafor Data Science team!
– Fred Zhang
– Iman Makaremi