Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez - Metafor Software - LISA 2014

Five Things I Learned While Building
Anomaly Detection Tools
(Or: 5 things that bit me in the …)
Toufic Boubez, Ph.D.
Founder, CTO
Metafor Software
toufic@metaforsoftware.com

Preamble
• IANA Data Scientist! I’m just an engineer who needed to get stuff done!
• I learned (!) many more things, but they cannot be mentioned!
– Because lawyers
– But ask me later
• I usually beat up on parametric, Gaussian, supervised techniques
– This talk is not an exception,
– But more of a “lessons learned” message
• Note: all data real
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped
• CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation,
WS-Federation, WS-Policy
• Building large scale software systems for >20
years (I’m older than I look, I know!)

Why Anomaly Detection?
• Watching screens on the “Wall of Charts”
cannot scale!
– Leads to alert fatigue
• Need to automate detection of anomalous
behaviors
• Anomaly detection is the search for items or
events which do not conform to an expected
pattern. [Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly
detection: A survey". ACM Computing Surveys 41 (3): 1]

Thing 1:
Your data is NOT Gaussian

Gaussian or Normal distribution
• Bell-shaped distribution
– Has a mean and a standard deviation

This is Normally distributed data

Normal distributions are really useful
• I can make powerful predictions because of
the statistical properties of the data
• I can easily compare different metrics since
they have similar statistical properties
• There is a HUGE body of statistical work on
parametric techniques for normally
distributed data

Normally distributed vs Not
Normal distributions
• Most naturally occurring processes
• Population height, IQ distributions (present company excepted of course)
• Widget sizes, weights in manufacturing
• …
Not
• Your metrics!
Why is that important?
• Most analytics tools are based on two
assumptions:
1. Parametric techniques: Data is normally
distributed with a useful and usable mean
and standard deviation
2. Supervised Learning techniques: Data is
probabilistically “stationary”

Example: Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything
else is considered an outlier
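
A minimal sketch of the rule, using only the Python standard library. The sample data and the function names are made up for illustration; they are not from the talk.

```python
import statistics

def three_sigma_bounds(values):
    """Return (lower, upper) = mean -/+ 3 population standard deviations."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return mean - 3 * sigma, mean + 3 * sigma

def three_sigma_outliers(values):
    """Return the values that fall outside the three-sigma band."""
    lower, upper = three_sigma_bounds(values)
    return [v for v in values if v < lower or v > upper]

# Made-up, heavily skewed sample: mostly 10s plus a handful of large bursts.
skewed = [10] * 95 + [200, 220, 250, 300, 400]
lower, upper = three_sigma_bounds(skewed)
flagged = three_sigma_outliers(skewed)
print(lower, upper, flagged)
```

On this skewed sample the lower bound comes out negative, so it can never fire for a non-negative metric, and the burst at 200 stays inside the band because the bursts themselves inflate the standard deviation.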

Aaahhhh
• The mysterious red lines explained
[Chart: red lines at mean + 3σ, mean, and mean − 3σ]

Doesn’t work because THIS

Histogram – probability distribution

Histogram – probability distribution

Thing 2:
Yesterday’s anomaly is today’s normal

• Most analytics tools are based on two
assumptions:
1. Parametric techniques: Data is normally
distributed with a useful and usable mean
and standard deviation
2. Supervised Learning techniques: Data is
probabilistically “stationary”

Its characteristics are stationary

Meanwhile, in our real world
• Stationarity is not a realistic assumption in the
large complex systems with which we’re
dealing
• “Concept Drift” is very common
– http://en.wikipedia.org/wiki/Concept_drift
“ … the statistical properties of the target variable, which
the model is trying to predict, change over time in
unforeseen ways. This causes problems because the
predictions become less accurate as time passes.”

Supervised learning
• In ML, Supervised Learning is the general set of
techniques for inferring a model from a set of
observations:
– Observations in a Training Set are labelled with the
desired outcomes (e.g. “normal vs. anomalous”,
“normal vs. fraudulent”, “red/green/yellow”, etc)
– As observations are fed into the learning system, it
learns to differentiate by inferring a model based on
these labels
– Once sufficiently “trained”, the system is used in
production on “real” unlabelled data and can label the
new data based on the inferred model

What happens when something changes in your
fundamentals?

This is your new normal: all red all the time
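
A toy illustration of how a model trained on labelled data can end up "all red" once the underlying level shifts. The "model" here is deliberately trivial (a cut-off halfway between the class means), and all values and names are hypothetical.

```python
import statistics

def train_threshold(observations):
    """Learn a cut-off from (value, label) pairs labelled 'normal' or 'anomalous'."""
    normal = [v for v, label in observations if label == "normal"]
    anomalous = [v for v, label in observations if label == "anomalous"]
    # Midpoint between the two class means: a stand-in for any inferred model.
    return (statistics.fmean(normal) + statistics.fmean(anomalous)) / 2

def classify(value, threshold):
    """Label a new, unlabelled observation using the learned cut-off."""
    return "anomalous" if value > threshold else "normal"

# Labelled training set (made-up values).
training = [(v, "normal") for v in (48, 50, 52, 49, 51)]
training += [(v, "anomalous") for v in (95, 100, 105)]
threshold = train_threshold(training)

# Hypothetical change in fundamentals: the metric now settles around 110.
new_normal = [108, 110, 112, 109, 111]
labels = [classify(v, threshold) for v in new_normal]
print(threshold, labels)
```

Every point after the shift is labelled anomalous, even though the system is behaving normally at its new level. The model has not degraded; the data it was trained on no longer describes the system.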

Mean Shift and Breakout Detection
• https://blog.twitter.com/2014/breakout-detection-in-the-wild
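
A simplified sketch of mean-shift detection. This is NOT the algorithm described in the linked post; it is a plain scan that tries every split point and scores the difference in means on either side, weighted by segment sizes. The function name, the `min_size` parameter, and the data are assumptions for illustration.

```python
import math
import statistics

def find_mean_shift(series, min_size=3):
    """Return (split_index, score) for the split with the largest weighted mean difference."""
    n = len(series)
    best_index, best_score = None, 0.0
    for i in range(min_size, n - min_size + 1):
        left, right = series[:i], series[i:]
        diff = abs(statistics.fmean(right) - statistics.fmean(left))
        # Weight by segment sizes so splits near the edges are not favoured.
        score = diff * math.sqrt(len(left) * len(right) / n)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Made-up series whose level jumps from about 10 to about 30 at index 8.
series = [10, 11, 9, 10, 12, 10, 11, 9, 30, 31, 29, 30, 32, 30, 31, 29]
index, score = find_mean_shift(series)
print(index, score)
```

Once a shift like this is found, one option is to re-baseline on the data after the split instead of continuing to alert against the old level.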

Thing 3:
Saying Kolmogorov-Smirnov is a great way to
impress everyone

• Seriously!?
• Ok, actually non-parametric techniques that
make no assumptions about normality or any
other probability distribution are crucial in
your effort to understand what’s going on in
your systems

The Kolmogorov-Smirnov test
• Non-parametric test
– Compare two probability distributions
– Makes no assumptions (e.g. Gaussian) about the distributions of the samples
– Measures maximum distance between cumulative distributions
– Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
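
A minimal sketch of the two-sample KS statistic in plain Python: build the empirical cumulative distribution of each sample and take the largest gap between them. The function name and sample values are illustrative. SciPy's `scipy.stats.ks_2samp` computes the same statistic along with a p-value, if SciPy is available.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

same = ks_statistic([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
overlapping = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(same, overlapping, disjoint)
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap at all), and nothing in the computation assumes a particular distribution shape.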

Cumulative distribution for those windows

Data from dissimilar windows

Cumulative distribution for those windows

Sliding window of KS scores
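
One way to produce a series of KS scores like this, sketched under assumptions: each window is compared with the window exactly one period (e.g. one day or one week) earlier. The `window`, `period`, and `step` parameters and the data are hypothetical, not taken from the talk.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def sliding_ks(series, window, period, step=1):
    """Return (start_index, ks_score) for each window vs. the window one period earlier."""
    scores = []
    for start in range(period, len(series) - window + 1, step):
        current = series[start:start + window]
        reference = series[start - period:start - period + window]
        scores.append((start, ks_statistic(current, reference)))
    return scores

# Made-up data: three identical "days" followed by one day shifted upwards.
day = [1, 2, 3, 4, 5, 6]
series = day * 3 + [v + 10 for v in day]
scores = sliding_ks(series, window=6, period=6, step=6)
print(scores)
```

The score stays at 0 while each day resembles the previous one and jumps to 1 for the shifted day. An alert can then be defined on the KS score rather than on the raw metric.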

Thing 4:
Take Scope and Context into account!

Some data – is that normal?

Is every weekend an anomaly?

Would this be more accurate?

Use domain knowledge!
• Domain knowledge is NOT a bad thing!
– There is no algorithm that will work on everything
– Know your data and its general patterns
• Periodicity/Seasonality
• Known events (maintenance, backups, etc)
– Apply the appropriate algorithms, taking into
account enough scope for any inherent periodicity
to appear
– Customize your alerts to take into account known
events
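
A rough sketch of both ideas, assuming a weekly pattern: each value is compared with the median of the same slot in previous weeks rather than with a global baseline, and slots covered by known events are skipped. The function name, the fixed `tolerance`, and the data are illustrative assumptions.

```python
import statistics

def seasonal_anomalies(history, current, tolerance, known_events=frozenset()):
    """Return the slots in `current` that deviate from the same slot in past periods.

    history: list of past periods, each a list of values indexed by slot.
    current: values for the period being checked, indexed by slot.
    known_events: slots to ignore (maintenance, backups, etc).
    """
    anomalies = []
    for slot, value in enumerate(current):
        if slot in known_events:
            continue  # a deviation here is expected, so do not alert
        baseline = statistics.median([period[slot] for period in history])
        if abs(value - baseline) > tolerance:
            anomalies.append(slot)
    return anomalies

# Made-up daily values, Monday to Sunday: busy weekdays, quiet weekends.
history = [
    [100, 102, 98, 101, 99, 20, 22],
    [101, 99, 100, 100, 102, 21, 19],
    [99, 100, 101, 98, 100, 19, 21],
]
this_week = [100, 101, 40, 99, 100, 20, 21]  # Wednesday drops to 40

found = seasonal_anomalies(history, this_week, tolerance=15)
suppressed = seasonal_anomalies(history, this_week, tolerance=15, known_events={2})
print(found, suppressed)
```

The quiet weekend is not flagged, because it is compared with earlier weekends, while the Wednesday drop is. Marking Wednesday as a known event suppresses that alert.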

Thing 5:
No data != No information

• Some data channels are inherently non-chatty:
– We don’t have the luxury of always generating
non-zero values
– There is a lot of useful information in the fact that
nothing is happening on a particular channel
• A lot of time series analytics techniques fail on
time series with too few values (e.g. RF,
adjusted box plot, etc)

Simple lookup table with priors
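
The slide itself only shows the idea, so the following is one possible reading, sketched under assumptions: the table stores, per time slot, the fraction of past days on which the channel was active, and an observation is surprising when it contradicts a strong prior. The function names, the `high`/`low` cut-offs, and the data are hypothetical.

```python
def build_priors(history):
    """history: list of days, each a list of event counts per slot.

    Returns, per slot, the fraction of days with at least one event.
    """
    days = len(history)
    slots = len(history[0])
    return [sum(1 for day in history if day[slot] > 0) / days for slot in range(slots)]

def is_surprising(priors, slot, count, high=0.9, low=0.1):
    """Flag silence where activity is expected, or activity where silence is expected."""
    active = count > 0
    if not active and priors[slot] >= high:
        return True
    if active and priors[slot] <= low:
        return True
    return False

# Made-up sparse channel with 6 slots per day: one event in slot 3 every day.
history = [[0, 0, 0, 1, 0, 0] for _ in range(10)]
priors = build_priors(history)

missing_event = is_surprising(priors, slot=3, count=0)      # expected event did not happen
usual_event = is_surprising(priors, slot=3, count=1)        # business as usual
usual_silence = is_surprising(priors, slot=4, count=0)      # silence is normal here
unexpected_event = is_surprising(priors, slot=4, count=2)   # activity where there is never any
print(priors, missing_event, usual_event, usual_silence, unexpected_event)
```

With a table like this, a zero is treated as information: silence in a slot that is normally active is as reportable as activity in a slot that is normally quiet.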

Don’t be an analytics snob
• Sparse data is VERY hard to analyze using
typical analytics techniques
• Sparse data conveys VERY important
information
• Sometimes the simplest rules, thresholds,
lookup tables will work

Recap
1. Your data is NOT Gaussian
2. Yesterday’s anomaly is today’s normal
3. Kolmogorov-Smirnov is really cool
4. Scope and Context are important
5. No data != No information

Questions?
• Shout out to the Metafor Data Science team!
– Fred Zhang
– Iman Makaremi
