Data centre analytics toufic boubez-metafor-dev ops days vancouver-2013-10-25

Anomaly Detection Analytics for the
Data Centre
devopsdays Vancouver
25 October 2013
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software

Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped 

• Co-Founder/CTO Saffron Technology
• Chief Architect IBM (SOA)
• Building large scale software systems for 20
years (I’m older than I look, I know!)

Why this talk?
• April: devopsdays Austin: Open Space talk
– Blog: http://metaforsoftware.com/beyond-the-prettycharts-a-report-from-devopsdays-in-austin/

• June: devopsdays Silicon Valley presentation:
– Five major lessons learned

• Explore issues mentioned in June

Note: real data
Note: no labels on charts – on purpose!!
Note to self: remember to SLOW DOWN!
Note to self: mention the cats!! Everybody loves cats!!


The Wall of Charts side-effects
Alert Overload

Metrics Overload

“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”
- John Vincent, Monitorama, March 2013


Need mo’ better alerting
– So what if my unicorn usage is at 89-91%, and has been stable?
– I’d much rather know if it’s at 60% and has been rapidly increasing

– Static thresholds and rules won’t help you in this case
– Need some intelligent Anomaly Detection mechanism


Anomaly Detection for DevOps
• Anomaly detection (also known as outlier
detection) is the search for items or events
which do not conform to an expected pattern.
[Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A
survey". ACM Computing Surveys 41 (3): 1]

• For devops: Need to know when one or more
of our metrics is going wonky


#monitoringsucks vs #iheartmonitoring
• Proper monitoring tools should give us all the
information we need to be PROACTIVE
– But they don’t

• Current monitoring tools assume that the
underlying system is relatively static
– Surround it with static thresholds and rules.
– Good for detecting catastrophic events but not
much else
– BUT WHY!!??

“Traditional” analytics …
• Roots in manufacturing process QC


… are based on Gaussian distributions
• Make assumptions about probability
distributions and process behaviour
– Usually assume data is normally distributed with
a useful and usable mean and standard deviation

• Blah blah blah what does it mean?


Distribution Schmistribution


Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations
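
For Gaussian data these percentages follow from the error function: the fraction of values within k standard deviations of the mean is erf(k/√2). A minimal check using only the Python standard library (the function name fraction_within is made up for this illustration):

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.2%}")
```

The numbers only hold if the data really is normally distributed; for other distributions the same thresholds can cover a very different share of the values.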


Aaahhhh
• The mysterious red lines explained


Moving Averages for detecting outliers
• Big idea:
– Based on past values, predict most likely next value
– Alert if actual value “significantly” deviates from
predicted value

• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N

– Each value in the window contributes equally to
prediction
– Idea is that your next value should not significantly
deviate from the general trend of your data
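
The big idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's implementation: the name sma_anomalies, the window size, and the choice of "significantly" as k standard deviations of the window are all made up here.

```python
from collections import deque
from statistics import mean, stdev


def sma_anomalies(values, window=5, k=3.0):
    """Flag values that deviate from the simple moving average of the
    previous `window` values by more than k window standard deviations."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) == window:
            predicted = mean(history)   # every value in the window counts equally
            spread = stdev(history)
            if abs(x - predicted) > k * spread:
                flagged.append((i, x, predicted))
        history.append(x)               # the outlier itself enters the next window
    return flagged


# Made-up metric with a single spike at index 7.
series = [10, 11, 10, 9, 10, 11, 10, 30, 10, 11]
print(sma_anomalies(series))
```

Note that the spike inflates the mean and spread of the following windows, which is one reason a plain moving average can miss anomalies that come right after another one.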

Weighted Moving Average
• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically)
decreasing weights to every value in the window
– Older values contribute less to the prediction

• Neither SMA nor WMA deals well with
periodicity in your data
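
A minimal sketch of a linearly weighted moving average, assuming the window is ordered oldest to newest and that the weights run 1, 2, ..., N (the function name and the weight scheme are illustrative choices, not taken from the talk):

```python
def weighted_moving_average(window):
    """Linearly weighted average: the oldest value gets weight 1, the newest weight N."""
    n = len(window)
    weights = range(1, n + 1)
    total_weight = n * (n + 1) / 2      # 1 + 2 + ... + N
    return sum(w * x for w, x in zip(weights, window)) / total_weight


recent_jump = [0, 0, 10]                # made-up window with a jump in the newest value
print(weighted_moving_average(recent_jump))
print(sum(recent_jump) / len(recent_jump))  # plain SMA of the same window, for comparison
```

Because the newest value carries the largest weight, the weighted average reacts to the jump faster than the simple average does.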


Exponential Smoothing
• Exponential Smoothing
– Similar to weighted average, but with weights that decay
exponentially over the whole set of historic samples
• S[t]=αX[t-1] + (1-α)S[t-1]

– Is almost as bad as moving averages in dealing with
periodicity and trending time series!!

• DES: Holt-Winters
– In addition to data smoothing factor (α), introduces a
trend smoothing factor (β)
– Better at dealing with periodicity and trending

• ALL assume Gaussian!
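
The two smoothing schemes can be sketched as follows. This is a hedged illustration: the function names, the seeding of the first values, and the default α and β are assumptions, and the double-smoothing sketch tracks only a level and a trend (a seasonal term, which full Holt-Winters adds for periodic data, is left out).

```python
def exponential_smoothing(values, alpha=0.5):
    """Predictions S[t] = alpha*X[t-1] + (1-alpha)*S[t-1], seeded with S[0] = X[0]."""
    predictions = [values[0]]
    for t in range(1, len(values)):
        predictions.append(alpha * values[t - 1] + (1 - alpha) * predictions[t - 1])
    return predictions


def double_exponential_smoothing(values, alpha=0.5, beta=0.5):
    """One-step-ahead forecasts from a smoothed level plus a smoothed trend.

    Needs at least two values; the first forecast is seeded with values[0].
    """
    level, trend = values[0], values[1] - values[0]
    forecasts = [values[0]]
    for x in values[1:]:
        forecasts.append(level + trend)  # forecast made before seeing x
        new_level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


ramp = [0, 2, 4, 6, 8]                   # made-up, steadily increasing metric
print(exponential_smoothing(ramp))
print(double_exponential_smoothing(ramp))
```

On the ramp, the single-smoothing predictions lag further and further behind the data, while the level-plus-trend forecasts follow it, which is the trend-handling improvement the β factor buys.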

Gaussian distributions are powerful because:
• Far far in the future, in a galaxy far far away:
– I can make the same predictions because the
statistical properties of the data haven’t changed
– I can compare different metrics since they have
similar statistical properties

• BUT…
• Cue in DRAMATIC MUSIC

Another common distribution


Let’s look at an example


Histogram – probability distribution


Are we doomed?
• There’s A LOT you can do with the data, other
than just looking at it and putting thresholds!
– Adaptive Mixture of Gaussians
– Non-parametric techniques
(http://www.metaforsoftware.com/everythingyou-should-know-about-anomaly-detectionknow-your-data-parametric-or-non-parametric/)
– Spectral analysis
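
As one illustration of a non-parametric technique, the two-sample Kolmogorov-Smirnov statistic compares the empirical distributions of two windows of a metric without assuming any particular distribution. The sketch below is an assumption about what such a check could look like, not the method used in the talk; ks_statistic and the sample windows are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples.

    0 means the empirical distributions coincide, 1 means they do not overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:                       # the largest gap occurs at a data point
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


last_week = [1, 2, 3, 4]                  # made-up reference window
today = [3, 4, 5, 6]                      # made-up window under test
print(ks_statistic(last_week, today))
```

A large statistic suggests the two windows were drawn from different distributions; how large is "large" depends on the window sizes and would need a significance threshold that this sketch does not compute.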


We’re not doomed, but: Know your data!!
• You need to understand the statistical
properties of your data, and where it comes
from, in order to determine what kind of
analytics to use.
• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques
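
One quick, rough way to start getting to know a metric is to measure how lopsided its distribution is. The sketch below computes the sample skewness, which is close to 0 for symmetric, Gaussian-like data; the function name and the example series are made up, and a proper normality test would be the next step.

```python
from statistics import mean, pstdev


def skewness(values):
    """Third standardized moment: near 0 for symmetric data, large for long tails.

    Assumes the values are not all identical (the standard deviation must be non-zero).
    """
    m, s = mean(values), pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)


symmetric = [1, 2, 3, 4, 5]                   # made-up, evenly spread metric
long_tailed = [1, 1, 1, 2, 2, 2, 3, 3, 50]    # made-up metric with one huge value
print(skewness(symmetric))
print(skewness(long_tailed))
```

A strongly skewed metric is a hint that mean-and-standard-deviation thresholds will misbehave on it.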

Pet Peeve #1: How much data do we need?
• Trend towards higher and higher sampling
rates in data collection
• Reminds me of Jorge Luis Borges’ story about
Funes the Memorious
– Perfect recollection of the slightest details of every
instant of his life, but lost the ability for
abstraction

• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract

The danger of over-abstraction

+
= comfortable?

So, how much data DO you need?
• You don’t need more resolution than twice
your highest frequency (Nyquist-Shannon
sampling theorem)
• Most of the algorithms for analytics will
smooth, average, filter, and pre-process the
data.
• Watch out for correlated metrics (e.g. used vs.
available memory)
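
Two of these points lend themselves to a small sketch: the sampling rate implied by the Nyquist-Shannon theorem, and a Pearson correlation check for spotting metrics that carry the same information. The function names, the 16 GB total, and the sample values are made up for illustration.

```python
from math import sqrt


def min_sampling_rate(highest_frequency_hz):
    """Nyquist rate: twice the highest frequency present in the signal."""
    return 2 * highest_frequency_hz


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither series is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# A pattern that repeats at most once a minute (1/60 Hz) needs
# roughly one sample every 30 seconds, not one every second.
print(1 / min_sampling_rate(1 / 60))

total_gb = 16.0                               # hypothetical machine
used = [4.0, 5.5, 7.0, 6.0, 9.5]
available = [total_gb - u for u in used]
print(pearson(used, available))               # very close to -1: the two metrics are redundant
```

When two metrics are this strongly correlated, analysing both adds cost without adding information.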

Think: Is all data important to collect?
• Two camps:
– Data is data, let’s collect and analyze everything and
figure out the trends.
– Not all data is important, so let’s figure out what’s
important first and understand the underlying model
so we don’t waste resources on the rest.

• Similar to the very public bun fight between
Noam Chomsky and Peter Norvig
– http://norvig.com/chomsky.html

• Unresolved as far as I know 

More?
• Only scratched the surface
• I want to talk more about analytics, in more
depth, but time’s up!!
– (Actually Jenny won’t let me)

• Come talk to me during the breaks!
• Thank you!

