Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastructure.

Beyond Pretty Charts
Analytics for the Cloud Infrastructure
Velocity Europe 2013
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
toufic@metaforsoftware.com
@tboubez

Toufic intro – who I am
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped 

• Co-Founder/CTO Saffron Technology
• IBM Chief Architect for SOA
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation, WS-Federation, WS-Policy
• Building large scale software systems for 20 years
(I’m older than I look, I know!)

Genesis of this talk
• Evolving from various conference presentations
– Blog: http://www.metaforsoftware.com/category/anomaly-detection-101/
– Many briefly mentioned issues, never explored
– Needed more details and examples
• Note: real data
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!

The WoC side-effects: alert fatigue
“Alert fatigue is the single biggest problem we have right now … We need to be more intelligent about our alerts or we’ll all go insane.”
- John Vincent (@lusis) (#monitoringsucks)

The fallacy of thresholds
• So what if my unicorn usage is at 89-91%, and has been stable?
• I’d much rather know if it’s at 60% and has been rapidly increasing

• Static thresholds and rules won’t help you in this case
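
A minimal sketch of the point above, in Python (not from the talk). A static threshold fires on the stable 89-91% series and stays silent on the series that is climbing fast, while a crude trend check does the opposite. The function names, the 90% threshold, and the slope limit are illustrative assumptions.

```python
def static_alert(values, threshold=90.0):
    """Fire when the latest sample is above a fixed threshold (assumed 90%)."""
    return values[-1] > threshold


def trend_alert(values, max_slope=2.0):
    """Fire when the average change per sample exceeds max_slope.

    max_slope is a hypothetical tuning knob, in percentage points per sample.
    """
    slope = (values[-1] - values[0]) / (len(values) - 1)
    return slope > max_slope


stable = [90, 89, 91, 90, 89, 91]      # hovering around 90%, nothing is wrong
climbing = [40, 44, 48, 52, 56, 60]    # only 60%, but rising fast

for name, series in (("stable", stable), ("climbing", climbing)):
    print(name, "static:", static_alert(series), "trend:", trend_alert(series))
```

The trend check here is deliberately naive (first and last sample only); it is only meant to show why the rate of change can matter more than the level.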

Work smarter not harder
• We don’t need more metrics
• We don’t need more thresholds and rules
• We DO need better, smarter tools

TO THE RESCUE: Anomaly Detection!!
• Anomaly detection (also known as outlier
detection) is the search for items or events
which do not conform to an expected pattern.
[Chandola, V.; Banerjee, A.; Kumar, V. (2009). "Anomaly detection: A
survey". ACM Computing Surveys 41 (3): 1]

• For devops: Need to know when one or more
of our metrics is going wonky

#monitoringsucks vs #i♥monitoring
• Proper monitoring tools should give us all the
information we need to be PROACTIVE
– But they don’t

• Current monitoring tools assume that the
underlying system is relatively static
– Surround it with static thresholds and rules.
– Good for detecting catastrophic events but not
much else
– WHY!!??

“Traditional” analytics …
• Roots in manufacturing process QC

… are based on Gaussian distributions
• Make assumptions about probability
distributions and process behaviour
– Usually assume data is normally distributed
with a useful and usable mean and standard
deviation

Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything
else is an outlier
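
As a hedged illustration (Python, standard library only), the rule can be sketched as below. The sample values are made up, not the real data behind the charts, and the function names are my own. `coverage(k)` gives the fraction of a Gaussian that lies within k standard deviations of the mean, which is where the 68 / 95 / 99.73 figures come from.

```python
import math
import statistics


def coverage(k):
    """Fraction of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))


def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs further than k std deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # a perfectly flat series has no outliers by this rule
    return [(i, x) for i, x in enumerate(values) if abs(x - mean) > k * std]


# Illustrative samples: 19 well-behaved values and one spike.
samples = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
           11, 9, 10, 10, 12, 9, 10, 11, 10, 50]

print([round(coverage(k), 4) for k in (1, 2, 3)])
print(three_sigma_outliers(samples))
```

Note that the spike itself drags the mean up and inflates the standard deviation, so the rule only works here because the spike is very large.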

Aaahhhh
• The mysterious red lines explained

The four horsemen
• Four horsemen of the modelpocalypse™ 
[Abe Stanway & Jon Cowie http://www.slideshare.net/jonlives/bring-the-noise]

– Seasonality
– Spike influence
– Normality
– Parameters

Moving Averages for detecting outliers
• Moving Averages “Big idea”:
– At any point in time in a well-behaved time series,
your next value should not significantly deviate
from the general trend of your data
– Mean as a predictor is too static, relies on too
much past data (ALL of the data!)
– Instead of overall mean use a finite window of
past values, predict most likely next value
– Alert if actual value “significantly” (3 sigmas?)
deviates from predicted value
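
The big idea above can be sketched roughly as follows (Python, standard library only). This is an assumption-laden sketch, not the code behind the charts: the window size, the use of the window's own standard deviation as "sigma", and the function name are all illustrative choices.

```python
import statistics
from collections import deque


def moving_average_alerts(series, window=10, k=3.0):
    """Return indices whose value deviates more than k sigmas from the
    mean of the previous `window` values."""
    history = deque(maxlen=window)  # finite window of past values
    alerts = []
    for t, x in enumerate(series):
        if len(history) == window:
            predicted = statistics.fmean(history)   # most likely next value
            sigma = statistics.pstdev(history)
            if sigma > 0 and abs(x - predicted) > k * sigma:
                alerts.append(t)
        history.append(x)  # the spike enters the window too
    return alerts


# Illustrative series with one spike at index 10.
series = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 30, 10, 11]
print(moving_average_alerts(series))
```

Once the spike is inside the window it inflates both the predicted value and sigma for the next `window` samples, which is the "spike influence" horseman in action.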

Simple and Weighted Moving Averages
• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N

– Each value in the window contributes equally to
prediction
– …INCLUDING spikes and outliers

• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically)
decreasing weights to every value in the window
– Older values contribute less to the prediction
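
A small Python sketch of both averages, under the assumption that the WMA weights run linearly from 1 (oldest value in the window) to N (newest). The function names and the sample series are illustrative.

```python
def simple_moving_average(x, n):
    """Mean of the last n values; every value counts equally."""
    window = x[-n:]
    return sum(window) / len(window)


def weighted_moving_average(x, n):
    """Linearly weighted mean of the last n values.

    Assumed weighting: oldest value gets weight 1, newest gets weight n.
    """
    window = x[-n:]
    weights = range(1, len(window) + 1)
    return sum(w * v for w, v in zip(weights, window)) / sum(weights)


values = [1, 2, 3, 4, 10]   # the last value is a spike
print(simple_moving_average(values, 3))    # (3 + 4 + 10) / 3
print(weighted_moving_average(values, 3))  # (1*3 + 2*4 + 3*10) / 6
```

Because the newest value carries the largest weight, the WMA reacts to the spike faster than the SMA, and also forgets it faster once it ages in the window.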

Exponential Smoothing
• Exponential Smoothing
– Similar to weighted average, but with weights that decay exponentially
over the whole set of historic samples
• S[t] = αX[t-1] + (1-α)S[t-1]

– Does not deal with trends in data

• Double Exponential Smoothing (DES)
– In addition to data smoothing factor (α), introduces a trend smoothing
factor (β)
– Better at dealing with trending
– Does not deal with seasonality in data

• Triple Exponential Smoothing (TES), Holt-Winters
– Introduces additional seasonality factor
– … and so on

• ALL assume Gaussian!

Gaussian distributions are powerful because:
• Far far in the future, in a galaxy far far away:
– I can make the same predictions because the
statistical properties of the data haven’t changed
– I can easily compare different metrics since they
have similar statistical properties

• BUT…
• Cue in DRAMATIC MUSIC

Another common distribution

Let’s look at an example

Histogram – probability distribution

Are we doomed?
• No!
• There are lots of other non-Gaussian based
techniques:
– Adaptive Mixture of Gaussians
– Non-parametric techniques
(http://www.metaforsoftware.com/everything-you-should-know-about-anomaly-detection-know-your-data-parametric-or-non-parametric/)
– Spectral analysis

Kolmogorov-Smirnov test
• Non-parametric test
– Compare two probability distributions
– Makes no assumptions (e.g. Gaussian) about the distributions of the samples
– Measures maximum distance between cumulative distributions
– Can be used to compare periodic/seasonal metric periods (e.g. day-to-day or week-to-week)

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
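
A rough two-sample sketch in Python (standard library only), assuming we compare yesterday's samples of a metric against today's. The data is synthetic, the function names are my own, and the critical value uses the usual large-sample approximation, so treat it as a guide rather than an exact test for small windows.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum distance between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in a + b:  # the distance can only change at observed values
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical(n, m, alpha=0.05):
    """Approximate (large-sample) critical value for sample sizes n and m."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


# Synthetic example: today's memory usage has drifted up by 4 units.
yesterday = [50 + (i % 10) for i in range(100)]
today = [v + 4 for v in yesterday]

d = ks_statistic(yesterday, today)
print(d, ks_critical(len(yesterday), len(today)))
print("distributions differ" if d > ks_critical(100, 100) else "looks the same")
```

Nothing here assumes a Gaussian shape; the test only looks at how far apart the two cumulative distributions get.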

KS test on slow memory leak

We’re not doomed, but: Know your data!!
• You need to understand the statistical
properties of your data, and where it comes
from, in order to determine what kind of
analytics to use.
• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques
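
One cheap way to start getting to know your data is to look at its skewness and excess kurtosis, both of which are close to 0 for Gaussian data. The Python sketch below is only a first sanity check under that assumption, not a formal normality test; the function names and sample values are illustrative.

```python
import statistics


def skewness(values):
    """Third standardized moment; about 0 for symmetric (e.g. Gaussian) data."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m3 = statistics.fmean([(v - mean) ** 3 for v in values])
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; about 0 for Gaussian data."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m4 = statistics.fmean([(v - mean) ** 4 for v in values])
    return 0.0 if m2 == 0 else m4 / m2 ** 2 - 3


symmetric = [1, 2, 3, 4, 5]
long_tail = [1, 1, 1, 1, 10]   # mostly idle with an occasional burst

print(skewness(symmetric), skewness(long_tail))
print(excess_kurtosis(long_tail))
```

A strongly positive skew, like the long-tailed series here, is a hint that mean and standard deviation will not describe the metric well.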

Pet Peeve: How much data do we need?
• Trend towards higher and higher sampling
rates in data collection
• Reminds me of Jorge Luis Borges’ story about
Funes the Memorious
– Perfect recollection of the slightest details of every
instant of his life, but lost the ability for
abstraction

• Our brain works on abstraction
– We notice patterns BECAUSE we can abstract

The danger of over-abstraction

+
= comfortable?

So, how much data DO you need?
• You don’t need more resolution than twice
your highest frequency (Nyquist-Shannon
sampling theorem)
• Most of the algorithms for analytics will
smooth, average, filter, and pre-process the
data.
• Watch out for correlated metrics (e.g. used vs.
available memory)
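
Two of these points lend themselves to a quick Python sketch. The numbers are invented for illustration, and the helper names are my own: the first function turns the shortest cycle you care about into the longest sampling interval that still captures it, and the second shows that used and available memory carry the same information.

```python
import math


def max_sampling_interval(shortest_period_s):
    """Longest sampling interval (seconds) that still resolves a cycle of
    the given period: you need at least two samples per cycle."""
    return shortest_period_s / 2


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# A pattern that repeats every 10 minutes needs a sample every 5 minutes or less.
print(max_sampling_interval(600))

total_gb = 16.0                                # hypothetical host
used = [4.0, 5.5, 6.0, 7.5, 9.0]
available = [total_gb - u for u in used]
print(pearson(used, available))                # essentially -1
```

A correlation that close to -1 (or +1) means the second metric adds nothing new, so alerting on both just doubles the noise.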

Think: Is all data important to collect?
• Two camps:
– Data is data, let’s collect and analyze everything and
figure out the trends.
– Not all data is important, so let’s figure out what’s
important first and understand the underlying model
so we don’t waste resources on the rest.

• Similar to the very public bun fight between
Noam Chomsky and Peter Norvig
– http://norvig.com/chomsky.html

• Unresolved as far as I know 

Shout out to Etsy
• Check out kale for some analytics:
– http://codeascraft.com/2013/06/11/introducing-kale/
– https://github.com/etsy/skyline/blob/master/src/analyzer/algorithms.py

More?
• Only scratched the surface
• I want to talk more about algorithms, analytics,
current issues, etc, in more depth, but time’s up!!
– Go back in time to my Office Hours session, or
– Come talk to me or email me if interested.

• Thank you!
toufic@metaforsoftware.com
@tboubez