1
Some Simple Math to Get Signal
out of your data noise
#VelocityConf
25.06.2014
Toufic Boubez, Ph.D.
Co-Founder, CTO
Metafor Software
toufic@metaforsoftware.com
2
Preamble
• I lied: There is no “simple” math for Anomaly Detection!
• I usually beat up on parametric, Gaussian, supervised techniques
– This talk is to show some alternatives
– Only enough time to cover a couple (four, really) of relatively simple
but very useful techniques
– Oh, and I will actually still start off by beating up on the usual suspects,
but don’t despair, there’s good stuff towards the end
• Note: all real data
• Note: no y-axis labels on charts – on purpose!!
• Note to self: remember to SLOW DOWN!
• Note to self: mention the cats!! Everybody loves cats!!
3
• Co-Founder/CTO Metafor Software
• Co-Founder/CTO Layer 7 Technologies
– Acquired by Computer Associates in 2013
– I escaped 
• Co-Founder/CTO Saffron Technology
• IBM Chief Architect for SOA/Web Services
• Co-Author, Co-Editor: WS-Trust, WS-SecureConversation,
WS-Federation, WS-Policy
• Building large scale software systems for >20
years (I’m older than I look, I know!)
Toufic intro – who I am
4
Wall of Charts™
5
The WoC side-effects: alert fatigue
“Alert fatigue is the single biggest problem we
have right now … We need to be more
intelligent about our alerts or we’ll all go
insane.”
- John Vincent (@lusis)
(#monitoringsucks)
We have forensic tools for
analytics after the fact BUT
we need to KNOW that
something has happened!
We need alerts!
6
Watching screens cannot scale
7
Time to turn things over to the machines!
8
Attempt #1: static thresholds …
• Roots in manufacturing process QC
9
… are based on Gaussian distributions
• Make assumptions about probability
distributions and process behaviour
– Data is normally distributed with a useful and
usable mean and standard deviation
– Data is probabilistically “stationary”
10
Three-Sigma Rule
• Three-sigma rule
– ~68% of the values lie within 1 std deviation of the mean
– ~95% of the values lie within 2 std deviations
– 99.73% of the values lie within 3 std deviations: anything
else is an outlier
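A minimal sketch of the three-sigma rule as a static threshold, assuming the whole history is available up front. The function name and the sample values are hypothetical and are not taken from the talk's data.

```python
from statistics import mean, pstdev

def three_sigma_outliers(values, k=3.0):
    """Return (index, value) pairs lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the full history
    lower, upper = mu - k * sigma, mu + k * sigma
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical metric: steady around 10 with a single spike at the end.
samples = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
           10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 25.0]
print(three_sigma_outliers(samples))
```

Note that the spike itself inflates both the mean and the standard deviation, so the thresholds are only as trustworthy as the Gaussian, stationary assumptions behind them.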
11
Aaahhhh
• The mysterious red lines explained
(chart labels: mean, +3σ, −3σ)
12
Stationary Gaussian distributions are powerful
• Because far far in the future, in a galaxy far far
away:
– I can make the same predictions because the
statistical properties of the data haven’t changed
– I can easily compare different metrics since they
have similar statistical properties
• Let’s do this!!
• BUT…
• Cue in DRAMATIC MUSIC
13
Then THIS happens
14
3-sigma rule alerts
15
Or worse, THIS happens!
16
3-sigma rule alerts
17
WTF!? So what gives!?
• Remember this?
18
Histogram – probability distribution
19
Histogram – probability distribution
20
Attempts #2, #3, etc: mo’ better thresholds
• Static thresholds ineffective on dynamic data
– Thresholds use the (static) mean as predictor and
alert if data falls more than 3 sigma away
• Need “moving” or “adaptive” thresholds:
– Value of mean changes with time to
accommodate new data values, trends, periodicity
21
Moving Averages “big idea”
• At any point in time in a well-behaved time
series, your next value should not significantly
deviate from the general trend of your data
• Mean as a predictor is too static, relies on too
much past data (ALL of the data!)
• Instead of overall mean use a finite window of
past values, predict most likely next value
• Alert if actual value “significantly” (3 sigmas?)
deviates from predicted value
22
Moving Averages typical method
• Generate a “smoothed” version of the time series
– Average over a sliding (moving) window
• Compute the squared error between raw series
and its smoothed version
• Compute a new effective standard deviation
(sigma’) by smoothing the squared error
• Generate a moving threshold:
– Outliers are 3-sigma’ outside the new, smoothed data!
• Ta-da!
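The steps above can be sketched roughly as follows. This is one possible reading, not the talk's own code: it assumes the prediction for each point is the simple average of the previous `window` values, and that sigma' is the square root of the average of the last `window` squared errors. All names are hypothetical.

```python
from collections import deque
from math import sqrt

def moving_threshold_alerts(series, window=4, k=3.0):
    """Return indices whose value deviates more than k * sigma' from the moving average."""
    values = deque(maxlen=window)     # last `window` raw values
    sq_errors = deque(maxlen=window)  # last `window` squared prediction errors
    alerts = []
    for t, x in enumerate(series):
        if len(values) == window:
            predicted = sum(values) / window  # smoothed value used as the prediction
            error = x - predicted
            if len(sq_errors) == window:
                sigma = sqrt(sum(sq_errors) / window)  # smoothed squared error -> sigma'
                # If sigma is 0 (perfectly flat history), any change is flagged.
                if abs(error) > k * sigma:
                    alerts.append(t)
            sq_errors.append(error * error)
        values.append(x)
    return alerts

# Hypothetical series: small oscillation around 10 with one spike at index 12.
series = [10, 11, 10, 9, 10, 11, 10, 9, 10, 11, 10, 9, 30, 10, 11]
print(moving_threshold_alerts(series))
```

In this sketch the spike also enters both windows, so the prediction and sigma' are inflated for the next few points. That is the same weakness the later slides point out for averages that include outliers.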
23
Simple and Weighted Moving Averages
• Simple Moving Average
– Average of last N values in your time series
• S[t] <- sum(X[t-(N-1):t])/N
– Each value in the window contributes equally to
prediction
– …INCLUDING spikes and outliers
• Weighted Moving Average
– Similar to SMA but assigns linearly (arithmetically)
decreasing weights to every value in the window
– Older values contribute less to the prediction
24
Exponential Smoothing techniques
• Exponential Smoothing
– Similar to weighted average, but with weights that decay
exponentially over the whole set of historic samples
• S[t]=αX[t-1] + (1-α)S[t-1]
– Does not deal with trends in data
• Double Exponential Smoothing (DES)
– In addition to data smoothing factor (α), introduces a trend
smoothing factor (β)
– Better at dealing with trending
– Does not deal with seasonality in data
• Triple Exponential Smoothing (TES), Holt-Winters
– Introduces additional seasonality factor
– … and so on
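A rough sketch of single and double exponential smoothing used as one-step-ahead predictors. The single version follows the formula on the slide, with the assumption S[0] = X[0]. The double version is the usual Holt linear-trend recursion, initialised (as an assumption) with level X[0] and trend X[1] - X[0]. Function names are hypothetical, and the seasonal (Holt-Winters) case is not covered.

```python
def exp_smooth_forecasts(x, alpha):
    """S[t] = alpha * X[t-1] + (1 - alpha) * S[t-1], starting from S[0] = X[0]."""
    s = [x[0]]
    for t in range(1, len(x)):
        s.append(alpha * x[t - 1] + (1 - alpha) * s[t - 1])
    return s

def holt_forecasts(x, alpha, beta):
    """One-step-ahead forecasts from double exponential smoothing (level + trend)."""
    level, trend = x[0], x[1] - x[0]  # assumed initialisation
    forecasts = [x[0]]
    for t in range(1, len(x)):
        forecasts.append(level + trend)  # forecast for x[t] before seeing it
        new_level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts

# Hypothetical steadily trending metric.
ramp = [0, 2, 4, 6, 8, 10]
print(exp_smooth_forecasts(ramp, alpha=0.5))
print(holt_forecasts(ramp, alpha=0.5, beta=0.5))
```

On a steady ramp like this one, the single-smoothing forecast lags further and further behind, while the trend term lets the double version keep up.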
25
Example 1
26
Holt-Winters predictions
27
Example 2
28
Exponential smoothing predictions
29
Hmmmm, so are we doomed?
• No!
• ALL smoothing predictive methods work best
with normally distributed data!
• But there are lots of other non-Gaussian
based techniques
– We can only scratch the surface in this talk
30
Trick #1: Histogram!
31
THIS is normal
32
This isn’t
33
Neither is this
34
Trick #2: Kolmogorov-Smirnov test
• Non-parametric test
– Compare two probability
distributions
– Makes no assumptions (e.g.
Gaussian) about the
distributions of the samples
– Measures maximum
distance between
cumulative distributions
– Can be used to compare
periodic/seasonal metric
periods (e.g. day-to-day or
week-to-week)
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
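A minimal two-sample KS sketch in plain Python, for comparing two windows of the same metric (for example yesterday against today). The decision rule uses the standard large-sample critical value with c = 1.36 for roughly a 5% significance level. Function and variable names, and the data, are hypothetical.

```python
from bisect import bisect_right
from math import sqrt

def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions, so checking every data point is enough.
    for v in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_differs(a, b, c_alpha=1.36):
    """True if the two samples look like they come from different distributions."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c_alpha * sqrt((n + m) / (n * m))

# Hypothetical windows: the second one is shifted upwards by 50.
yesterday = list(range(100))
today = [v + 50 for v in yesterday]
print(ks_statistic(yesterday, today), ks_differs(yesterday, today))
```

Sliding this comparison along the series, window against the matching window one period earlier, is one way to apply the test to periodic metrics.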
35
KS with windowing
36
37
38
39
40
41
KS Test on difficult data
42
Trick #3: Box Plots / Tukey
• Again, need non-parametric method:
– Does not rely on mean and standard deviation
• When you can’t count on good old Gaussian:
– Median is always a great alternative to the mean
– Quartiles are an alternative to standard deviation
• Q1 = first quartile (25% of the data fall below it)
• Q2 = second quartile == Median (50% of the data fall below it)
• Q3 = third quartile (75% of the data fall below it)
• Interquartile Range (IQR) = Q3 – Q1
43
Example: box plots and fences for a Gaussian
http://en.wikipedia.org/wiki/Interquartile_range
44
IQR method for streaming time series
• IQR method works well for some non-normal
distributions
– Generates continuously adaptive fences at
(Q1 - 1.5xIQR) and (Q3 + 1.5xIQR)
– Adjusted box plot (for skewed data) uses asymmetric fences: the 1.5
factor on each side is rescaled by a skewness measure (the medcouple)
• Method:
– As time series is streaming, for every window:
• Re-compute quartiles
• Re-compute IQR, fences
• Determine if any outliers
• Repeat
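The streaming method can be sketched like this, assuming a fixed-size sliding window and the standard 1.5 x IQR fences (not the adjusted box plot). Each new value is checked against fences computed from the previous window. The names, the window size and the data are hypothetical.

```python
from collections import deque
from statistics import quantiles

def iqr_fences(window, k=1.5):
    """Lower and upper Tukey fences for the values in the window."""
    q1, _median, q3 = quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def streaming_iqr_outliers(stream, window_size=8, k=1.5):
    """Return (index, value) pairs that fall outside the fences of the preceding window."""
    window = deque(maxlen=window_size)
    outliers = []
    for t, x in enumerate(stream):
        if len(window) == window_size:
            low, high = iqr_fences(window, k)  # re-compute quartiles, IQR, fences
            if x < low or x > high:
                outliers.append((t, x))
        window.append(x)  # slide the window forward and repeat
    return outliers

# Hypothetical stream with one spike at index 9.
stream = [5, 6, 5, 7, 6, 5, 6, 7, 6, 50, 5]
print(streaming_iqr_outliers(stream))
```

Because quartiles barely move when a single extreme value enters the window, the fences stay tight after the spike, unlike the mean-based thresholds earlier in the talk.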
45
Example 1
46
Example 1 – Good
47
Example 2
48
Example 2 – Bad
49
Trick #4: Diffing/Derivatives
• Often, even when the data itself is not
stationary, its derivatives tend to be!
• Most frequently, first difference is sufficient:
dS(t) <- S(t+1) – S(t)
• Can then perform some analytics on first
difference
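A tiny sketch of the first difference, following the formula on the slide. The function name and the example values are hypothetical (a steadily growing counter with one jump).

```python
def first_difference(s):
    """dS[t] = S[t+1] - S[t]; the result is one element shorter than s."""
    return [b - a for a, b in zip(s, s[1:])]

# Hypothetical trending series: grows by 3 per step, with one jump.
counter = [100, 103, 106, 109, 112, 140, 143, 146]
print(first_difference(counter))
```

The raw series keeps drifting upwards, but its differences sit at a constant level apart from the jump, which makes them a better input for the threshold and fence techniques above.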
50
CPU time series
51
Its first difference – possible random walk?
52
Trick #5: Neural Networks
• Really?
• No time – To Be Continued!
53
We’re not doomed, but: Know your data!!
• You need to understand the statistical properties
of your data, and where it comes from, in order
to determine what kind of analytics to use.
– Your data is very important!
– You spend time collecting it so spend time analyzing
it!
• A large amount of data center data is non-Gaussian
– Gaussian statistics won’t work
– Use appropriate techniques
54
More?
• Only scratched the surface
• I want to talk more about algorithms, analytics,
current issues, etc., in more depth, but time’s up!!
– Come talk to me or email me if interested.
• Office Hour: Tomorrow at 11:30 Booth #801
• Thank you!
toufic@metaforsoftware.com
@tboubez
55
Oh yeah, and we’re hiring!
In Vancouver, Canada

Simple math to get signal out of your data noise - Anomaly Detection - Toufic Boubez - Metafor Software - Velocity Santa Clara 2014-06-25
