The definition of normal - An introduction and guide to anomaly detection.

What is normal behaviour?
How are expectations about future behaviour derived from data?
How do anomaly detection algorithms work, including trending and seasonality?
How do these algorithms know whether something is an anomaly?
Which algorithms can be used for which type of data?

Published in: Software

  1. ruxit theme 2014.05.15 The definition of normal: An introduction and guide to anomaly detection. Alois Reitbauer, ruxit, @aloisreitbauer
  2. Some background: Who I am and what I do
  5. Anomaly Detection: What is an anomaly anyway?
  6. What is an anomaly? In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as ………. Source: Wikipedia
  7. How many metrics would we have to look at? 3 metrics per service × 40 services = 120 metrics; 5 metrics per host × 20 hosts = 100 metrics; 5 metrics per runtime × 40 runtimes = 200 metrics. Total: 420 metrics.
  8. We cannot watch 400+ metrics, so we need to find ways to automate finding anomalies.
  9. Anomaly Detection Workflow (diagram). Elements: Historic Data, “Normal” Model, New Data, Hypothesis, Likeliness, Judgement, Anomaly? Edge labels: update, calculate, derive, test, produces, defines.
  10. We will look at three types of data. Response Times: Did our response times increase significantly? Error Rates: Did the error rate of any of our services change? Load: Is there anything unusual happening to our service load?
  11. Finding error rate anomalies: Are we having more errors than usual?
  12. How can we get our baseline? Average or Mean: Easy to calculate but does not learn over time. Median: Needs more raw data than the average, but is precise. Does not learn well either. Exponential Smoothing: Easy to calculate and learns over time.
  13. Using exponential smoothing for the baseline. Source: Wikipedia
  14. Example
  15. Is this an anomaly? Our Observation: Typical error rate of 3 percent at 10,000 transactions/min. Current System Behavior: During the night we see 5 errors in 100 requests.
  16. Binomial Distribution: Tells us how likely it is to see n successes in a certain number of trials.
  17. Applying Binomial Distribution to our problem (chart: likeliness of at least n errors, for n = 1 to 19). 18 % probability to see 5 or more errors.
  18. Derive an anomaly from a forecast: What is unlikely enough to be interpreted as an anomaly?
  19. 95 % Probability Window: Borrowing from the Standard Deviation
  20. Response Time Anomalies: Are our response times higher than usual?
  21. Challenges in finding response time anomalies
  22. Data representation is important
  23. Proper data representation with Median
  24. If our data were normally distributed … Mean: 500 ms, Std. Dev.: 100 ms. 68 % of values: 400 ms to 600 ms; 95 %: 300 ms to 700 ms; 99.7 %: 200 ms to 800 ms.
  25. However, we can generalize the model. In a normal distribution, μ corresponds to the median (50 percent of values lie below it) and μ + 2σ corresponds to roughly the 97th percentile (about 97.7 percent of values lie below it).
  26. Is this an anomaly? Our Observation: Usually we see a median response time of 300 ms. Current System Behavior: During the night, with low traffic, response times go up to 600 ms.
  27. Testing against new data. Our median response time is 300 ms and we measure: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms.
  28. Using Binomial distribution on median. Check all values above 300 ms: 200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms. 7 values are higher than the median. Is this normal?
  29. Applying percentile drift detection. We have a 50 percent likeliness to see values above the median. How likely is it that 7 or more out of 10 samples are higher? The probability is 17 percent, so we should not alert.
  30. Load Anomalies: Are we seeing unusually high or low load?
  31. We will look at three types of data. Seasonality: Load is often directly related to time-based usage. Trend: Growth patterns are not necessarily the source of a problem. We need a different approach.
  32. Holt-Winters Seasonal Forecasting
  33. Example
  34. Causality Analysis of Anomalies: How to derive meaningful information from anomalies.
  35. Anomalies vs. Health. Anomaly: A system does not show the expected behavior. Health: A system does not operate within well-defined boundaries.
  36. Health and Anomaly Matrix. No anomalies, healthy: operating normally. No anomalies, unhealthy: unstable system. Anomalies, healthy: resilient. Anomalies, unhealthy: operational issues.
  37. Judging Anomalies by Impact. 1st Degree Anomaly: CPU saturation on a host, or similar. 2nd Degree Anomaly: Application functionality affected. 3rd Degree Anomaly: Externally visible effects that users notice.
  38. Relationships of anomalies: Transferring system knowledge to monitoring systems
  39. The model
  40. Interpretation with expert knowledge. Strong Relationship: Response time slowdown impacted by CPU saturation. Potential Relationship: Response time slowdown potentially impacted by code deployment. No Relationship: CPU saturation not impacted by load drop.
  41. Distinguish Impact from Cause: How to infer root cause information from monitoring data
  42. Automated Analysis of Problems: Service slowdown
  43. Automated Analysis of Problems: Service slowdown; Dependent services slow down
  44. Automated Analysis of Problems: Service slowdown; Dependent services slow down; Users are affected
  45. Automated Analysis of Problems: Service slowdown; Dependent services slow down; Users are affected; Analyze dependencies
  46. Automated Analysis of Problems: Service slowdown; Dependent services slow down; Users are affected; Analyze dependencies; Exclude non-relevant services
  47. Automated Analysis of Problems: Service slowdown; Dependent services slow down; Users are affected; Analyze dependencies; Exclude non-relevant services; Follow causality chain
  48. Automated Analysis of Problems: Service slowdown; Dependent services slow down; Users are affected; Analyze dependencies; Exclude non-relevant services; Follow causality chain
  49. Real World Example
  50. Alois Reitbauer, @aloisreitbauer, alois.reitbauer@ruxit.com, blog.ruxit.com
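
The error-rate check from slides 12 to 19 can be sketched in Python roughly as follows. This is a minimal illustration, not the speaker's implementation: the function names, the smoothing factor alpha = 0.3, the sample history, and the 5 percent cut-off (loosely following the 95 % probability window on slide 19) are all assumptions.

```python
from math import comb


def exponential_smoothing(values, alpha=0.3):
    """Smoothed series: s_t = alpha * x_t + (1 - alpha) * s_(t-1).

    alpha = 0.3 is an assumed smoothing factor, not a value from the slides.
    """
    if not values:
        return []
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k errors in n requests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_error_anomaly(errors, requests, baseline_rate, significance=0.05):
    """Flag an anomaly if the observed error count is unlikely under the baseline.

    The 0.05 cut-off is an assumption inspired by the 95 % probability window.
    """
    return prob_at_least(errors, requests, baseline_rate) < significance


# Hypothetical history of per-interval error rates hovering around 3 percent.
history = [0.031, 0.029, 0.030, 0.032, 0.028, 0.030]
baseline = exponential_smoothing(history)[-1]
print(f"smoothed baseline error rate: {baseline:.4f}")

# Slide 15 scenario: 5 errors in 100 requests against a 3 percent error rate.
p = prob_at_least(5, 100, 0.03)
print(f"P(at least 5 errors in 100 requests) = {p:.3f}")
print("anomaly" if is_error_anomaly(5, 100, 0.03) else "no anomaly")
```

With a 3 percent baseline the tail probability comes out at roughly 18 percent, matching slide 17, so 5 errors in 100 requests would not be flagged under the assumed cut-off.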

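The percentile drift check from slides 27 to 29 can be sketched the same way. Again this is only an illustration under stated assumptions: the function names and the 5 percent alerting cut-off are hypothetical, while the sample values and the 300 ms baseline median are taken from the slides.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p).

    p = 0.5 because, by definition, half of all values lie above the median.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def median_drift(samples, baseline_median, significance=0.05):
    """Return (count above baseline, tail probability, alert flag).

    The 0.05 cut-off is an assumed alerting threshold, not from the slides.
    """
    above = sum(1 for s in samples if s > baseline_median)
    p_value = prob_at_least(above, len(samples))
    return above, p_value, p_value < significance


# Response times in milliseconds, as listed on slide 27.
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above, p_value, alert = median_drift(samples_ms, baseline_median=300)
print(f"{above} of {len(samples_ms)} samples above the median, "
      f"probability {p_value:.3f}, alert: {alert}")
```

Seven of the ten samples exceed 300 ms, and the chance of seeing seven or more is about 17 percent, so no alert is raised. For a different percentile, the same test would apply with p set to the expected share of values above that percentile.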