SIEMENS: RAPUNZEL ā A Tale About Knowledge Graph
Ā
Streaming Outlier Analysis for Fun and Scalability
1. Streaming Outlier Analysis for Fun and Scalability
Casey Stella
2016
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
2. Table of Contents
Streaming Analytics
Framework
Demos
Questions
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
3. Introduction
Hi, Iām Casey Stella!
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
4. Streaming Analytics
ā¢ The future involves non-trivial analytics done on streaming data
ā¢ Itās not just IoT
ā¢ There is a need for insights to keep pace with the velocity of your data
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
5. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
6. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
ā¢ The Bad: There is a lot of data and it comes at you fast
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
7. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
ā¢ The Bad: There is a lot of data and it comes at you fast
ā¢ The Good: Outlier analysis or anomaly detection is a killer-app
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
8. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
ā¢ The Bad: There is a lot of data and it comes at you fast
ā¢ The Good: Outlier analysis or anomaly detection is a killer-app
ā¢ The Bad: Outlier analysis can be computationally intensive
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
9. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
ā¢ The Bad: There is a lot of data and it comes at you fast
ā¢ The Good: Outlier analysis or anomaly detection is a killer-app
ā¢ The Bad: Outlier analysis can be computationally intensive
ā¢ The Good: There is no shortage of computational frameworks to handle streaming
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
10. Streaming Analytics
ā¢ The Good: Much of the data can be coerced into timeseries
ā¢ The Bad: There is a lot of data and it comes at you fast
ā¢ The Good: Outlier analysis or anomaly detection is a killer-app
ā¢ The Bad: Outlier analysis can be computationally intensive
ā¢ The Good: There is no shortage of computational frameworks to handle streaming
ā¢ The Bad: There are not an overabundance of high-quality outlier analysis
frameworks
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
11. Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which āinterestingā
points are diļ¬erentiated from ānormalā points. Often āinterestingā implies some sort of
error or state which should be researched further.
1
http://arxiv.org/pdf/1603.00567v1.pdf
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
12. Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which āinterestingā
points are diļ¬erentiated from ānormalā points. Often āinterestingā implies some sort of
error or state which should be researched further.
Macrobase1, an outlier analysis system built for IoT by MIT and Stanford and
Cambridge Mobile Telematics, noted several properties of IoT data:
ā¢ Data produced by IoT applications often have come from some āordinaryā
distribution
ā¢ IoT anomalies are often systemic
ā¢ They are often fairly rare
1
http://arxiv.org/pdf/1603.00567v1.pdf
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
13. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
14. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
ā¦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
ā¦ Gather a biased sample (biased by recency)
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
15. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
ā¦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
ā¦ Gather a biased sample (biased by recency)
ā¦ Extremely deterministic in space and cheap in computation
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
16. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
ā¦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
ā¦ Gather a biased sample (biased by recency)
ā¦ Extremely deterministic in space and cheap in computation
ā¢ For every outlier candidate
ā¦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
17. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
ā¦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
ā¦ Gather a biased sample (biased by recency)
ā¦ Extremely deterministic in space and cheap in computation
ā¢ For every outlier candidate
ā¦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
ā¦ Expensive computationally, but run infrequently
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
18. Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
ā¢ For every data point
ā¦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
ā¦ Gather a biased sample (biased by recency)
ā¦ Extremely deterministic in space and cheap in computation
ā¢ For every outlier candidate
ā¦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
ā¦ Expensive computationally, but run infrequently
This becomes a data ļ¬lter which can be attached to a timeseries data stream
within a distributed computational framework (i.e. Storm, Spark, Flink, NiFi)
to detect outliers.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
19. Sketchy Outlier Estimator: Median Absolute Deviation
ā¢ Median absolute deviation (or MAD) is a robust statistic
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
20. Sketchy Outlier Estimator: Median Absolute Deviation
ā¢ Median absolute deviation (or MAD) is a robust statistic
ā¦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
ā¦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
21. Sketchy Outlier Estimator: Median Absolute Deviation
ā¢ Median absolute deviation (or MAD) is a robust statistic
ā¦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
ā¦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
ā¢ The median absolute deviation is deļ¬ned for a series of univariate samples X with
Ėx =median(X), MAD(X)=median({āxi ā X||xi ā Ėx|}).
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
22. Sketchy Outlier Estimator: Median Absolute Deviation
ā¢ Median absolute deviation (or MAD) is a robust statistic
ā¦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
ā¦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
ā¢ The median absolute deviation is deļ¬ned for a series of univariate samples X with
Ėx =median(X), MAD(X)=median({āxi ā X||xi ā Ėx|}).
ā¢ A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
23. Sketchy Outlier Estimator: Median Absolute Deviation
ā¢ Median absolute deviation (or MAD) is a robust statistic
ā¦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
ā¦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
ā¢ The median absolute deviation is deļ¬ned for a series of univariate samples X with
Ėx =median(X), MAD(X)=median({āxi ā X||xi ā Ėx|}).
ā¢ A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
tl;dr: A formal way to encode our intuition: If a point is far away from the
ācentralā point of our window, then itās likely an outlier.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
26. Questions
Thanks for your attention! Questions?
ā¢ Code & scripts for this talk available at
http://github.com/cestella/streaming_outliers
ā¢ Find me at http://caseystella.com
ā¢ Twitter handle: @casey_stella
ā¢ Email address: cstella@hortonworks.com
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016