SlideShare a Scribd company logo
1 of 30
Download to read offline
Streaming Outlier Analysis for Fun and Scalability
Casey Stella
2016
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Table of Contents
Streaming Analytics
Framework
Demos
Questions
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Introduction
Hi, I’m Casey Stella!
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The future involves non-trivial analytics done on streaming data
• It’s not just IoT
• There is a need for insights to keep pace with the velocity of your data
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis or anomaly detection is a killer-app
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis or anomaly detection is a killer-app
• The Bad: Outlier analysis can be computationally intensive
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis or anomaly detection is a killer-app
• The Bad: Outlier analysis can be computationally intensive
• The Good: There is no shortage of computational frameworks to handle streaming
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis or anomaly detection is a killer-app
• The Bad: Outlier analysis can be computationally intensive
• The Good: There is no shortage of computational frameworks to handle streaming
• The Bad: There are not an overabundance of high-quality outlier analysis
frameworks
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which “interesting”
points are differentiated from “normal” points. Often “interesting” implies some sort of
error or state which should be researched further.
1
http://arxiv.org/pdf/1603.00567v1.pdf
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which “interesting”
points are differentiated from “normal” points. Often “interesting” implies some sort of
error or state which should be researched further.
Macrobase1, an outlier analysis system built for IoT by MIT and Stanford and
Cambridge Mobile Telematics, noted several properties of IoT data:
• Data produced by IoT applications often have come from some “ordinary”
distribution
• IoT anomalies are often systemic
• They are often fairly rare
1
http://arxiv.org/pdf/1603.00567v1.pdf
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
◦ Expensive computationally, but run infrequently
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
◦ Expensive computationally, but run infrequently
This becomes a data filter which can be attached to a timeseries data stream
within a distributed computational framework (i.e. Storm, Spark, Flink, NiFi)
to detect outliers.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
• The median absolute deviation is defined for a series of univariate samples X with
˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}).
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
• The median absolute deviation is defined for a series of univariate samples X with
˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}).
• A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
• The median absolute deviation is defined for a series of univariate samples X with
˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}).
• A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
tl;dr: A formal way to encode our intuition: If a point is far away from the
“central” point of our window, then it’s likely an outlier.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Architecture
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Architecture
This kind of architecture has a few characteristics that are interesting
• Aimed primarily at many different low to medium velocity time series data
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Architecture
This kind of architecture has a few characteristics that are interesting
• Aimed primarily at many different low to medium velocity time series data
• Aimed at many different one-dimensional data streams instead of outliers in
multidimensional data streams.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Architecture
This kind of architecture has a few characteristics that are interesting
• Aimed primarily at many different low to medium velocity time series data
• Aimed at many different one-dimensional data streams instead of outliers in
multidimensional data streams.
• Because probabalistic sketches are extremely compact, you can look much farther
back for your context than a naive windowing solution
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Architecture
This kind of architecture has a few characteristics that are interesting
• Aimed primarily at many different low to medium velocity time series data
• Aimed at many different one-dimensional data streams instead of outliers in
multidimensional data streams.
• Because probabalistic sketches are extremely compact, you can look much farther
back for your context than a naive windowing solution
• Send outliers (lower velocity and number) and send raw time series to a TSDB
capable of handling scale. Investigate the data via a dashboard that can marry the
two into a single pane of glass.
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Demos
Demos
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available at
http://github.com/cestella/streaming_outliers
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016

More Related Content

Viewers also liked

Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit
 

Viewers also liked (20)

Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Spark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik SivashanmugamSpark Summit EU talk by Kaarthik Sivashanmugam
Spark Summit EU talk by Kaarthik Sivashanmugam
 
Spark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the streamSpark Summit EU talk by Miklos Christine paddling up the stream
Spark Summit EU talk by Miklos Christine paddling up the stream
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 
Spark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa SawyerSpark Summit EU talk by Larisa Sawyer
Spark Summit EU talk by Larisa Sawyer
 
Spark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus GoehausenSpark Summit EU talk by Nimbus Goehausen
Spark Summit EU talk by Nimbus Goehausen
 
Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 

Similar to Spark Summit EU talk by Casey Stella

Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
Neeraj Tewari
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
Henock Beyene
 
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
tboubez
 
Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic Visualization
Sri Ambati
 
11-Statistical-Tests.pptx
11-Statistical-Tests.pptx11-Statistical-Tests.pptx
11-Statistical-Tests.pptx
Shree Shree
 

Similar to Spark Summit EU talk by Casey Stella (20)

Slides for "Do Deep Generative Models Know What They Don't know?"
Slides for "Do Deep Generative Models Know What They Don't know?"Slides for "Do Deep Generative Models Know What They Don't know?"
Slides for "Do Deep Generative Models Know What They Don't know?"
 
Data Preparation for Data Science
Data Preparation for Data ScienceData Preparation for Data Science
Data Preparation for Data Science
 
Data Preparation of Data Science
Data Preparation of Data ScienceData Preparation of Data Science
Data Preparation of Data Science
 
Big Data Real Time Training in Chennai
Big Data Real Time Training in ChennaiBig Data Real Time Training in Chennai
Big Data Real Time Training in Chennai
 
Big Data 101 - An introduction
Big Data 101 - An introductionBig Data 101 - An introduction
Big Data 101 - An introduction
 
Week_2_Lecture.pdf
Week_2_Lecture.pdfWeek_2_Lecture.pdf
Week_2_Lecture.pdf
 
Linked science presentation 25
Linked science presentation 25Linked science presentation 25
Linked science presentation 25
 
Statistics
StatisticsStatistics
Statistics
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
 
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
Velocity Europe 2013: Beyond Pretty Charts: Analytics for the cloud infrastru...
 
Stance classification. Presentation at QMUL 11 Nov 2020
Stance classification. Presentation at QMUL 11 Nov 2020Stance classification. Presentation at QMUL 11 Nov 2020
Stance classification. Presentation at QMUL 11 Nov 2020
 
Outlier analysis and anomaly detection
Outlier analysis and anomaly detectionOutlier analysis and anomaly detection
Outlier analysis and anomaly detection
 
Stance classification. Uni Cambridge 22 Jan 2021
Stance classification. Uni Cambridge 22 Jan 2021Stance classification. Uni Cambridge 22 Jan 2021
Stance classification. Uni Cambridge 22 Jan 2021
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronMaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
Automatic Visualization
Automatic VisualizationAutomatic Visualization
Automatic Visualization
 
11-Statistical-Tests.pptx
11-Statistical-Tests.pptx11-Statistical-Tests.pptx
11-Statistical-Tests.pptx
 
Data Wrangling_1.pptx
Data Wrangling_1.pptxData Wrangling_1.pptx
Data Wrangling_1.pptx
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
MRMW N America 2016 presentation kelly and zanutto naxion
MRMW N America 2016 presentation kelly and zanutto naxionMRMW N America 2016 presentation kelly and zanutto naxion
MRMW N America 2016 presentation kelly and zanutto naxion
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 

Recently uploaded (20)

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men  🔝Mathura🔝   Escorts...
➥🔝 7737669865 🔝▻ Mathura Call-girls in Women Seeking Men 🔝Mathura🔝 Escorts...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 

Spark Summit EU talk by Casey Stella

  • 1. Streaming Outlier Analysis for Fun and Scalability Casey Stella 2016 Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 2. Table of Contents Streaming Analytics Framework Demos Questions Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 3. Introduction Hi, I’m Casey Stella! Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 4. Streaming Analytics • The future involves non-trivial analytics done on streaming data • It’s not just IoT • There is a need for insights to keep pace with the velocity of your data Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 5. Streaming Analytics • The Good: Much of the data can be coerced into timeseries Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 6. Streaming Analytics • The Good: Much of the data can be coerced into timeseries • The Bad: There is a lot of data and it comes at you fast Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 7. Streaming Analytics • The Good: Much of the data can be coerced into timeseries • The Bad: There is a lot of data and it comes at you fast • The Good: Outlier analysis or anomaly detection is a killer-app Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 8. Streaming Analytics • The Good: Much of the data can be coerced into timeseries • The Bad: There is a lot of data and it comes at you fast • The Good: Outlier analysis or anomaly detection is a killer-app • The Bad: Outlier analysis can be computationally intensive Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 9. Streaming Analytics • The Good: Much of the data can be coerced into timeseries • The Bad: There is a lot of data and it comes at you fast • The Good: Outlier analysis or anomaly detection is a killer-app • The Bad: Outlier analysis can be computationally intensive • The Good: There is no shortage of computational frameworks to handle streaming Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 10. Streaming Analytics • The Good: Much of the data can be coerced into timeseries • The Bad: There is a lot of data and it comes at you fast • The Good: Outlier analysis or anomaly detection is a killer-app • The Bad: Outlier analysis can be computationally intensive • The Good: There is no shortage of computational frameworks to handle streaming • The Bad: There are not an overabundance of high-quality outlier analysis frameworks Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 11. Outlier Analysis Outlier analysis or anomaly detection is the analytical technique by which “interesting” points are differentiated from “normal” points. Often “interesting” implies some sort of error or state which should be researched further. 1 http://arxiv.org/pdf/1603.00567v1.pdf Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 12. Outlier Analysis Outlier analysis or anomaly detection is the analytical technique by which “interesting” points are differentiated from “normal” points. Often “interesting” implies some sort of error or state which should be researched further. Macrobase1, an outlier analysis system built for IoT by MIT and Stanford and Cambridge Mobile Telematics, noted several properties of IoT data: • Data produced by IoT applications often have come from some “ordinary” distribution • IoT anomalies are often systemic • They are often fairly rare 1 http://arxiv.org/pdf/1603.00567v1.pdf Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 13. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 14. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees) ◦ Gather a biased sample (biased by recency) Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 15. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees) ◦ Gather a biased sample (biased by recency) ◦ Extremely deterministic in space and cheap in computation Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 16. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees) ◦ Gather a biased sample (biased by recency) ◦ Extremely deterministic in space and cheap in computation • For every outlier candidate ◦ Use traditional, more computationally complex approaches to outlier analysis (e.g. Robust PCA) on the biased sample Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 17. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees) ◦ Gather a biased sample (biased by recency) ◦ Extremely deterministic in space and cheap in computation • For every outlier candidate ◦ Use traditional, more computationally complex approaches to outlier analysis (e.g. Robust PCA) on the biased sample ◦ Expensive computationally, but run infrequently Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 18. Outlier Analysis: A Hybrid Approach In order to function at scale, a two-phase approach is taken • For every data point ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees) ◦ Gather a biased sample (biased by recency) ◦ Extremely deterministic in space and cheap in computation • For every outlier candidate ◦ Use traditional, more computationally complex approaches to outlier analysis (e.g. Robust PCA) on the biased sample ◦ Expensive computationally, but run infrequently This becomes a data filter which can be attached to a timeseries data stream within a distributed computational framework (i.e. Storm, Spark, Flink, NiFi) to detect outliers. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 19. Sketchy Outlier Estimator: Median Absolute Deviation • Median absolute deviation (or MAD) is a robust statistic Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 20. Sketchy Outlier Estimator: Median Absolute Deviation • Median absolute deviation (or MAD) is a robust statistic ◦ Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions ◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 21. Sketchy Outlier Estimator: Median Absolute Deviation • Median absolute deviation (or MAD) is a robust statistic ◦ Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions ◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. • The median absolute deviation is defined for a series of univariate samples X with ˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}). Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 22. Sketchy Outlier Estimator: Median Absolute Deviation • Median absolute deviation (or MAD) is a robust statistic ◦ Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions ◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. • The median absolute deviation is defined for a series of univariate samples X with ˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}). • A point is considered an outlier if its distance from the current window median, scaled by the MAD for the previous window, is above a threshold. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 23. Sketchy Outlier Estimator: Median Absolute Deviation • Median absolute deviation (or MAD) is a robust statistic ◦ Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions ◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers. • The median absolute deviation is defined for a series of univariate samples X with ˜x =median(X), MAD(X)=median({∀xi ∈ X||xi − ˜x|}). • A point is considered an outlier if its distance from the current window median, scaled by the MAD for the previous window, is above a threshold. tl;dr: A formal way to encode our intuition: If a point is far away from the “central” point of our window, then it’s likely an outlier. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 24. Architecture Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 25. Architecture This kind of architecture has a few characteristics that are interesting • Aimed primarily at many different low to medium velocity time series data Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 26. Architecture This kind of architecture has a few characteristics that are interesting • Aimed primarily at many different low to medium velocity time series data • Aimed at many different one-dimensional data streams instead of outliers in multidimensional data streams. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 27. Architecture This kind of architecture has a few characteristics that are interesting • Aimed primarily at many different low to medium velocity time series data • Aimed at many different one-dimensional data streams instead of outliers in multidimensional data streams. • Because probabalistic sketches are extremely compact, you can look much farther back for your context than a naive windowing solution Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 28. Architecture This kind of architecture has a few characteristics that are interesting • Aimed primarily at many different low to medium velocity time series data • Aimed at many different one-dimensional data streams instead of outliers in multidimensional data streams. • Because probabalistic sketches are extremely compact, you can look much farther back for your context than a naive windowing solution • Send outliers (lower velocity and number) and send raw time series to a TSDB capable of handling scale. Investigate the data via a dashboard that can marry the two into a single pane of glass. Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 29. Demos Demos Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 30. Questions Thanks for your attention! Questions? • Code & scripts for this talk available at http://github.com/cestella/streaming_outliers • Find me at http://caseystella.com • Twitter handle: @casey_stella • Email address: cstella@hortonworks.com Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016