Streaming Outlier Analysis for Fun and Scalability
Casey Stella
2016
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Table of Contents
Streaming Analytics
Framework
Demos
Questions
Introduction
Hi, I’m Casey Stella!
Streaming Analytics
• The future involves non-trivial analytics done on streaming data
• It’s not just IoT
• There is a need for insights to keep pace with the velocity of your data
Streaming Analytics
• The Good: Much of the data can be coerced into timeseries
• The Bad: There is a lot of data and it comes at you fast
• The Good: Outlier analysis or anomaly detection is a killer-app
• The Bad: Outlier analysis can be computationally intensive
• The Good: There is no shortage of computational frameworks to handle streaming
• The Bad: There is not an overabundance of high-quality outlier analysis
frameworks
Outlier Analysis
Outlier analysis or anomaly detection is the analytical technique by which “interesting”
points are differentiated from “normal” points. Often “interesting” implies some sort of
error or state which should be researched further.
MacroBase [1], an outlier analysis system built for IoT by MIT, Stanford, and
Cambridge Mobile Telematics, noted several properties of IoT data:
• Data produced by IoT applications often comes from some “ordinary”
distribution
• IoT anomalies are often systemic
• They are often fairly rare
[1] http://arxiv.org/pdf/1603.00567v1.pdf
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
◦ Expensive computationally, but run infrequently
Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
Outlier Analysis: A Hybrid Approach
In order to function at scale, a two-phase approach is taken
• For every data point
◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute
deviation) that uses distributional sketching (e.g. Q-trees)
◦ Gather a biased sample (biased by recency)
◦ Extremely deterministic in space and cheap in computation
• For every outlier candidate
◦ Use traditional, more computationally complex approaches to outlier analysis (e.g.
Robust PCA) on the biased sample
◦ Expensive computationally, but run infrequently
This becomes a data filter which can be attached to a timeseries data stream
within a distributed computational framework (e.g. Storm, Spark, Flink, NiFi)
to detect outliers.
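The two-phase filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the talk: exact median/MAD over a small bounded window stands in for the distributional sketch, a simple fill-fraction reservoir is one possible way to bias the sample toward recent points, and a crude range check (`outside_sample_range`) stands in for an expensive second-phase model such as Robust PCA. The class name, parameter names, and default values (`window`, `sample_size`, `threshold`) are all hypothetical.

```python
import random
from collections import deque
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x_i - median(values)|."""
    center = median(values)
    return median(abs(x - center) for x in values)


class TwoPhaseFilter:
    """Cheap candidate test on every point, expensive test on candidates only."""

    def __init__(self, confirm, window=50, sample_size=20, threshold=3.5, seed=0):
        self.confirm = confirm              # phase 2: callable(point, sample) -> bool
        self.window = deque(maxlen=window)  # assumption: stand-in for a sketch
        self.sample = []                    # recency-biased sample
        self.sample_size = sample_size
        self.threshold = threshold          # assumption: illustrative default
        self.rng = random.Random(seed)

    def _update_sample(self, point):
        # With probability equal to the fill fraction, overwrite a random slot;
        # otherwise grow the sample. The longer a point stays, the more chances
        # it has to be overwritten, so the sample skews toward recent data.
        fill = len(self.sample) / self.sample_size
        if self.rng.random() < fill:
            self.sample[self.rng.randrange(len(self.sample))] = point
        else:
            self.sample.append(point)

    def process(self, point):
        """Return True when `point` is a candidate and phase 2 confirms it."""
        candidate = False
        if len(self.window) >= 5:  # wait for a little context first
            distance = abs(point - median(self.window))
            candidate = distance > self.threshold * mad(self.window)
        # Phase 2 runs only for the (rare) candidates.
        confirmed = candidate and self.confirm(point, list(self.sample))
        self.window.append(point)
        self._update_sample(point)
        return confirmed


def outside_sample_range(point, sample):
    # Hypothetical stand-in for an expensive model such as Robust PCA:
    # confirm only points well outside the range seen in the sample.
    lo, hi = min(sample), max(sample)
    margin = hi - lo
    return point < lo - margin or point > hi + margin


stream = [10, 11, 9, 10, 12, 10, 11, 10, 9, 11, 100, 10]
outlier_filter = TwoPhaseFilter(confirm=outside_sample_range)
flagged = [x for x in stream if outlier_filter.process(x)]
print(flagged)
```

The first phase touches every point but keeps only bounded state, while `confirm` is invoked only for points that pass the cheap test, which is what keeps the expensive step infrequent.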
Sketchy Outlier Estimator: Median Absolute Deviation
• Median absolute deviation (or MAD) is a robust statistic
◦ Robust statistics are statistics with good performance for data drawn from a wide range
of non-normally distributed probability distributions
◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the
presence of outliers.
• The median absolute deviation is defined for a series of univariate samples X with
$\tilde{x} = \operatorname{median}(X)$ as $\operatorname{MAD}(X) = \operatorname{median}(\{\, |x_i - \tilde{x}| : x_i \in X \,\})$.
• A point is considered an outlier if its distance from the current window median,
scaled by the MAD for the previous window, is above a threshold.
tl;dr: A formal way to encode our intuition: If a point is far away from the
“central” point of our window, then it’s likely an outlier.
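As a rough illustration of the rule above, the following Python sketch computes the MAD exactly and flags a point when its distance from the current window's median, scaled by the previous window's MAD, exceeds a threshold. It is a sketch under assumptions: the function names, the default `threshold=3.5`, and the handling of a zero MAD are illustrative choices rather than anything stated in the slides, and the statistics are computed exactly here rather than from a sketch.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x_i - median(values)|."""
    center = median(values)
    return median(abs(x - center) for x in values)


def is_outlier(point, current_window, previous_window, threshold=3.5):
    """Flag `point` when its distance from the current window's median,
    scaled by the previous window's MAD, is above `threshold`."""
    scale = mad(previous_window)
    distance = abs(point - median(current_window))
    if scale == 0:
        # Assumption: if the previous window was constant, treat any
        # deviation at all as an outlier instead of dividing by zero.
        return distance > 0
    return distance / scale > threshold


previous = [10, 11, 9, 10, 12, 10, 11]
current = [10, 11, 10, 9, 11]
print(is_outlier(50, current, previous))  # True: |50 - 10| / 1 = 40
print(is_outlier(12, current, previous))  # False: |12 - 10| / 1 = 2
```

Because both the center and the scale are medians, a single extreme value in either window barely moves them, which is the robustness property the slide contrasts with the mean/standard deviation pair.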
Architecture
This kind of architecture has a few characteristics that are interesting
• Aimed primarily at many different low- to medium-velocity time series
• Aimed at many different one-dimensional data streams instead of outliers in
multidimensional data streams.
• Because probabilistic sketches are extremely compact, you can look much farther
back for your context than a naive windowing solution
• Send the outliers (lower in velocity and volume) along with the raw time series to a
TSDB capable of handling the scale. Investigate the data via a dashboard that can
marry the two into a single pane of glass.
Demos
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available at
http://github.com/cestella/streaming_outliers
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfEmmanuel Dauda
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...Amil baba
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...BabaJohn3
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfMichaelSenkow
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeralNABLAS株式会社
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfscitechtalktv
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsBrainSell Technologies
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Valters Lauzums
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一fztigerwe
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一w7jl3eyno
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...ssuserf63bd7
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一0uyfyq0q4
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 

Recently uploaded (20)

2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...Genuine love spell caster )! ,+27834335081)   Ex lover back permanently in At...
Genuine love spell caster )! ,+27834335081) Ex lover back permanently in At...
 
AI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdfAI Imagen for data-storytelling Infographics.pdf
AI Imagen for data-storytelling Infographics.pdf
 
社内勉強会資料  Mamba - A new era or ephemeral
社内勉強会資料   Mamba - A new era or ephemeral社内勉強会資料   Mamba - A new era or ephemeral
社内勉強会資料  Mamba - A new era or ephemeral
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptx
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp online
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
如何办理澳洲悉尼大学毕业证(USYD毕业证书)学位证书成绩单原版一比一
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
 

Spark Summit EU talk by Casey Stella

  • 1. Streaming Outlier Analysis for Fun and Scalability Casey Stella 2016 Casey Stella (Hortonworks) Streaming Outlier Analysis for Fun and Scalability 2016
  • 2. Table of Contents
      • Streaming Analytics
      • Framework
      • Demos
      • Questions
  • 3. Introduction
      Hi, I’m Casey Stella!
  • 4. Streaming Analytics
      • The future involves non-trivial analytics done on streaming data
      • It’s not just IoT
      • There is a need for insights to keep pace with the velocity of your data
  • 10. Streaming Analytics
      • The Good: Much of the data can be coerced into time series
      • The Bad: There is a lot of data and it comes at you fast
      • The Good: Outlier analysis or anomaly detection is a killer app
      • The Bad: Outlier analysis can be computationally intensive
      • The Good: There is no shortage of computational frameworks to handle streaming
      • The Bad: There is not an overabundance of high-quality outlier analysis frameworks
  • 12. Outlier Analysis
      Outlier analysis or anomaly detection is the analytical technique by which “interesting” points are differentiated from “normal” points. Often “interesting” implies some sort of error or state which should be researched further.
      MacroBase¹, an outlier analysis system built for IoT by MIT, Stanford, and Cambridge Mobile Telematics, noted several properties of IoT data:
      • Data produced by IoT applications often come from some “ordinary” distribution
      • IoT anomalies are often systemic
      • They are often fairly rare
      ¹ http://arxiv.org/pdf/1603.00567v1.pdf
  • 18. Outlier Analysis: A Hybrid Approach
      To function at scale, a two-phase approach is taken:
      • For every data point
          ◦ Detect outlier candidates using a robust estimator of variability (e.g. median absolute deviation) that uses distributional sketching (e.g. Q-trees)
          ◦ Gather a biased sample (biased by recency)
          ◦ Deterministic in space and cheap in computation
      • For every outlier candidate
          ◦ Use traditional, more computationally complex approaches to outlier analysis (e.g. Robust PCA) on the biased sample
          ◦ Expensive computationally, but run infrequently
      This becomes a data filter which can be attached to a time series data stream within a distributed computational framework (e.g. Storm, Spark, Flink, NiFi) to detect outliers.
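The two-phase filter described on this slide can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the talk's implementation: an exact bounded window stands in for the distributional sketch, a single window is used for both the median and the MAD, the recency bias follows the general idea of an exponentially biased reservoir, and `strict_mad_check` is a hypothetical stand-in for an expensive second phase such as Robust PCA. All class names, function names, and thresholds are invented for the example.

```python
import random
from collections import deque
from statistics import median


def mad(xs):
    """Median absolute deviation of a non-empty sequence."""
    center = median(xs)
    return median(abs(x - center) for x in xs)


def is_far(x, data, threshold):
    """True if x lies more than `threshold` MADs from the median of data."""
    center, spread = median(data), mad(data)
    if spread == 0:
        return x != center
    return abs(x - center) / spread > threshold


class RecencyBiasedSample:
    """Fixed-capacity sample in which older points are gradually overwritten."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = []
        self.rng = rng or random.Random()

    def add(self, x):
        fill = len(self.items) / self.capacity
        if self.rng.random() < fill:
            # Overwrite a random slot; always taken once the sample is full.
            self.items[self.rng.randrange(len(self.items))] = x
        else:
            self.items.append(x)


def strict_mad_check(x, sample, threshold=6.0):
    """Hypothetical stand-in for a costly analysis (e.g. Robust PCA)."""
    return len(sample) >= 2 and is_far(x, sample, threshold)


class TwoPhaseFilter:
    def __init__(self, expensive_check, window=100, threshold=3.5,
                 sample_size=50, min_points=10, rng=None):
        self.expensive_check = expensive_check
        self.window = deque(maxlen=window)  # exact stand-in for a sketch
        self.threshold = threshold
        self.min_points = min_points
        self.sample = RecencyBiasedSample(sample_size, rng)
        self.expensive_calls = 0

    def observe(self, x):
        """Process one point; return True if it is reported as an outlier."""
        # Phase 1: cheap test, run on every point.
        candidate = (len(self.window) >= self.min_points
                     and is_far(x, self.window, self.threshold))
        # Phase 2: expensive test, run only on candidates.
        outlier = False
        if candidate:
            self.expensive_calls += 1
            outlier = self.expensive_check(x, list(self.sample.items))
        self.window.append(x)
        self.sample.add(x)
        return outlier


detector = TwoPhaseFilter(strict_mad_check, rng=random.Random(0))
stream = [10 + (i % 5) for i in range(200)] + [500]
flags = [detector.observe(x) for x in stream]
```

On this synthetic stream only the final value reaches the second phase, which illustrates the point of the design: the costly analysis runs once per candidate rather than once per data point.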
  • 23. Sketchy Outlier Estimator: Median Absolute Deviation
      • Median absolute deviation (or MAD) is a robust statistic
          ◦ Robust statistics are statistics with good performance for data drawn from a wide range of non-normally distributed probability distributions
          ◦ Unlike the standard mean/standard deviation combo, MAD is not sensitive to the presence of outliers.
      • The median absolute deviation is defined for a series of univariate samples $X$ with $\tilde{x} = \operatorname{median}(X)$ as $\operatorname{MAD}(X) = \operatorname{median}(\{\, |x_i - \tilde{x}| : x_i \in X \,\})$.
      • A point is considered an outlier if its distance from the current window median, scaled by the MAD for the previous window, is above a threshold.
      tl;dr: A formal way to encode our intuition: If a point is far away from the “central” point of our window, then it’s likely an outlier.
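As a small worked illustration of the MAD test on this slide, the sketch below computes the statistic exactly with Python's standard library rather than from a distributional sketch. The function names and the threshold of 3.5 are assumptions chosen for the example; the slides do not state a threshold value.

```python
from statistics import median


def mad(xs):
    """Median absolute deviation: median of |x_i - median(X)|."""
    center = median(xs)
    return median(abs(x - center) for x in xs)


def mad_score(x, window_median, previous_mad):
    """Distance from the window median, scaled by the previous window's MAD."""
    if previous_mad == 0:
        # Degenerate case: any deviation from a constant window is infinite.
        return 0.0 if x == window_median else float("inf")
    return abs(x - window_median) / previous_mad


def is_outlier_candidate(x, current_window, previous_window, threshold=3.5):
    """Threshold is an illustrative assumption, not a value from the talk."""
    score = mad_score(x, median(current_window), mad(previous_window))
    return score > threshold


previous = [10, 11, 9, 10, 12, 10, 11]   # median 10, MAD 1
current = [10, 11, 10, 9, 11]            # median 10

print(is_outlier_candidate(25, current, previous))  # score 15 -> True
print(is_outlier_candidate(11, current, previous))  # score 1  -> False

# Robustness: one extreme value barely moves the MAD.
print(mad([1, 2, 3, 4, 1000]))  # -> 1
```

The last line shows why MAD is preferred here over the standard deviation: the value 1000 dominates the standard deviation of that list but leaves the MAD at 1.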
  • 24. Architecture
  • 28. Architecture
      This kind of architecture has a few interesting characteristics:
      • Aimed primarily at many different low- to medium-velocity time series
      • Aimed at many different one-dimensional data streams instead of outliers in multidimensional data streams.
      • Because probabilistic sketches are extremely compact, you can look much farther back for your context than a naive windowing solution
      • Send both the outliers (lower in velocity and number) and the raw time series to a TSDB capable of handling scale. Investigate the data via a dashboard that can marry the two into a single pane of glass.
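To make the compactness point concrete, here is a deliberately simplified fixed-size summary that answers approximate quantile queries. It is an assumption-laden stand-in, not the sketch structure used in the talk: it is a plain equal-width histogram, it needs the value range (`lo`, `hi`) up front, and its error is bounded only by the bin width. The class name and parameters are hypothetical.

```python
class HistogramSketch:
    """Equal-width histogram whose memory use does not grow with the stream."""

    def __init__(self, lo, hi, bins=100):
        self.lo, self.hi = lo, hi
        self.width = (hi - lo) / bins
        self.counts = [0] * bins
        self.total = 0

    def add(self, x):
        idx = int((x - self.lo) / self.width)
        # Clamp out-of-range values into the first or last bin.
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def quantile(self, q):
        """Approximate q-quantile: midpoint of the bin holding the target rank."""
        if self.total == 0:
            raise ValueError("empty sketch")
        target = q * self.total
        seen = 0
        for idx, count in enumerate(self.counts):
            seen += count
            if count > 0 and seen >= target:
                return self.lo + (idx + 0.5) * self.width
        return self.hi


sketch = HistogramSketch(lo=0, hi=1000, bins=100)
for value in range(1000):
    sketch.add(value)

approx_median = sketch.quantile(0.5)  # true median of 0..999 is 499.5
```

A naive window over N points stores N values, whereas this summary keeps a fixed number of counters no matter how many points it has seen, which is what allows a much longer look-back for the same memory budget.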
  • 29. Demos
  • 30. Questions
      Thanks for your attention! Questions?
      • Code & scripts for this talk available at http://github.com/cestella/streaming_outliers
      • Find me at http://caseystella.com
      • Twitter handle: @casey_stella
      • Email address: cstella@hortonworks.com