SlideShare a Scribd company logo
1 of 20
Online Change Point
Detection Using Spark
Streaming
Michal Monselise
PyData Seattle 2017
A Bit About Me
 Senior Data Scientist @ DS-IQ
 Instructor at UW PCE
Applications of Change Point Detection
Genomics
Marketing
Finance
Genomics
 Change points in genomics can help identify genes that are
damaged
 We can look at the change points in aCGH profiles of genes to
identify their involvement in cancer or other diseases
Marketing
Finance
Streaming
 Most change point detection algorithms give us a static analysis
of the change points in the data
 However, with the advances in streaming, there is a greater
need for real time analysis for change points
 This project utilizes Spark Streaming since it is best suited for
this use case
Methodology
 There are two different ways of performing online change point
detection
 Non parametric
 Bayesian
𝑖𝑓 𝑑 𝑊𝑖𝑛𝑑𝑜𝑤1, 𝑊𝑖𝑛𝑑𝑜𝑤2 > 𝛼 𝑡ℎ𝑒𝑛 𝑤𝑒 ℎ𝑎𝑣𝑒 𝑎 𝑐ℎ𝑎𝑛𝑔𝑒 𝑝𝑜𝑖𝑛𝑡
It is up to us to define d
Non Parametric Algorithm
 Bayes Theorem:
 This means that we create a posterior distribution that is
proportional to the product of the prior distribution (the initial
assumption) and the likelihood (the observed data)
Bayesian Statistics
Bayesian Algorithm
 Using a Bayesian algorithm means that we will update the
distribution after every new observation and obtain a new
posterior distribution
 This solution works well with real time change point detection
Time Between Change Points
 The main idea here is that instead of modeling the data, we model the
time till a change point.
 𝑃 𝑟𝑡 𝑟𝑡−1 =
𝐻 𝑟𝑡−1 + 1
1 − 𝐻 𝑟𝑡−1 + 1
0
𝑖𝑓 𝑟𝑡 = 0
𝑖𝑓 𝑟𝑡 = 𝑟𝑡−1 + 1
𝑂𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
 Where 𝐻 𝜏 is the hazard function 𝐻 𝜏 =
𝑃 𝑔𝑎𝑝(𝑔=𝜏)
𝑡=𝜏
∞ 𝑃 𝑔𝑎𝑝(𝑔=𝑡)
 𝑃𝑔𝑎𝑝 is the discrete a priori probability distribution over the interval
between change points
Limitations of Change Point Detection in
Streaming
 This algorithm is an improvement since it accounts for change
points in real time rather than identifying change points after
the fact
 However, we still need to store all of the previous observations
in order to compute the probability of a change point in real
time
 In a streaming scenario, this can get out of hand
 We do not expect to use all possible data to create a posterior
distribution
 For example, we do not expect to have all possible temperature
measurements from the dawn of time when we try to evaluate
global warming
Limitations of Change Point Detection in
Streaming
 In Spark Streaming, we can only retain data between micro
batches using an accumulator variable
 We can use a default numeric accumulator or create our own
accumulator class
Accumulators in Spark
Accumulators in Spark
 Therefore we create an accumulator queue with a size limit
 We use this queue to compute the posterior distribution using
only the latest MAX_QUEUE_SIZE observations
 We would like to measure when it is hot or cold
 One approach is to find when weather shifts abruptly
 Change Point Detection is a good methodology to find these
shifts
Case Study – Weather
 We look at the temperature for zip code 98006 (Bellevue, WA)
between September 2016 and February 2017
 Identifying the change points in the data will help us identify
abrupt temperature shifts
Case Study – Weather
Case Study – Weather
 Our hypothesis is that sudden changes in weather influence
consumer behavior more than absolute temperatures
 We also think that these changes influence behavior more than
relative temperature (number of standard deviations from the
mean)
References
 Adams RP, MacKay DJC. Bayesian online change point
detection. University of Cambridge Technical Report; 2007.
 Kifer D., Ben-David S, and Gehrke J. Detecting change in data
streams. In Proceedings of the International Conference on Very
Large Data Bases, Toronto, Canada, pp. 180–191; 2004.
 Muggeo VMR, Adelfio G. Efficient change point detection for
genomic sequences of continuous measurements.
Bioinformatics 27 (2): 161-166; 2011.

More Related Content

Similar to Michal Monselise - Online change point detection using spark streaming

V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsElinor Velasquez
 
Exponential software reliability using SPRT: MLE
Exponential software reliability using SPRT: MLEExponential software reliability using SPRT: MLE
Exponential software reliability using SPRT: MLEIOSR Journals
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATAIJCSEA Journal
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetIJCERT
 
Internship_presentation
Internship_presentationInternship_presentation
Internship_presentationAditya Gautam
 
PPT.pdf internship demo on machine lerning
PPT.pdf internship demo on machine lerningPPT.pdf internship demo on machine lerning
PPT.pdf internship demo on machine lerningMisbanausheen1
 
IEEE Fuzzy system Title and Abstract 2016
IEEE Fuzzy system Title and Abstract 2016 IEEE Fuzzy system Title and Abstract 2016
IEEE Fuzzy system Title and Abstract 2016 tsysglobalsolutions
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithmsFarhan Zaki
 
Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...ISA Interchange
 
ENGR 132 Final Project
ENGR 132 Final ProjectENGR 132 Final Project
ENGR 132 Final ProjectMia Sheppard
 
Progressive Duplicate Detection
Progressive Duplicate DetectionProgressive Duplicate Detection
Progressive Duplicate Detection1crore projects
 
Machine learning and Internet of Things, the future of medical prevention
Machine learning and Internet of Things, the future of medical preventionMachine learning and Internet of Things, the future of medical prevention
Machine learning and Internet of Things, the future of medical preventionPierre Gutierrez
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectKushal417
 
part-4-structural-equation-modelling-qr.pptx
part-4-structural-equation-modelling-qr.pptxpart-4-structural-equation-modelling-qr.pptx
part-4-structural-equation-modelling-qr.pptxKaRim295737
 
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTINGPERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTINGIJMIT JOURNAL
 
Towards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingTowards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingPlanetData Network of Excellence
 

Similar to Michal Monselise - Online change point detection using spark streaming (20)

V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
 
Change Point Analysis
Change Point AnalysisChange Point Analysis
Change Point Analysis
 
Exponential software reliability using SPRT: MLE
Exponential software reliability using SPRT: MLEExponential software reliability using SPRT: MLE
Exponential software reliability using SPRT: MLE
 
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATADETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
DETECTION OF RELIABLE SOFTWARE USING SPRT ON TIME DOMAIN DATA
 
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data SetData Trend Analysis by Assigning Polynomial Function For Given Data Set
Data Trend Analysis by Assigning Polynomial Function For Given Data Set
 
Internship_presentation
Internship_presentationInternship_presentation
Internship_presentation
 
PPT.pdf internship demo on machine lerning
PPT.pdf internship demo on machine lerningPPT.pdf internship demo on machine lerning
PPT.pdf internship demo on machine lerning
 
IEEE Fuzzy system Title and Abstract 2016
IEEE Fuzzy system Title and Abstract 2016 IEEE Fuzzy system Title and Abstract 2016
IEEE Fuzzy system Title and Abstract 2016
 
accessible-streaming-algorithms
accessible-streaming-algorithmsaccessible-streaming-algorithms
accessible-streaming-algorithms
 
Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...Influence of time and length size feature selections for human activity seque...
Influence of time and length size feature selections for human activity seque...
 
ENGR 132 Final Project
ENGR 132 Final ProjectENGR 132 Final Project
ENGR 132 Final Project
 
Progressive Duplicate Detection
Progressive Duplicate DetectionProgressive Duplicate Detection
Progressive Duplicate Detection
 
report
reportreport
report
 
Machine learning and Internet of Things, the future of medical prevention
Machine learning and Internet of Things, the future of medical preventionMachine learning and Internet of Things, the future of medical prevention
Machine learning and Internet of Things, the future of medical prevention
 
Uber Data Analysis - SAS Project
Uber Data Analysis - SAS ProjectUber Data Analysis - SAS Project
Uber Data Analysis - SAS Project
 
part-4-structural-equation-modelling-qr.pptx
part-4-structural-equation-modelling-qr.pptxpart-4-structural-equation-modelling-qr.pptx
part-4-structural-equation-modelling-qr.pptx
 
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTINGPERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING
PERFORMANCE ANALYSIS OF HYBRID FORECASTING MODEL IN STOCK MARKET FORECASTING
 
Statistical Analysis
Statistical AnalysisStatistical Analysis
Statistical Analysis
 
Analyzing Performance Test Data
Analyzing Performance Test DataAnalyzing Performance Test Data
Analyzing Performance Test Data
 
Towards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory SensingTowards Enabling Probabilistic Databases for Participatory Sensing
Towards Enabling Probabilistic Databases for Participatory Sensing
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake BolewskiThe TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Recently uploaded

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Michal Monselise - Online change point detection using spark streaming

  • 1. Online Change Point Detection Using Spark Streaming Michal Monselise PyData Seattle 2017
  • 2. A Bit About Me  Senior Data Scientist @ DS-IQ  Instructor at UW PCE
  • 3. Applications of Change Point Detection Genomics Marketing Finance
  • 4. Genomics  Change points in genomics can help identify genes that are damaged  We can look at the change points in aCGH profiles of genes to identify their involvement in cancer or other diseases
  • 7. Streaming  Most change point detection algorithms give us a static analysis of the change points in the data  However, with the advances in streaming, there is a greater need for real time analysis for change points  This project utilizes Spark Streaming since it is best suited for this use case
  • 8. Methodology  There are two different ways of performing online change point detection  Non parametric  Bayesian
  • 9. 𝑖𝑓 𝑑 𝑊𝑖𝑛𝑑𝑜𝑤1, 𝑊𝑖𝑛𝑑𝑜𝑤2 > 𝛼 𝑡ℎ𝑒𝑛 𝑤𝑒 ℎ𝑎𝑣𝑒 𝑎 𝑐ℎ𝑎𝑛𝑔𝑒 𝑝𝑜𝑖𝑛𝑡 It is up to us to define d Non Parametric Algorithm
  • 10.  Bayes Theorem:  This means that we create a posterior distribution that is proportional to the product of the prior distribution (the initial assumption) and the likelihood (the observed data) Bayesian Statistics
  • 11. Bayesian Algorithm  Using a Bayesian algorithm means that we will update the distribution after every new observation and obtain a new posterior distribution  This solution works well with real time change point detection
  • 12. Time Between Change Points  The main idea here is that instead of modeling the data, we model the time till a change point.  𝑃 𝑟𝑡 𝑟𝑡−1 = 𝐻 𝑟𝑡−1 + 1 1 − 𝐻 𝑟𝑡−1 + 1 0 𝑖𝑓 𝑟𝑡 = 0 𝑖𝑓 𝑟𝑡 = 𝑟𝑡−1 + 1 𝑂𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒  Where 𝐻 𝜏 is the hazard function 𝐻 𝜏 = 𝑃 𝑔𝑎𝑝(𝑔=𝜏) 𝑡=𝜏 ∞ 𝑃 𝑔𝑎𝑝(𝑔=𝑡)  𝑃𝑔𝑎𝑝 is the discrete a priori probability distribution over the interval between change points
  • 13. Limitations of Change Point Detection in Streaming  This algorithm is an improvement since it accounts for change points in real time rather than identifying change points after the fact  However, we still need to store all of the previous observations in order to compute the probability of a change point in real time  In a streaming scenario, this can get out of hand
  • 14.  We do not expect to use all possible data to create a posterior distribution  For example, we do not expect to have all possible temperature measurements from the dawn of time when we try to evaluate global warming Limitations of Change Point Detection in Streaming
  • 15.  In Spark Streaming, we can only retain data between micro batches using an accumulator variable  We can use a default numeric accumulator or create our own accumulator class Accumulators in Spark
  • 16. Accumulators in Spark  Therefore we create an accumulator queue with a size limit  We use this queue to compute the posterior distribution using only the latest MAX_QUEUE_SIZE observations
  • 17.  We would like to measure when it is hot or cold  One approach is to find when weather shifts abruptly  Change Point Detection is a good methodology to find these shifts Case Study – Weather
  • 18.  We look at the temperature for zip code 98006 (Bellevue, WA) between September 2016 and February 2017  Identifying the change points in the data will help us identify abrupt temperature shifts Case Study – Weather
  • 19. Case Study – Weather  Our hypothesis is that sudden changes in weather influence consumer behavior more than absolute temperatures  We also think that these changes influence behavior more than relative temperature (number of standard deviations from the mean)
  • 20. References  Adams RP, MacKay DJC. Bayesian online change point detection. University of Cambridge Technical Report; 2007.  Kifer D., Ben-David S, and Gehrke J. Detecting change in data streams. In Proceedings of the International Conference on Very Large Data Bases, Toronto, Canada, pp. 180–191; 2004.  Muggeo VMR, Adelfio G. Efficient change point detection for genomic sequences of continuous measurements. Bioinformatics 27 (2): 161-166; 2011.

Editor's Notes

  1. My name is Michal Monselise and today I will be talking about online change point detection.
  2. Change point detection is an area of research with many applications. It is used to detect changes in genomics to detect cancer, It can be used in marketing to detect a change in customer churn It also super useful for financial data
  3. Comparative genomic hybridization is a molecular cytogenetic method for analysing copy number variations (CNVs) relative to ploidy level in the DNA of a test sample compared to a reference sample, without the need for culturing cells. The aim of this technique is to quickly and efficiently compare two genomic DNA samples arising from two sources, which are most often closely related, because it is suspected that they contain differences in terms of either gains or losses of either whole chromosomes or subchromosomal regions (a portion of a whole chromosome).
  4. Churn Analysis Sudden increase in purchase of product
  5. The hazard function can be described by f(t)/S(t) S(t) = 1-F(t) F(t)=cumulative distribution function