Michal Monselise - Online change point detection using spark streaming

Online Change Point
Detection Using Spark
Streaming
Michal Monselise
PyData Seattle 2017

A Bit About Me
 Senior Data Scientist @ DS-IQ
 Instructor at UW PCE

Applications of Change Point Detection
Genomics
Marketing
Finance

Genomics
 Change points in genomics can help identify genes that are
damaged
 We can look at the change points in aCGH profiles of genes to
identify their involvement in cancer or other diseases

Streaming
 Most change point detection algorithms give us a static analysis
of the change points in the data
 However, with the advances in streaming, there is a greater
need for real-time analysis of change points
 This project uses Spark Streaming, which is well suited to
this use case

Methodology
 There are two different ways of performing online change point
detection
 Non parametric
 Bayesian

Non Parametric Algorithm
If $d(\text{Window}_1, \text{Window}_2) > \alpha$, then we have a change point
It is up to us to define $d$
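
A minimal sketch of the window comparison rule, assuming d is the absolute difference between the two window means. The slide leaves d open, so this choice of distance, the window size, the sample values, and the names `mean_distance` and `detect_change_points` are illustrative assumptions only.

```python
from collections import deque
from statistics import mean


def mean_distance(window1, window2):
    """Assumed distance d: absolute difference between the window means."""
    return abs(mean(window1) - mean(window2))


def detect_change_points(stream, window_size, alpha, d=mean_distance):
    """Yield each index at which d(Window1, Window2) > alpha.

    Window2 holds the most recent `window_size` observations and
    Window1 holds the `window_size` observations just before them.
    """
    recent = deque(maxlen=2 * window_size)
    for i, x in enumerate(stream):
        recent.append(x)
        if len(recent) < 2 * window_size:
            continue  # not enough data to fill both windows yet
        values = list(recent)
        window1, window2 = values[:window_size], values[window_size:]
        if d(window1, window2) > alpha:
            yield i


# Made-up series with a single jump from 10 to 20.
data = [10.0] * 6 + [20.0] * 6
flagged = list(detect_change_points(data, window_size=3, alpha=5.0))
```

The rule fires at several consecutive indices while the jump passes through the two windows, so in practice one might report only the first alarm of each run.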

Bayesian Statistics
 Bayes' Theorem: $P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$
 This means that we create a posterior distribution that is
proportional to the product of the prior distribution (the initial
assumption) and the likelihood (the observed data)

Bayesian Algorithm
 Using a Bayesian algorithm means that we will update the
distribution after every new observation and obtain a new
posterior distribution
 This solution works well with real time change point detection
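
To illustrate updating the distribution after every observation, here is a small sketch that assumes the data are Normal with a known observation variance and an unknown mean, so the posterior stays Normal and can be updated in closed form. The model, the function name `update_normal_mean`, and the sample values are assumptions made for illustration, not the model used in the talk.

```python
def update_normal_mean(prior_mean, prior_var, x, obs_var):
    """Return the posterior (mean, variance) of an unknown Normal mean
    after one observation x with known observation variance obs_var."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + x / obs_var)
    return post_mean, post_var


# Vague prior, then one update per incoming observation (made-up values).
mean_est, var_est = 0.0, 100.0
for x in [9.8, 10.1, 10.3, 9.9]:
    # The posterior from the previous step becomes the prior for this one.
    mean_est, var_est = update_normal_mean(mean_est, var_est, x, obs_var=1.0)
```

Each pass through the loop needs only the previous posterior and the new observation, which is what makes this style of update a natural fit for data that arrives one point at a time.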

Time Between Change Points
 The main idea here is that instead of modeling the data, we model the
time till a change point.
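 $P(r_t \mid r_{t-1}) = \begin{cases} H(r_{t-1} + 1) & \text{if } r_t = 0 \\ 1 - H(r_{t-1} + 1) & \text{if } r_t = r_{t-1} + 1 \\ 0 & \text{otherwise} \end{cases}$
 Where $H(\tau)$ is the hazard function $H(\tau) = \frac{P_{gap}(g = \tau)}{\sum_{t=\tau}^{\infty} P_{gap}(g = t)}$
 $P_{gap}$ is the discrete a priori probability distribution over the interval
between change points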
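
The hazard function and the transition probabilities can be sketched directly from their definitions. The code below assumes the gap prior is supplied as a finite list, `p_gap[g]`, and treats any run length beyond that list as a certain change point. Both are simplifying assumptions, and the function names and example prior are hypothetical.

```python
def hazard(p_gap, tau):
    """H(tau) = P_gap(g = tau) / sum over t >= tau of P_gap(g = t).

    p_gap[g] is the prior probability of a gap of length g; the list is
    assumed to cover the whole support of the gap distribution.
    """
    if tau >= len(p_gap):
        return 1.0  # assumption: past the support, a change is certain
    tail = sum(p_gap[tau:])
    return p_gap[tau] / tail if tail > 0 else 1.0


def run_length_transition(p_gap, r_prev, r_next):
    """P(r_t = r_next | r_{t-1} = r_prev), following the piecewise rule."""
    h = hazard(p_gap, r_prev + 1)
    if r_next == 0:
        return h  # a change point occurs
    if r_next == r_prev + 1:
        return 1.0 - h  # no change point occurs
    return 0.0  # no other transition is possible


# Hypothetical prior: gaps of length 1, 2, or 3 between change points.
p_gap = [0.0, 0.5, 0.25, 0.25]
p_change = run_length_transition(p_gap, r_prev=0, r_next=0)
p_no_change = run_length_transition(p_gap, r_prev=0, r_next=1)
```

For any previous value, the two nonzero transition probabilities sum to one, which is a quick sanity check on a chosen gap prior.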

Limitations of Change Point Detection in
Streaming
 This algorithm is an improvement since it accounts for change
points in real time rather than identifying change points after
the fact
 However, we still need to store all of the previous observations
in order to compute the probability of a change point in real
time
 In a streaming scenario, the stored history grows without bound

 We do not expect to use all possible data to create a posterior
distribution
 For example, we do not expect to have all possible temperature
measurements from the dawn of time when we try to evaluate
global warming

Accumulators in Spark
 In Spark Streaming, we retain data between micro-batches
using an accumulator variable
 We can use a default numeric accumulator or create our own
accumulator class

 Therefore we create an accumulator queue with a size limit
 We use this queue to compute the posterior distribution using
only the latest MAX_QUEUE_SIZE observations
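
A rough sketch of the size-limited queue, written in plain Python so it runs without a Spark installation. The class mirrors the two methods that PySpark's `AccumulatorParam` interface expects (`zero` and `addInPlace`). The class name `BoundedQueueParam`, the queue size, and the temperature values are hypothetical, and this is not the talk's actual implementation.

```python
from collections import deque

MAX_QUEUE_SIZE = 3  # kept small here so the trimming is easy to see


class BoundedQueueParam:
    """Merge logic shaped like a custom PySpark accumulator parameter.

    In PySpark this would subclass pyspark.accumulators.AccumulatorParam,
    which expects the same two methods: zero and addInPlace.
    """

    def zero(self, initial_value):
        # Start from an empty bounded queue regardless of the initial value.
        return deque(maxlen=MAX_QUEUE_SIZE)

    def addInPlace(self, queue, new_values):
        # Append the new observations; the deque silently drops the
        # oldest ones once MAX_QUEUE_SIZE is exceeded.
        queue.extend(new_values)
        return queue


param = BoundedQueueParam()
history = param.zero(None)
# Three made-up micro-batches of temperature readings.
for micro_batch in ([50.1, 51.3], [49.8], [61.2, 60.7]):
    history = param.addInPlace(history, micro_batch)
```

After the loop, `history` holds only the latest `MAX_QUEUE_SIZE` observations, which is the data the posterior would be computed from. In a real Spark job the order in which partial results are merged may not match arrival order, so an implementation would likely need to carry timestamps with each observation.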

Case Study – Weather
 We would like to measure when it is hot or cold
 One approach is to find when weather shifts abruptly
 Change Point Detection is a good methodology to find these
shifts

 We look at the temperature for zip code 98006 (Bellevue, WA)
between September 2016 and February 2017
 Identifying the change points in the data will help us identify
abrupt temperature shifts

 Our hypothesis is that sudden changes in weather influence
consumer behavior more than absolute temperatures
 We also think that these changes influence behavior more than
relative temperature (number of standard deviations from the
mean)

References
 Adams RP, MacKay DJC. Bayesian online change point
detection. University of Cambridge Technical Report; 2007.
 Kifer D, Ben-David S, Gehrke J. Detecting change in data
streams. In: Proceedings of the International Conference on Very
Large Data Bases, Toronto, Canada, pp. 180-191; 2004.
 Muggeo VMR, Adelfio G. Efficient change point detection for
genomic sequences of continuous measurements.
Bioinformatics 27 (2): 161-166; 2011.
