4. Genomics
Change points in genomics can help identify genes that are
damaged
We can look at the change points in array CGH (aCGH) profiles of
genes to assess their involvement in cancer or other diseases
7. Streaming
Most change point detection algorithms give us a static (offline)
analysis of the change points in the data
However, with the advances in streaming, there is a greater
need for real-time analysis of change points
This project utilizes Spark Streaming since it is best suited for
this use case
8. Methodology
There are two different ways of performing online change point
detection
Nonparametric
Bayesian
9. Nonparametric Algorithm
If $d(\mathrm{Window}_1, \mathrm{Window}_2) > \alpha$, then we have a change point
It is up to us to define d
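A minimal sketch of the two-window rule in Python. The slides leave d open, so the distance used here (the largest gap between the two windows' empirical CDFs, in the style of a Kolmogorov-Smirnov statistic) is an assumption, as are the names `ks_distance` and `detect_change` and the sample values.

```python
from bisect import bisect_right


def ks_distance(window1, window2):
    """Largest gap between the empirical CDFs of two windows (an assumed choice of d)."""
    s1, s2 = sorted(window1), sorted(window2)
    gap = 0.0
    for x in s1 + s2:
        cdf1 = bisect_right(s1, x) / len(s1)  # share of window1 values <= x
        cdf2 = bisect_right(s2, x) / len(s2)  # share of window2 values <= x
        gap = max(gap, abs(cdf1 - cdf2))
    return gap


def detect_change(window1, window2, alpha):
    """Flag a change point when d(Window1, Window2) exceeds the threshold alpha."""
    return ks_distance(window1, window2) > alpha


reference = [10.1, 9.8, 10.3, 10.0, 9.9]   # made-up reference window
similar = [10.2, 9.7, 10.1, 10.0, 9.9]     # made-up window from the same regime
shifted = [15.2, 14.9, 15.1, 15.3, 15.0]   # made-up window after a level shift

print(detect_change(reference, similar, alpha=0.5))
print(detect_change(reference, shifted, alpha=0.5))
```

Any other distance between windows could be swapped in for `ks_distance`; the threshold alpha then has to be chosen on the scale of that distance.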
10. Bayesian Statistics
Bayes' Theorem: $P(\theta \mid x) = \dfrac{P(x \mid \theta)\, P(\theta)}{P(x)}$
This means that we create a posterior distribution that is
proportional to the product of the prior distribution (the initial
assumption) and the likelihood (the observed data)
11. Bayesian Algorithm
Using a Bayesian algorithm means that we will update the
distribution after every new observation and obtain a new
posterior distribution
This approach works well for real-time change point detection
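To make the update-after-every-observation idea concrete, here is a small Python sketch. It assumes a Beta prior with a Bernoulli (0/1) likelihood, chosen only because the update is easy to follow; the function names and the sample observations are hypothetical and not taken from the project.

```python
def update_beta(prior, observation):
    """Return the posterior Beta(a, b) after one 0/1 observation.

    With a Beta prior and a Bernoulli likelihood, prior times likelihood
    is again a Beta, so the update is a simple count increment.
    """
    a, b = prior
    return (a + observation, b + (1 - observation))


def beta_mean(params):
    """Mean of a Beta(a, b) distribution."""
    a, b = params
    return a / (a + b)


posterior = (1, 1)  # uniform prior, an assumption for this example
for x in [1, 1, 0, 1]:
    # The posterior from the previous step acts as the prior for this step.
    posterior = update_beta(posterior, x)
    print(posterior, round(beta_mean(posterior), 3))
```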
12. Time Between Change Points
The main idea here is that instead of modeling the data directly, we
model the run length $r_t$, the time since the last change point.
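$$
P(r_t \mid r_{t-1}) =
\begin{cases}
H(r_{t-1} + 1) & \text{if } r_t = 0 \\
1 - H(r_{t-1} + 1) & \text{if } r_t = r_{t-1} + 1 \\
0 & \text{otherwise}
\end{cases}
$$
Where $H(\tau)$ is the hazard function
$$
H(\tau) = \frac{P_{gap}(g = \tau)}{\sum_{t=\tau}^{\infty} P_{gap}(g = t)}
$$
$P_{gap}$ is the discrete a priori probability distribution over the interval
between change points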
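The hazard function and the transition probabilities can be sketched in a few lines of Python. The gap distribution below (uniform over gaps of 1 to 4 steps) is a made-up example with finite support, and the names `hazard` and `run_length_prior` are hypothetical.

```python
def hazard(p_gap, tau):
    """H(tau) = P_gap(g = tau) / sum over t >= tau of P_gap(g = t).

    p_gap maps a gap length to its prior probability; gaps not listed
    have probability zero (finite support is an assumption of this sketch).
    """
    tail = sum(p for g, p in p_gap.items() if g >= tau)
    return p_gap.get(tau, 0.0) / tail if tail > 0 else 0.0


def run_length_prior(r_t, r_prev, p_gap):
    """P(r_t | r_{t-1}) following the piecewise definition."""
    h = hazard(p_gap, r_prev + 1)
    if r_t == 0:
        return h          # a change point occurs and the run length resets
    if r_t == r_prev + 1:
        return 1.0 - h    # no change point, so the run grows by one
    return 0.0            # every other transition is impossible


p_gap = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}  # illustrative prior over gaps
for r_prev in range(3):
    reset = run_length_prior(0, r_prev, p_gap)
    grow = run_length_prior(r_prev + 1, r_prev, p_gap)
    print(r_prev, reset, grow)
```

With this prior the hazard rises as the run gets longer, because a change point becomes more and more overdue. A geometric gap distribution would instead give a constant hazard.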
13. Limitations of Change Point Detection in
Streaming
This algorithm is an improvement since it accounts for change
points in real time rather than identifying change points after
the fact
However, we still need to store all of the previous observations
in order to compute the probability of a change point in real
time
In a streaming scenario, the stored history grows without bound
14. Limitations of Change Point Detection in
Streaming
We do not expect to use all possible data to create a posterior
distribution
For example, we do not expect to have all possible temperature
measurements from the dawn of time when we try to evaluate
global warming
15. Accumulators in Spark
In Spark Streaming, we can retain data between micro
batches using an accumulator variable
We can use a default numeric accumulator or create our own
accumulator class
16. Accumulators in Spark
Therefore we create an accumulator queue with a size limit
We use this queue to compute the posterior distribution using
only the latest MAX_QUEUE_SIZE observations
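The size-limited queue can be sketched in plain Python with `collections.deque`. This is a stand-in for the idea, not the project's Spark accumulator class: the value of `MAX_QUEUE_SIZE`, the function names, the sample readings, and the Gaussian model with known observation variance are all assumptions made for illustration.

```python
from collections import deque

MAX_QUEUE_SIZE = 5  # assumed value; the slides do not state the actual limit


def make_queue():
    # A deque with maxlen drops the oldest item automatically once it is full.
    return deque(maxlen=MAX_QUEUE_SIZE)


def posterior_mean_and_var(queue, prior_mean=0.0, prior_var=100.0, obs_var=1.0):
    """Normal-Normal posterior for the mean, using only the queued observations."""
    n = len(queue)
    precision = 1.0 / prior_var + n / obs_var
    mean = (prior_mean / prior_var + sum(queue) / obs_var) / precision
    return mean, 1.0 / precision


queue = make_queue()
for micro_batch in ([50.0, 51.0, 49.0], [50.5, 49.5, 62.0], [61.0, 63.0]):
    queue.extend(micro_batch)  # state carried from one micro batch to the next
    print(list(queue), posterior_mean_and_var(queue))
```

Memory use stays fixed at `MAX_QUEUE_SIZE` observations no matter how long the stream runs. The trade-off is that observations older than the queue no longer influence the posterior.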
17. Case Study – Weather
We would like to measure when it is hot or cold
One approach is to find when the weather shifts abruptly
Change point detection is a good method for finding these
shifts
18. Case Study – Weather
We look at the temperature for ZIP code 98006 (Bellevue, WA)
between September 2016 and February 2017
Identifying the change points in the data will help us identify
abrupt temperature shifts
19. Case Study – Weather
Our hypothesis is that sudden changes in weather influence
consumer behavior more than absolute temperatures
We also think that these changes influence behavior more than
relative temperature (number of standard deviations from the
mean)
20. References
Adams RP, MacKay DJC. Bayesian online change point
detection. University of Cambridge Technical Report; 2007.
Kifer D, Ben-David S, Gehrke J. Detecting change in data
streams. In: Proceedings of the International Conference on Very
Large Data Bases, Toronto, Canada, pp. 180–191; 2004.
Muggeo VMR, Adelfio G. Efficient change point detection for
genomic sequences of continuous measurements.
Bioinformatics 27 (2): 161-166; 2011.
Editor's Notes
My name is Michal Monselise and today I will be talking about online change point detection.
Change point detection is an area of research with many applications.
It is used in genomics to detect changes associated with cancer.
It can be used in marketing to detect a change in customer churn.
It is also very useful for financial data.
Comparative genomic hybridization is a molecular cytogenetic method for analysing copy number variations (CNVs) relative to ploidy level in the DNA of a test sample compared to a reference sample, without the need for culturing cells. The aim of this technique is to quickly and efficiently compare two genomic DNA samples arising from two sources, which are most often closely related, because it is suspected that they contain differences in terms of either gains or losses of either whole chromosomes or subchromosomal regions (a portion of a whole chromosome).
Churn Analysis
Sudden increase in purchase of product
The hazard function can be described by $h(t) = f(t)/S(t)$, where $S(t) = 1 - F(t)$ is the survival function, $F(t)$ is the cumulative distribution function, and $f(t)$ is the corresponding density.