DTW (Dynamic Time Warping) is a well-known method for finding patterns within a time series. It can find a pattern even when the data are distorted. It can be used to detect trends in sales, defects in industrial machine signals, patterns in medical electrocardiograms, DNA…
Most implementations are very slow, but a very efficient open-source implementation (best paper, SIGKDD 2012) is available in C. It can easily be ported to other languages, such as Java, so that it can then easily be used in Flink.
We present the slight modifications we made so that it can be used with Flink at even greater scale to return the top-K best matches on past data or streaming data.
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Christophe Salperwyck, Akamai Technologies
1. FlinkDTW
Time-series pattern search at scale
using Dynamic Time Warping
Christophe Salperwyck, Akamai Kraków
https://www.linkedin.com/in/christophesalperwyck/
4. Akamai is a leader in Content Delivery Network (CDN) services for
delivering, optimizing and securing online content and business
applications.
Founded in 1998 and rooted in MIT technology.
Solving Internet congestion with math not hardware.
9. Bio
Software engineer who moved to data mining/science/analytics/... ⇒ PhD in stream mining (2012)
A Survey on Supervised Classification on Data Streams
Interest in Machine Learning at scale
https://www.slideshare.net/Hadoop_Summit/courbospark-decision-tree-for-timeseries-on-spark
Used to work on Hadoop/HBase to store plant sensor data / time series (1,000B points - 100TB)
https://www.slideshare.net/HadoopSummit/a-data-lake-and-a-data-lab-to-optimize-operations-and-safety-within-a-nuclear-fleet
Online learning - combining decision stump/tree to pick the best ad
https://www.slideshare.net/ChristopheSalperwyck/explorationexploitation2011salperwyckurvoycontr01
10. 1. Time series?
2. DTW: Dynamic Time Warping
3. Bibliography on Fast/Parallelized DTW
4. Use-case
5. Benchmark
6. Conclusion and Future work
15. Many data are time series!
➔ IoT/IIoT data
➔ Sales/Marketing data
➔ Monitoring data: data centers, network...
➔ Science/Medicine: Earthquake, EEG, ECG, DNA...
➔ Social network: likes over time per specific category
➔ ...
16. What is a time series?
Wikipedia:
"A time series is a series of data points indexed in time order."
In Flink world:
<seriesId, timestamp, value> ⇒ Tuple3<String, Long, Double>
17. Time series preprocessing / cleaning?
➔ Outliers
➔ Removing abnormal periods (too many missing values...)
➔ Filling gaps (with last value, interpolation...)
➔ Removing seasonality
➔ Subsampling if needed
➔ Transformations (FFT...)
➔ ...
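As an illustration of the gap-filling step listed above, here is a minimal Java sketch of the two strategies the slide names (last value and interpolation). The class `Preprocess`, its method names, and the use of `NaN` to mark a missing value are assumptions made for this example, not part of the talk's code.

```java
import java.util.Arrays;

/** Hypothetical helper for this example; NaN marks a missing value. */
final class Preprocess {

    /** Replaces each NaN with the last known value; leading NaNs are left untouched. */
    static double[] forwardFill(double[] values) {
        double[] out = Arrays.copyOf(values, values.length);
        for (int i = 1; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = out[i - 1];
            }
        }
        return out;
    }

    /** Fills NaN runs that lie between two known values by linear interpolation. */
    static double[] interpolate(double[] values) {
        double[] out = Arrays.copyOf(values, values.length);
        int lastKnown = -1;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                continue;
            }
            if (lastKnown >= 0 && i - lastKnown > 1) {
                double step = (out[i] - out[lastKnown]) / (i - lastKnown);
                for (int j = lastKnown + 1; j < i; j++) {
                    out[j] = out[lastKnown] + step * (j - lastKnown);
                }
            }
            lastKnown = i;
        }
        return out;
    }
}
```

For the input `{1.0, NaN, NaN, 4.0}`, `forwardFill` yields `{1.0, 1.0, 1.0, 4.0}` and `interpolate` yields `{1.0, 2.0, 3.0, 4.0}`.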
24. UCR DTW - best KDD paper 2012
Searching and mining trillions of time series subsequences under dynamic time warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon
Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh
KDD '12
https://www.cs.ucr.edu/~eamonn/UCRsuite.html
An influential paper on gesture recognition
on multi-touch screens laments that “DTW
took 128.6 minutes to run the 14,400 tests
for a given subject’s 160 gestures.” However,
we can reproduce the results in under 3
seconds.
25. Why is it so fast? Early abandoning!
R ⇒ Warping band (path deviation)
n ⇒ Query length
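To make the two parameters concrete, the following Java sketch computes DTW restricted to a band of width `R` around the diagonal and gives up as soon as a whole row of the cost matrix exceeds the best distance found so far. It is a simplified illustration, not the UCR suite's code, which applies further optimizations. The class name `BandedDtw`, the squared-difference cost, and the equal-length requirement are assumptions of this sketch.

```java
import java.util.Arrays;

/** Hypothetical sketch of band-constrained DTW with early abandoning. */
final class BandedDtw {

    /**
     * DTW with squared-difference cost, restricted to cells where |i - j| <= r.
     * Returns +infinity once every cell of a row is above bestSoFar, because
     * costs are non-negative and the final distance can then only be larger.
     */
    static double distance(double[] q, double[] c, int r, double bestSoFar) {
        int n = q.length;
        if (c.length != n) {
            throw new IllegalArgumentException("query and candidate must have equal length");
        }
        double[] prev = new double[n + 1];
        double[] curr = new double[n + 1];
        Arrays.fill(prev, Double.POSITIVE_INFINITY);
        prev[0] = 0.0; // empty prefix aligned with empty prefix
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, Double.POSITIVE_INFINITY);
            int from = Math.max(1, i - r);
            int to = Math.min(n, i + r);
            double rowMin = Double.POSITIVE_INFINITY;
            for (int j = from; j <= to; j++) {
                double d = q[i - 1] - c[j - 1];
                double best = Math.min(prev[j - 1], Math.min(prev[j], curr[j - 1]));
                curr[j] = d * d + best;
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > bestSoFar) {
                return Double.POSITIVE_INFINITY; // early abandon
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n];
    }
}
```

With `r = 0` the band collapses onto the diagonal and the result equals the squared Euclidean distance; a larger `r` lets the path deviate further, at a higher computation cost per candidate.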
26. Related work
Spark (2015) - large scale
Parallelization of Searching and Mining Time Series Data using Dynamic Time Warping
Shabib, Ahmed & Narang, Anish & Prasad Niddodi, Chaitra & Das, Madhura & Pradeep, Rachita & Shenoy, Varun &
Auradkar, Prafullata & TS, Vignesh & Sitaram, Dinkar.
International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Flink (2019) - fast detection
Time Series Similarity Search for Streaming Data in Distributed Systems
Ziehn, Ariane & Charfuelan Oliva, Marcela & Hemsen, Holmer & Markl, Volker.
Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference. Data Analytics Solutions for Real-Life Applications.
28. Grid frequency: regulation
In the event of a frequency variation consisting of a downward
ramp of Δf = 50 mHz in 10 s followed by a stabilised regime,
where the programmed Frequency Containment Reserve is
greater than K.Δf, the Generation Unit must release:
- 50% of the expected variation K.Δf in 20 s for Reserve Entities made up of Thermal
Generation Units (in 100 s for Reserve Entities made up of Hydroelectric Generation Units);
- 90% of the expected variation K.Δf in 60 s for Reserve Entities made up of Thermal
Generation Units (in 300 s for Reserve Entities made up of Hydroelectric Generation Units).
https://www.next-kraftwerke.com/energy-blog/who-is-disrupting-the-utility-frequency
http://clients.rte-france.com/htm/an/offre/telecharge/20140101_Regles_SSY_approuvees_an.pdf
http://clients.rte-france.com/htm/fr/offre/telecharge/20181026_Regles_services_systeme_frequence.pdf
https://www.mainsfrequency.com/frequ_info_en.php
32. Some stats on pruning
We almost never compute the full DTW!
Example:
Pruned by LB_KimFL: 95%
Pruned by LB_Keogh: 5%
Full DTW Calculation: 0.008%
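The sketch below shows simplified versions of the two lower bounds named in these stats. Each is cheap to compute and never exceeds the band-constrained DTW distance (with squared-difference cost), so a candidate whose bound is already above the best distance so far can be skipped without running DTW. The class `LowerBounds`, the `canPrune` helper, and the naive envelope computation are assumptions of this example; the original implementation is more elaborate.

```java
/** Hypothetical, simplified lower bounds for equal-length query q and candidate c. */
final class LowerBounds {

    /** First/last-point bound: any warping path must align both end points. */
    static double lbKimFL(double[] q, double[] c) {
        int n = q.length;
        double first = q[0] - c[0];
        if (n == 1) {
            return first * first;
        }
        double last = q[n - 1] - c[n - 1];
        return first * first + last * last;
    }

    /** LB_Keogh: distance from c to the [lower, upper] envelope of q within band r. */
    static double lbKeogh(double[] q, double[] c, int r) {
        int n = q.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double upper = Double.NEGATIVE_INFINITY;
            double lower = Double.POSITIVE_INFINITY;
            for (int j = Math.max(0, i - r); j <= Math.min(n - 1, i + r); j++) {
                upper = Math.max(upper, q[j]);
                lower = Math.min(lower, q[j]);
            }
            double d = 0.0;
            if (c[i] > upper) {
                d = c[i] - upper;
            } else if (c[i] < lower) {
                d = lower - c[i];
            }
            sum += d * d;
        }
        return sum;
    }

    /** Cascade: try the cheapest bound first, then the tighter one. */
    static boolean canPrune(double[] q, double[] c, int r, double bestSoFar) {
        return lbKimFL(q, c) > bestSoFar || lbKeogh(q, c, r) > bestSoFar;
    }
}
```

Ordering the checks from cheapest to most expensive is what lets most candidates be rejected before the full DTW computation is ever reached.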
33. Some issues
We have to handle:
➔ change of partitions in the code
➔ search at partition splits (not to lose any detections)
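One common way to avoid losing detections at a split is to let each chunk overlap the next one by the query length minus one point. The Java sketch below computes such ranges; it illustrates the idea only and is not necessarily how the talk's code handles splits. The class `OverlapSplit` and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: index ranges for chunks that overlap at the splits. */
final class OverlapSplit {

    /**
     * Returns {start, endExclusive} pairs. Each chunk owns chunkSize start
     * positions and is extended by queryLength - 1 points, so every
     * subsequence of length queryLength lies entirely inside one chunk.
     */
    static List<int[]> ranges(int seriesLength, int chunkSize, int queryLength) {
        if (chunkSize < 1 || queryLength < 1) {
            throw new IllegalArgumentException("chunkSize and queryLength must be >= 1");
        }
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < seriesLength; start += chunkSize) {
            int end = Math.min(seriesLength, start + chunkSize + queryLength - 1);
            out.add(new int[] {start, end});
        }
        return out;
    }
}
```

For a series of 10 points, chunks of 4 and a query of length 3, this gives the ranges [0, 6), [4, 10) and [8, 10); the last one is shorter than the query and therefore contains no candidate.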
41. Streaming issues
➔ Jumping (tumbling) windows ⇒ we might miss some detections at the junction between two windows
➔ This can be fixed using sliding windows, but for large sliding windows,
"evict" on the CountEvictor is slow.
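The effect at the junction can be quantified with a small plain-Java sketch (no Flink dependency) that counts how many pattern start positions fall entirely inside at least one window. The class `WindowCoverage`, its parameters, and the choice to truncate the final window at the end of the data are assumptions made for this example.

```java
/** Hypothetical sketch: which pattern start positions are seen by count windows. */
final class WindowCoverage {

    /**
     * Counts the start positions s such that the subsequence
     * [s, s + queryLength) lies completely inside at least one window.
     * slide == window models jumping windows; slide < window models sliding ones.
     */
    static int coveredStarts(int total, int window, int slide, int queryLength) {
        if (slide < 1) {
            throw new IllegalArgumentException("slide must be >= 1");
        }
        boolean[] covered = new boolean[total];
        for (int w = 0; w < total; w += slide) {
            int end = Math.min(total, w + window); // last window may be truncated
            for (int s = w; s + queryLength <= end; s++) {
                covered[s] = true;
            }
        }
        int count = 0;
        for (boolean b : covered) {
            if (b) {
                count++;
            }
        }
        return count;
    }
}
```

With 12 points, a window of 6 and a pattern of length 3, jumping windows (slide 6) cover 8 of the 10 possible start positions, while a slide of 4 (window size minus pattern length plus one) covers all 10.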
46. Conclusion
➔ The original algorithm is really fast! ⇒ easy to use as is and to take
advantage of Flink directly
➔ Can be used on massive past data very efficiently
➔ Can be used on streaming data but would need some tweaking for
better performance on small windows
47. Future work
➔ Dynamically change the patterns using a stream of updates on the patterns
➔ Use Flink for pre-filtering windows (min/max, CEP...)
➔ Continue testing on Kubernetes cluster
➔ Optimization for smaller windows?
➔ Use Fold function instead of Process?
48. Which Flink function to use?
ProcessWindowFunction
A ProcessWindowFunction gets an Iterable containing all the elements of the window,
and a Context object with access to time and state information, which enables it to
provide more flexibility than other window functions. This comes at the cost of
performance and resource consumption, because elements cannot be incrementally
aggregated but instead need to be buffered internally until the window is considered
ready for processing.
FoldFunction
A FoldFunction specifies how an input element of the window is combined with an
element of the output type. The FoldFunction is incrementally called for each element
that is added to the window and the current output value. The first element is combined
with a pre-defined initial value of the output type.
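To illustrate the difference between the two contracts quoted above, here is a plain-Java sketch (no Flink dependency) of a fold-style accumulator that keeps the K smallest distances one element at a time, so the window's elements never need to be buffered. The class `TopKFold` and its methods are hypothetical; whether the full DTW search can be expressed this way is the open question raised in the future work.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch of an incremental (fold-style) top-K accumulator. */
final class TopKFold {

    /** Pre-defined initial value: an empty max-heap, largest distance on top. */
    static PriorityQueue<Double> initial() {
        return new PriorityQueue<>(Comparator.<Double>reverseOrder());
    }

    /** Combines one incoming distance with the accumulator, keeping at most k values. */
    static PriorityQueue<Double> fold(PriorityQueue<Double> acc, double distance, int k) {
        acc.add(distance);
        if (acc.size() > k) {
            acc.poll(); // drop the current largest distance
        }
        return acc;
    }

    /** Folds all distances in arrival order and returns the k smallest, ascending. */
    static double[] topK(double[] distances, int k) {
        PriorityQueue<Double> acc = initial();
        for (double d : distances) {
            acc = fold(acc, d, k);
        }
        return acc.stream().mapToDouble(Double::doubleValue).sorted().toArray();
    }
}
```

The accumulator never holds more than `k + 1` values, whereas a process-style function would receive the whole window as an `Iterable` before producing any result.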
51. Online machine learning with Flink and MOA
Blogpost: https://moa.cms.waikato.ac.nz/moa-with-apache-flink/
GitHub repo: https://github.com/csalperwyck/
- moa.flink.traintest:
- Train a model on a stream, test/deploy it on another one
- Flink takes care of pushing model updates: CoFlatMapFunction
- moa.flink.ozabag:
- Train many models in parallel (Random Forest for example)
- Dynamic scaling should work on this kind of workload!