FlinkDTW
Time-series pattern search at scale
using Dynamic Time Warping
Christophe Salperwyck, Akamai Kraków
https://www.linkedin.com/in/christophesalperwyck/
A few words about Akamai
Akamai is a leader in Content Delivery Network (CDN) services for
delivering, optimizing and securing online content and business
applications.
Founded in 1998 and rooted in MIT technology.
Solving Internet congestion with math not hardware.
Akamai Intelligent Edge
(HTTP+HTTPS+DNS+MQTT)
240,000+ Servers
1300+ Cities
140+ Countries on 7 Continents
82.4 Tbps Peaks
A few words about me
Bio
Software engineer who moved to data mining/science/analytics/... ⇒ PhD in stream mining (2012)
A Survey on Supervised Classification on Data Streams
Interest in Machine Learning at scale
https://www.slideshare.net/Hadoop_Summit/courbospark-decision-tree-for-timeseries-on-spark
Used to work on Hadoop/HBase to store plant sensor time series (1,000B points, 100 TB)
https://www.slideshare.net/HadoopSummit/a-data-lake-and-a-data-lab-to-optimize-operations-and-safety-within-a-nuclear-fleet
Online learning - combining decision stump/tree to pick the best ad
https://www.slideshare.net/ChristopheSalperwyck/explorationexploitation2011salperwyckurvoycontr01
1. Time series?
2. DTW: Dynamic Time Warping
3. Bibliography on fast/parallelized DTW
4. Use-case
5. Benchmark
6. Conclusion and future work
Time series?
Many data are time series!
➔ IoT/IIoT data
➔ Sales/Marketing data
➔ Monitoring data: data centers, network...
➔ Science/Medicine: Earthquake, EEG, ECG, DNA...
➔ Social network: likes over time per specific category
➔ ...
What is a time series?
Wikipedia:
"A time series is a series of data points indexed in time order."
In Flink world:
<seriesId, timestamp, value> ⇒ Tuple3<String, Long, Double>
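As an illustration of the <seriesId, timestamp, value> layout, here is a minimal plain-Python sketch (not Flink code). The `Point` type, the `group_by_series` helper and the millisecond timestamps are assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Point(NamedTuple):
    series_id: str
    timestamp: int   # assumed unit: epoch milliseconds
    value: float


def group_by_series(points):
    """Return {series_id: [values ordered by timestamp]}."""
    grouped = defaultdict(list)
    for p in sorted(points, key=lambda p: (p.series_id, p.timestamp)):
        grouped[p.series_id].append(p.value)
    return dict(grouped)


points = [
    Point("grid", 2000, 49.98),
    Point("grid", 1000, 50.01),
    Point("cpu", 1000, 0.35),
]
series = group_by_series(points)
```

Grouping by series id mirrors what keying a stream by its first tuple field would do: each series is then searched independently.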
Time series preprocessing / cleaning?
➔ Outliers
➔ Removing abnormal periods (too many missing values...)
➔ Filling gaps (with last value, interpolation...)
➔ Removing seasonality
➔ Subsampling if needed
➔ Transformations (FFT...)
➔ ...
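Two of the steps above (filling gaps with the last value, subsampling) can be sketched in a few lines of plain Python. Representing a missing value as `None` and the helper names are assumptions for this example.

```python
def fill_gaps_with_last_value(values):
    """Replace None by the most recent known value (leading gaps stay None)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def subsample(values, step):
    """Keep one point out of every `step` points."""
    return values[::step]


raw = [50.0, None, None, 49.9, 50.1, None]
clean = fill_gaps_with_last_value(raw)
coarse = subsample(clean, 2)
```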
http://sites.music.mcgill.ca/orchestration/files/2013/08/Esling_2012_ACMComputSurv.pdf
Time series mining
Pattern search using DTW
Dynamic Time Warping
Pattern search
Which distance to use?
DTW algorithm
Searching and mining trillions of time series subsequences under dynamic time warping – Rakthanmanon et al. SIGKDD 2012
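For readers without the slide figure, a minimal sketch of the classic DTW recurrence with a warping band follows. It is a textbook dynamic-programming version in plain Python, not the optimized UCR implementation; the squared point cost, equal-length inputs and the function name are assumptions.

```python
import math


def dtw_distance(q, c, r):
    """DTW between equal-length sequences q and c with squared point cost.

    The warping path is restricted to |i - j| <= r (band of half-width r).
    Only two rows of the cost matrix are kept in memory.
    """
    n = len(q)
    assert len(c) == n, "this sketch assumes equal lengths"
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]
```

With r = 0 the result reduces to the squared Euclidean distance; a wider band lets the path absorb small time shifts. For example, `dtw_distance([0, 0, 1, 2], [0, 1, 2, 2], 0)` is 2.0 while the same call with r = 1 gives 0.0.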
Fast/Parallelized DTW
How to speed it up?
UCR DTW - best KDD paper 2012
Searching and mining trillions of time series subsequences under dynamic time warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon
Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh
KDD '12
https://www.cs.ucr.edu/~eamonn/UCRsuite.html
An influential paper on gesture recognition
on multi-touch screens laments that “DTW
took 128.6 minutes to run the 14,400 tests
for a given subject’s 160 gestures.” However,
we can reproduce the results in under 3
seconds.
Why is it so fast? Early abandoning!
R ⇒ Warping band (path deviation)
n ⇒ Query length
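One form of early abandoning can be sketched as follows: since all point costs are non-negative and every warping path crosses every row of the cost matrix, the final distance can never be smaller than the minimum of the current row. Once that minimum exceeds the best distance found so far, the candidate can be dropped. This is a simplified plain-Python illustration; the function name and the use of `math.inf` as the "abandoned" marker are assumptions, and the UCR suite applies further optimizations not shown here.

```python
import math


def dtw_early_abandon(q, c, r, best_so_far):
    """Banded DTW (squared cost) that stops once it cannot beat best_so_far.

    Returns math.inf when the computation is abandoned.
    """
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        # Every path goes through this row, so its minimum is a lower bound.
        if min(curr) > best_so_far:
            return math.inf
        prev = curr
    return prev[n]
```

A full banded computation touches roughly n × (2R + 1) cells; abandoning after the first few rows is what keeps the average cost far below that.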
Related work
Spark (2015) - large scale
Parallelization of Searching and Mining Time Series Data using Dynamic Time Warping
Shabib, Ahmed & Narang, Anish & Prasad Niddodi, Chaitra & Das, Madhura & Pradeep, Rachita & Shenoy, Varun &
Auradkar, Prafullata & TS, Vignesh & Sitaram, Dinkar.
International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Flink (2019) - fast detection
Time Series Similarity Search for Streaming Data in Distributed Systems
Ziehn, Ariane & Charfuelan Oliva, Marcela & Hemsen, Holmer & Markl, Volker.
Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference. Data Analytics Solutions for Real-Life Applications.
Use case
utilities: grid frequency
In the event of a frequency variation consisting of a downward
ramp of Δf = 50 mHz in 10 s followed by a stabilised regime,
where the programmed Frequency Containment Reserve is
greater than K.Δf, the Generation Unit must release:
- 50% of the expected variation K.Δf in 20 s for Reserve Entities made up of Thermal
Generation Units (in 100 s for Reserve Entities made up of Hydroelectric Generation Units);
- 90% of the expected variation K.Δf in 60 s for Reserve Entities made up of Thermal
Generation Units (in 300 s for Reserve Entities made up of Hydroelectric Generation Units).
https://www.next-kraftwerke.com/energy-blog/who-is-disrupting-the-utility-frequency
http://clients.rte-france.com/htm/an/offre/telecharge/20140101_Regles_SSY_approuvees_an.pdf
http://clients.rte-france.com/htm/fr/offre/telecharge/20181026_Regles_services_systeme_frequence.pdf
https://www.mainsfrequency.com/frequ_info_en.php
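The frequency event described in the rule (a 50 mHz downward ramp over 10 s followed by a stabilised regime) could be turned into a query pattern along these lines. This is a hypothetical sketch: the 50 Hz nominal frequency, the 1 s sampling, the 90 s plateau and the function name are assumptions, not the pattern actually used in the experiments.

```python
def ramp_then_plateau(nominal_hz=50.0, drop_hz=0.050, ramp_s=10, plateau_s=90):
    """Query pattern at 1 point per second: linear ramp down, then flat."""
    ramp = [nominal_hz - drop_hz * t / ramp_s for t in range(ramp_s + 1)]
    plateau = [nominal_hz - drop_hz] * plateau_s
    return ramp + plateau


pattern = ramp_then_plateau()
```

With these defaults the pattern has 101 points, which is in the same range as the 100 s to 500 s pattern sizes reported in the experiments.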
Grid frequency: regulation
Experiments
Data: open data
https://www.nationalgrideso.com/balancing-services/frequency-response-services/historic-frequency-data
Data size: 168M points (~5 years at 1 s resolution)
Time to read the data: 45 seconds (AVRO)
Pattern search: < 1 minute (pattern size from 100s to 500s)
Search speed: > 3M points/second
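A quick consistency check of these figures, assuming "168M" means 168 million points sampled once per second and taking the 3M points/second lower bound as the search rate:

```python
points = 168_000_000          # data size from the slide
rate = 3_000_000              # points per second (lower bound from the slide)

search_seconds = points / rate                 # 56.0 s, consistent with "< 1 minute"
years = points / (365.25 * 24 * 3600)          # about 5.3 years at 1 point per second
```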
Results: 1% warping
Results: 10% warping
Some stats on pruning
We almost never compute the full DTW!
Example:
Pruned by LB_KimFL: 95%
Pruned by LB_Keogh: 5%
Full DTW Calculation: 0.008%
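The cascade behind these numbers can be sketched as below: a cheap bound on the first and last points, then an envelope bound, and the full DTW only for the few survivors. This is a simplified plain-Python illustration. The reduced LB_KimFL (first and last point only, valid for queries of length 2 or more), the absence of z-normalization and all function names are assumptions, so the pruning ratios it produces will not match the slide.

```python
import math


def lb_kim_fl(q, c):
    # Any warping path aligns the first points together and the last points together.
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2


def lb_keogh(q, c, r):
    # Distance from each candidate point to the band-r envelope of the query.
    n, total = len(q), 0.0
    for i, x in enumerate(c):
        window = q[max(0, i - r):min(n, i + r + 1)]
        upper, lower = max(window), min(window)
        if x > upper:
            total += (x - upper) ** 2
        elif x < lower:
            total += (x - lower) ** 2
    return total


def dtw(q, c, r):
    n = len(q)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - r), min(n, i + r) + 1):
            curr[j] = (q[i - 1] - c[j - 1]) ** 2 + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]


def search(series, q, r, threshold):
    """Return (match start indices, count of candidates stopped at each stage)."""
    stats = {"lb_kim_fl": 0, "lb_keogh": 0, "full_dtw": 0}
    matches, n = [], len(q)
    for start in range(len(series) - n + 1):
        c = series[start:start + n]
        if lb_kim_fl(q, c) > threshold:
            stats["lb_kim_fl"] += 1
        elif lb_keogh(q, c, r) > threshold:
            stats["lb_keogh"] += 1
        else:
            stats["full_dtw"] += 1
            if dtw(q, c, r) <= threshold:
                matches.append(start)
    return matches, stats
```

Both bounds never exceed the banded DTW distance, so pruning on them cannot discard a true match.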
Some issues
We have to handle:
➔ change of partitions in the code
➔ search at partition splits (not to lose any detections)
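One common way to avoid losing detections at partition splits is to let consecutive chunks overlap by n - 1 points, where n is the query length. The sketch below shows the idea in plain Python; it is an assumption about how the issue could be handled, not a description of the code used in the talk.

```python
def split_with_overlap(series, chunk_size, n):
    """Yield (offset, chunk) pairs; consecutive chunks share n - 1 points."""
    step = chunk_size - (n - 1)
    if step <= 0:
        raise ValueError("chunk_size must be at least the query length n")
    for offset in range(0, len(series) - n + 1, step):
        yield offset, series[offset:offset + chunk_size]


# Every subsequence start position is covered by exactly one chunk.
series = list(range(10))
n = 3
starts = [
    offset + i
    for offset, chunk in split_with_overlap(series, chunk_size=4, n=n)
    for i in range(len(chunk) - n + 1)
]
```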
Streaming performance benchmark
Settings
Hardware
Laptop ⇒ i5, 16 GB, SSD…
Tested with 1 thread (unfortunately!)
Some first tests with Kubernetes
Streaming Data Generator
Random walk: ~2M to 3M points/second
(without Flink 20M+/s)
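A random-walk source of this kind can be sketched as a plain Python generator. The uniform step distribution, the starting value of 0 and the parameter names are assumptions; the generator used in the benchmark is not shown in the slides.

```python
import random


def random_walk(n_points, step=1.0, seed=None):
    """Yield n_points values, each differing from the previous by at most `step`."""
    rng = random.Random(seed)
    value = 0.0
    for _ in range(n_points):
        value += rng.uniform(-step, step)
        yield value


walk = list(random_walk(1000, step=0.5, seed=42))
```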
Performance for a jumping window of 10,000
Performance for a jumping window of 1M
Pruning vs window size
(R=10%, Query size=100)
Panels: window = 10,000 and window = 1,000,000
Performance for a jumping window of 10,000
Performance for a jumping window of 1M
Streaming issues
➔ Jumping windows ⇒ we might miss some detections at the junction
➔ Can be fixed using sliding windows but for large sliding windows,
"evict" on the CountEvictor is slow.
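To see why jumping (non-overlapping) windows can miss detections, the sketch below lists which subsequence start positions are never fully contained in a single window. It is plain Python with made-up sizes; the function name is an assumption.

```python
def starts_seen_by_jumping_windows(total, window, n):
    """Start positions of length-n subsequences that fit inside one window."""
    seen = set()
    for w_start in range(0, total - window + 1, window):
        seen.update(range(w_start, w_start + window - n + 1))
    return seen


total, window, n = 30, 10, 4
all_starts = set(range(total - n + 1))
missed = sorted(all_starts - starts_seen_by_jumping_windows(total, window, n))
```

Here `missed` is [7, 8, 9, 17, 18, 19]: n - 1 positions are lost at each junction, so the loss matters most when the window is small relative to the query.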
Kubernetes configuration
40 Flink task managers deployed (4GB per TM)
First runs in k8s...
Chart annotations: 100M, 4 billion (8M/s)
One VM performance
F4
Conclusion and future work
Conclusion
➔ The original algorithm is really fast ⇒ easy to use as is and to run
directly on Flink
➔ Can be used on massive historical data very efficiently
➔ Can be used on streaming data, but would need some tuning for
better performance on small windows
Future work
➔ Dynamically change the patterns using a stream of pattern updates
➔ Use Flink for pre filtering windows (min/max, CEP...)
➔ Continue testing on Kubernetes cluster
➔ Optimization for smaller windows?
➔ Use Fold function instead of Process?
Which Flink function to use?
ProcessWindowFunction
A ProcessWindowFunction gets an Iterable containing all the elements of the window,
and a Context object with access to time and state information, which enables it to
provide more flexibility than other window functions. This comes at the cost of
performance and resource consumption, because elements cannot be incrementally
aggregated but instead need to be buffered internally until the window is considered
ready for processing.
FoldFunction
A FoldFunction specifies how an input element of the window is combined with an
element of the output type. The FoldFunction is incrementally called for each element
that is added to the window and the current output value. The first element is combined
with a pre-defined initial value of the output type.
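The difference between the two styles can be illustrated outside Flink with a plain Python analogy: a "process" style function receives the whole buffered window, while a "fold" style function updates an accumulator one element at a time. The min/max statistic and the function names are assumptions chosen for the example; this is not Flink API code.

```python
from functools import reduce


def process_window(elements):
    """Buffered style: needs every element of the window before it can run."""
    values = list(elements)
    return (min(values), max(values))


def fold_step(acc, value):
    """Incremental style: combine the current output with one new element."""
    lo, hi = acc
    return (min(lo, value), max(hi, value))


window = [50.01, 49.97, 50.03, 49.95]
buffered = process_window(window)
incremental = reduce(fold_step, window, (float("inf"), float("-inf")))
```

Both give the same result, but the incremental version keeps only a constant-size accumulator. A statistic such as min/max fits that style, whereas a DTW search over a whole subsequence still needs the buffered points.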
That’s all folks!
Flink - Online ML with MOA
Online machine learning with Flink and MOA
Blogpost: https://moa.cms.waikato.ac.nz/moa-with-apache-flink/
GitHub repo: https://github.com/csalperwyck/
- moa.flink.traintest:
- Train a model on a stream, test/deploy it on another one
- Flink takes care of pushing model updates: CoFlatMapFunction
- moa.flink.ozabag:
- Train many models in parallel (Random Forest for example)
- Dynamic scaling should work on this kind of workload!

More Related Content

What's hot

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)Cory Cook
 
はじめてのパターン認識 第6章 後半
はじめてのパターン認識 第6章 後半はじめてのパターン認識 第6章 後半
はじめてのパターン認識 第6章 後半
Prunus 1350
 
Introduction to data analysis using python
Introduction to data analysis using pythonIntroduction to data analysis using python
Introduction to data analysis using python
Guido Luz Percú
 
Sliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデルSliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデル
ohken
 
PyMCがあれば,ベイズ推定でもう泣いたりなんかしない
PyMCがあれば,ベイズ推定でもう泣いたりなんかしないPyMCがあれば,ベイズ推定でもう泣いたりなんかしない
PyMCがあれば,ベイズ推定でもう泣いたりなんかしない
Toshihiro Kamishima
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Salah Amean
 
はじパタ11章 後半
はじパタ11章 後半はじパタ11章 後半
はじパタ11章 後半Atsushi Hayakawa
 
PRML6.4
PRML6.4PRML6.4
03 data mining : data warehouse
03 data mining : data warehouse03 data mining : data warehouse
03 data mining : data warehouse
Institute of Technology Telkom
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transfer
Roelof Pieters
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentation
Eleni Stamatelou
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたtkng
 
An overview of methods for data anonymization
An overview of methods for data anonymizationAn overview of methods for data anonymization
An overview of methods for data anonymization
arx-deidentifier
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
Hoang Nguyen
 
はじめてのパターン認識 第11章 11.1-11.2
はじめてのパターン認識 第11章 11.1-11.2はじめてのパターン認識 第11章 11.1-11.2
はじめてのパターン認識 第11章 11.1-11.2
Prunus 1350
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
Kent State University
 
「統計的学習理論」第1章
「統計的学習理論」第1章「統計的学習理論」第1章
「統計的学習理論」第1章
Kota Matsui
 
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
Valerii Klymchuk
 

What's hot (20)

Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)DBSCAN (2014_11_25 06_21_12 UTC)
DBSCAN (2014_11_25 06_21_12 UTC)
 
はじめてのパターン認識 第6章 後半
はじめてのパターン認識 第6章 後半はじめてのパターン認識 第6章 後半
はじめてのパターン認識 第6章 後半
 
Introduction to data analysis using python
Introduction to data analysis using pythonIntroduction to data analysis using python
Introduction to data analysis using python
 
Sliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデルSliced Wasserstein距離と生成モデル
Sliced Wasserstein距離と生成モデル
 
PyMCがあれば,ベイズ推定でもう泣いたりなんかしない
PyMCがあれば,ベイズ推定でもう泣いたりなんかしないPyMCがあれば,ベイズ推定でもう泣いたりなんかしない
PyMCがあれば,ベイズ推定でもう泣いたりなんかしない
 
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
Data Mining Concepts and Techniques, Chapter 10. Cluster Analysis: Basic Conc...
 
はじパタ11章 後半
はじパタ11章 後半はじパタ11章 後半
はじパタ11章 後半
 
PRML6.4
PRML6.4PRML6.4
PRML6.4
 
03 data mining : data warehouse
03 data mining : data warehouse03 data mining : data warehouse
03 data mining : data warehouse
 
Zero shot learning through cross-modal transfer
Zero shot learning through cross-modal transferZero shot learning through cross-modal transfer
Zero shot learning through cross-modal transfer
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentation
 
Confidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみたConfidence Weightedで ランク学習を実装してみた
Confidence Weightedで ランク学習を実装してみた
 
An overview of methods for data anonymization
An overview of methods for data anonymizationAn overview of methods for data anonymization
An overview of methods for data anonymization
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
はじめてのパターン認識 第11章 11.1-11.2
はじめてのパターン認識 第11章 11.1-11.2はじめてのパターン認識 第11章 11.1-11.2
はじめてのパターン認識 第11章 11.1-11.2
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
 
「統計的学習理論」第1章
「統計的学習理論」第1章「統計的学習理論」第1章
「統計的学習理論」第1章
 
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 4 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
 
05 Clustering in Data Mining
05 Clustering in Data Mining05 Clustering in Data Mining
05 Clustering in Data Mining
 

Similar to FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Christophe Salperwyck, Akamai Technologies

Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
Ian Foster
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
DataWorks Summit/Hadoop Summit
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
inside-BigData.com
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
Larry Smarr
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1blewington
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrations
inside-BigData.com
 
Ogce Workflow Suite
Ogce Workflow SuiteOgce Workflow Suite
Ogce Workflow Suitesmarru
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
Ian Foster
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
Robert Grossman
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
Tal Lavian Ph.D.
 
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
DataWorks Summit/Hadoop Summit
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache Spark
Databricks
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Big Data Spain
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
marpierc
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
Ian Foster
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
Florian Lautenschlager
 
Smart Manufacturing: CAE in the Cloud
Smart Manufacturing: CAE in the CloudSmart Manufacturing: CAE in the Cloud
Smart Manufacturing: CAE in the Cloud
Wolfgang Gentzsch
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
confluent
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Brian O'Neill
 

Similar to FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Christophe Salperwyck, Akamai Technologies (20)

Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores HPC Cluster Computing from 64 to 156,000 Cores 
HPC Cluster Computing from 64 to 156,000 Cores 
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1MIG 5th Data Centre Summit 2016 PTS Presentation v1
MIG 5th Data Centre Summit 2016 PTS Presentation v1
 
Applying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System IntegrationsApplying Cloud Techniques to Address Complexity in HPC System Integrations
Applying Cloud Techniques to Address Complexity in HPC System Integrations
 
Ogce Workflow Suite
Ogce Workflow SuiteOgce Workflow Suite
Ogce Workflow Suite
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
TeraGrid Communication and Computation
TeraGrid Communication and ComputationTeraGrid Communication and Computation
TeraGrid Communication and Computation
 
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
A Data Lake and a Data Lab to Optimize Operations and Safety within a nuclear...
 
Cooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache SparkCooperative Task Execution for Apache Spark
Cooperative Task Execution for Apache Spark
 
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
Smart Manufacturing: CAE in the Cloud
Smart Manufacturing: CAE in the CloudSmart Manufacturing: CAE in the Cloud
Smart Manufacturing: CAE in the Cloud
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
Flink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
Flink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
Flink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
Flink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
Flink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
Flink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
Flink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Flink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
Flink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
Flink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
Flink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
Flink Forward
 

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Christophe Salperwyck, Akamai Technologies

  • 1. FlinkDTW: Time-series pattern search at scale using Dynamic Time Warping. Christophe Salperwyck, Akamai Kraków. https://www.linkedin.com/in/christophesalperwyck/
  • 3. A few words about Akamai
  • 4. Akamai is a leader in Content Delivery Network (CDN) services for delivering, optimizing and securing online content and business applications. Founded in 1998 and rooted in MIT technology. Solving Internet congestion with math, not hardware.
  • 6. Akamai Intelligent Edge (HTTP+HTTPS+DNS+MQTT): 240,000+ servers; 1,300+ cities; 140+ countries on 7 continents; 82.4 Tbps peaks
  • 9. Bio: Software engineer who moved to data mining/science/analytics/... ⇒ PhD in stream mining (2012): "A Survey on Supervised Classification on Data Streams". Interest in machine learning at scale: https://www.slideshare.net/Hadoop_Summit/courbospark-decision-tree-for-timeseries-on-spark Used to work on Hadoop/HBase to store plant sensor time series (1,000B points, 100 TB): https://www.slideshare.net/HadoopSummit/a-data-lake-and-a-data-lab-to-optimize-operations-and-safety-within-a-nuclear-fleet Online learning, combining decision stumps/trees to pick the best ad: https://www.slideshare.net/ChristopheSalperwyck/explorationexploitation2011salperwyckurvoycontr01
  • 10. 1. Time series? 2. DTW: Dynamic Time Warping 3. Bibliography on fast/parallel DTW 4. Use case 5. Benchmark 6. Conclusion and future work
  • 15. A lot of data is time series! ➔ IoT/IIoT data ➔ Sales/Marketing data ➔ Monitoring data: data centers, network... ➔ Science/Medicine: earthquakes, EEG, ECG, DNA... ➔ Social networks: likes over time per specific category ➔ ...
  • 16. What is a time series? Wikipedia: "A time series is a series of data points indexed in time order." In the Flink world: <seriesId, timestamp, value> ⇒ Tuple3<String, Long, Double>
  • 17. Time series preprocessing / cleaning? ➔ Outliers ➔ Removing abnormal periods (too many missing values...) ➔ Filling gaps (with last value, interpolation...) ➔ Removing seasonality ➔ Subsampling if needed ➔ Transformations (FFT...) ➔ ...
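The "filling gaps with last value" step from slide 17 can be sketched in plain Java, without any Flink dependency. This is a minimal illustration, not the author's implementation: the class name `GapFiller`, the method `forwardFill`, and the choice of a regular output grid are assumptions made here, and the input timestamps are assumed to be sorted in ascending order.

```java
final class GapFiller {
    /**
     * Resamples a series onto a regular grid (hypothetical helper, not from the talk).
     * Each grid point takes the most recent observed value at or before it.
     * Assumes ts is sorted ascending and ts.length == values.length > 0.
     */
    static double[] forwardFill(long[] ts, double[] values, long step) {
        int n = (int) ((ts[ts.length - 1] - ts[0]) / step) + 1;
        double[] out = new double[n];
        int src = 0; // index of the last observation at or before the grid time
        for (int k = 0; k < n; k++) {
            long t = ts[0] + k * step;
            while (src + 1 < ts.length && ts[src + 1] <= t) {
                src++;
            }
            out[k] = values[src];
        }
        return out;
    }
}
```

For example, observations at timestamps 0, 1, 4, 5 with values 10, 11, 14, 15 and a step of 1 give 10, 11, 11, 11, 14, 15: the missing timestamps 2 and 3 repeat the value seen at timestamp 1.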
  • 19. Pattern search using DTW (Dynamic Time Warping)
  • 21. Which distance to use?
  • 22. DTW algorithm. Source: "Searching and mining trillions of time series subsequences under dynamic time warping", Rakthanmanon et al., SIGKDD 2012
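As a rough illustration of the DTW algorithm named on slide 22, the sketch below computes the classic dynamic-programming recurrence restricted to a warping band of half-width `r`. It is a textbook formulation under stated assumptions (squared point-wise cost, two rolling rows, no z-normalization), not the UCR suite code and not the FlinkDTW code; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class Dtw {
    /**
     * Banded DTW with squared point-wise cost (illustrative sketch).
     * r is the maximum allowed deviation |i - j| from the diagonal.
     */
    static double distance(double[] q, double[] c, int r) {
        int n = q.length;
        int m = c.length;
        // Widen the band if needed so the final cell (n, m) stays reachable.
        r = Math.max(r, Math.abs(n - m));
        double[] prev = new double[m + 1];
        double[] curr = new double[m + 1];
        Arrays.fill(prev, Double.POSITIVE_INFINITY);
        prev[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, Double.POSITIVE_INFINITY);
            int lo = Math.max(1, i - r);
            int hi = Math.min(m, i + r);
            for (int j = lo; j <= hi; j++) {
                double d = q[i - 1] - c[j - 1];
                // Cheapest way to reach (i, j): from above, from the left, or diagonally.
                double best = Math.min(prev[j], Math.min(curr[j - 1], prev[j - 1]));
                curr[j] = d * d + best;
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m];
    }
}
```

With `q = {0, 0, 1, 2}` and `c = {0, 1, 2, 2}`, a band of 0 forces a point-by-point comparison and yields 2.0, while a band of 1 lets the shifted ramp align and yields 0.0.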
  • 23. Fast/parallel DTW: how to speed it up?
  • 24. UCR DTW, best paper at KDD 2012: "Searching and mining trillions of time series subsequences under dynamic time warping", Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh, KDD '12. https://www.cs.ucr.edu/~eamonn/UCRsuite.html Quote from the paper: An influential paper on gesture recognition on multi-touch screens laments that “DTW took 128.6 minutes to run the 14,400 tests for a given subject’s 160 gestures.” However, we can reproduce the results in under 3 seconds.
  • 25. Why is it so fast? Early abandoning! R ⇒ warping band (path deviation), n ⇒ query length
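Early abandoning from slide 25 can be illustrated with the LB_Keogh lower bound: the query is wrapped in an upper/lower envelope of half-width R, and the distance from a candidate to that envelope is accumulated only until it exceeds the best distance found so far. The sketch below is a simplified version under assumptions made here (equal-length query and candidate, squared cost, no z-normalization, no reordering of points); the names `LbKeogh`, `envelope` and `lowerBound` are hypothetical.

```java
final class LbKeogh {
    /** Returns {lower, upper}: min and max of q over [i - r, i + r] for each i. */
    static double[][] envelope(double[] q, int r) {
        int n = q.length;
        double[] lower = new double[n];
        double[] upper = new double[n];
        for (int i = 0; i < n; i++) {
            double lo = q[i];
            double hi = q[i];
            for (int k = Math.max(0, i - r); k <= Math.min(n - 1, i + r); k++) {
                lo = Math.min(lo, q[k]);
                hi = Math.max(hi, q[k]);
            }
            lower[i] = lo;
            upper[i] = hi;
        }
        return new double[][] {lower, upper};
    }

    /**
     * Accumulates the squared distance from candidate c to the envelope.
     * Returns +infinity as soon as the partial sum exceeds bestSoFar (early abandon).
     */
    static double lowerBound(double[] lower, double[] upper, double[] c, double bestSoFar) {
        double sum = 0.0;
        for (int i = 0; i < c.length; i++) {
            double d = 0.0;
            if (c[i] > upper[i]) {
                d = c[i] - upper[i];
            } else if (c[i] < lower[i]) {
                d = lower[i] - c[i];
            }
            sum += d * d;
            if (sum > bestSoFar) {
                return Double.POSITIVE_INFINITY; // cannot beat the current best match
            }
        }
        return sum;
    }
}
```

The envelope costs O(n·R) in this naive form and is computed once per query, while each candidate costs at most O(n) and often much less once a good match has been found.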
  • 26. Related work. Spark (2015), large scale: "Parallelization of Searching and Mining Time Series Data using Dynamic Time Warping", Shabib, Ahmed & Narang, Anish & Prasad Niddodi, Chaitra & Das, Madhura & Pradeep, Rachita & Shenoy, Varun & Auradkar, Prafullata & TS, Vignesh & Sitaram, Dinkar. International Conference on Advances in Computing, Communications and Informatics (ICACCI). Flink (2019), fast detection: "Time Series Similarity Search for Streaming Data in Distributed Systems", Ziehn, Ariane & Charfuelan Oliva, Marcela & Hemsen, Holmer & Markl, Volker. Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference. Data Analytics Solutions for Real-Life Applications.
  • 27. Use case (utilities): grid frequency
  • 28. Grid frequency: regulation. In the event of a frequency variation consisting of a downward ramp of Δf = 50 mHz in 10 s followed by a stabilised regime, where the programmed Frequency Containment Reserve is greater than K.Δf, the Generation Unit must release: - 50% of the expected variation K.Δf in 20 s for Reserve Entities made up of Thermal Generation Units (in 100 s for Reserve Entities made up of Hydroelectric Generation Units); - 90% of the expected variation K.Δf in 60 s for Reserve Entities made up of Thermal Generation Units (in 300 s for Reserve Entities made up of Hydroelectric Generation Units). Sources: https://www.next-kraftwerke.com/energy-blog/who-is-disrupting-the-utility-frequency http://clients.rte-france.com/htm/an/offre/telecharge/20140101_Regles_SSY_approuvees_an.pdf http://clients.rte-france.com/htm/fr/offre/telecharge/20181026_Regles_services_systeme_frequence.pdf https://www.mainsfrequency.com/frequ_info_en.php
  • 29. Experiments. Data: open data, https://www.nationalgrideso.com/balancing-services/frequency-response-services/historic-frequency-data Data size: 168M points (~5 years at 1 s resolution). Time to read the data: 45 seconds (Avro). Pattern search: < 1 minute (pattern size from 100 s to 500 s). Search speed: > 3M points/second
  • 32. Some stats on pruning: we almost never compute the full DTW! Example: pruned by LB_KimFL: 95%; pruned by LB_Keogh: 5%; full DTW calculation: 0.008%
  • 33. Some issues. We have to handle: ➔ change of partitions in the code ➔ search at partition splits (so as not to lose any detections)
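The pruning statistics on slide 32 come from a cascade: a cheap bound is tried first, a tighter one next, and the full DTW runs only when both fail to rule the candidate out. The sketch below shows that control flow on a single in-memory series. It is a simplified illustration, not the UCR suite or FlinkDTW code: it omits z-normalization, uses a reduced LB_KimFL (first and last points only, valid for queries of length 2 or more), recomputes the envelope per candidate for brevity, and all names (`PruningCascade`, `counts`, `search`) are hypothetical.

```java
import java.util.Arrays;

final class PruningCascade {
    /** counts[0]: pruned by LB_KimFL, counts[1]: pruned by LB_Keogh, counts[2]: full DTW runs. */
    final long[] counts = new long[3];

    static double sq(double x) { return x * x; }

    /** Any warping path aligns first with first and last with last, so this never exceeds DTW. */
    static double lbKimFL(double[] q, double[] s, int off) {
        int n = q.length;
        return sq(q[0] - s[off]) + sq(q[n - 1] - s[off + n - 1]);
    }

    /** Squared distance from the candidate s[off .. off + n) to the query envelope of half-width r. */
    static double lbKeogh(double[] q, int r, double[] s, int off) {
        int n = q.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double lo = q[i], hi = q[i];
            for (int k = Math.max(0, i - r); k <= Math.min(n - 1, i + r); k++) {
                lo = Math.min(lo, q[k]);
                hi = Math.max(hi, q[k]);
            }
            double c = s[off + i];
            if (c > hi) sum += sq(c - hi);
            else if (c < lo) sum += sq(lo - c);
        }
        return sum;
    }

    /** Banded DTW between q and the candidate s[off .. off + n). */
    static double dtw(double[] q, double[] s, int off, int r) {
        int n = q.length;
        double[] prev = new double[n + 1];
        double[] curr = new double[n + 1];
        Arrays.fill(prev, Double.POSITIVE_INFINITY);
        prev[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, Double.POSITIVE_INFINITY);
            for (int j = Math.max(1, i - r); j <= Math.min(n, i + r); j++) {
                double best = Math.min(prev[j], Math.min(curr[j - 1], prev[j - 1]));
                curr[j] = sq(q[i - 1] - s[off + j - 1]) + best;
            }
            double[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[n];
    }

    /** Returns the offset of the best match of q in s, updating the pruning counters. */
    int search(double[] q, double[] s, int r) {
        double best = Double.POSITIVE_INFINITY;
        int bestAt = -1;
        for (int off = 0; off + q.length <= s.length; off++) {
            // A lower bound >= best means this candidate cannot be strictly better.
            if (lbKimFL(q, s, off) >= best) { counts[0]++; continue; }
            if (lbKeogh(q, r, s, off) >= best) { counts[1]++; continue; }
            counts[2]++;
            double d = dtw(q, s, off, r);
            if (d < best) { best = d; bestAt = off; }
        }
        return bestAt;
    }
}
```

Once a good match has lowered `best`, most later candidates are rejected by the first check, which is the behaviour the slide's percentages describe.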
  • 35. Settings. Hardware: laptop (i5, 16 GB, SSD…), tested with 1 thread (unfortunately!); some first tests with Kubernetes. Streaming data generator: random walk, ~2M to 3M points/second (20M+/s without Flink)
  • 36. Performance for a jumping window of 10,000
  • 37. Performance for a jumping window of 1M
  • 38. Pruning vs window size (R = 10%, query size = 100). Panels: window = 10,000; window = 1,000,000
  • 39. Performance for a jumping window of 10,000
  • 40. Performance for a jumping window of 1M
  • 41. Streaming issues ➔ Jumping windows ⇒ we might miss some detections at the junction ➔ Can be fixed using sliding windows, but for large sliding windows, "evict" on the CountEvictor is slow.
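The junction problem on slide 41 can be quantified with a small, Flink-free sketch: a match of length n is only detectable if all n points fall inside one window, so with jumping (non-overlapping) windows the last n − 1 start positions of every window are lost. Stepping windows by `windowSize - (n - 1)` instead closes the gap. The class and method names below are hypothetical, and the sketch only counts coverage; it does not model Flink's window assigners or evictors.

```java
final class WindowCoverage {
    /**
     * Counts the subsequence start positions (for a query of the given length)
     * that lie entirely inside at least one full window.
     */
    static int coveredStarts(int streamLength, int windowSize, int step, int queryLength) {
        boolean[] covered = new boolean[streamLength];
        for (int w = 0; w + windowSize <= streamLength; w += step) {
            for (int i = w; i + queryLength <= w + windowSize; i++) {
                covered[i] = true;
            }
        }
        int count = 0;
        for (boolean c : covered) {
            if (c) count++;
        }
        return count;
    }
}
```

For a stream of 98 points, windows of 10 and a query of length 3, there are 96 possible start positions: jumping windows (step 10) cover 72 of them, while a step of 8 (an overlap of n − 1 = 2 points) covers all 96.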
  • 42. Kubernetes configuration: 40 Flink task managers deployed (4 GB per TM)
  • 43. First runs in k8s... 100M; 4 billion (8M/s)
  • 46. Conclusion ➔ The original algorithm is really fast! ⇒ easy to use as is and to take advantage of Flink directly ➔ Can be used on massive historical data very efficiently ➔ Can be used on streaming data but would need some tuning for better performance on small windows
  • 47. Future work ➔ Dynamically change the patterns using a stream of updates on the patterns ➔ Use Flink for pre-filtering windows (min/max, CEP...) ➔ Continue testing on a Kubernetes cluster ➔ Optimization for smaller windows? ➔ Use a Fold function instead of Process?
  • 48. Which Flink function to use? ProcessWindowFunction: A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing. FoldFunction: A FoldFunction specifies how an input element of the window is combined with an element of the output type. The FoldFunction is incrementally called for each element that is added to the window and the current output value. The first element is combined with a pre-defined initial value of the output type.
  • 50. Flink: online ML with MOA
  • 51. Online machine learning with Flink and MOA. Blog post: https://moa.cms.waikato.ac.nz/moa-with-apache-flink/ GitHub repo: https://github.com/csalperwyck/ - moa.flink.traintest: train a model on a stream, test/deploy it on another one; Flink takes care of pushing model updates (CoFlatMapFunction) - moa.flink.ozabag: train many models in parallel (Random Forest, for example); dynamic scaling should work on this kind of workload!