FlinkDTW
Time-series pattern search at scale
using Dynamic Time Warping
Christophe Salperwyck, Akamai Kraków
https://www.linkedin.com/in/christophesalperwyck/
A few words about Akamai
Akamai is a leader in Content Delivery Network (CDN) services for
delivering, optimizing and securing online content and business
applications.
Founded in 1998 and rooted in MIT technology.
Solving Internet congestion with math not hardware.
Akamai Intelligent Edge
(HTTP+HTTPS+DNS+MQTT)
240,000+ Servers
1300+ Cities
140+ Countries on 7 Continents
82.4 Tbps Peaks
A few words about me
Bio
Software engineer who moved to data mining/science/analytics/... ⇒ PhD in stream mining (2012)
A Survey on Supervised Classification on Data Streams
Interest in Machine Learning at scale
https://www.slideshare.net/Hadoop_Summit/courbospark-decision-tree-for-timeseries-on-spark
Used to work on Hadoop/HBase to store plant sensor time series (1,000B points - 100TB)
https://www.slideshare.net/HadoopSummit/a-data-lake-and-a-data-lab-to-optimize-operations-and-safety-within-a-nuclear-fleet
Online learning - combining decision stump/tree to pick the best ad
https://www.slideshare.net/ChristopheSalperwyck/explorationexploitation2011salperwyckurvoycontr01
1. Time series?
2. DTW: Dynamic Time Warping
3. Bibliography on fast/parallelized DTW
4. Use-case
5. Benchmark
6. Conclusion and future work
Time series?
Many data are time series!
➔ IoT/IIoT data
➔ Sales/Marketing data
➔ Monitoring data: data centers, network...
➔ Science/Medicine: Earthquake, EEG, ECG, DNA...
➔ Social network: likes over time per specific category
➔ ...
What is a time series?
Wikipedia:
"A time series is a series of data points indexed in time order."
In Flink world:
<seriesId, timestamp, value> ⇒ Tuple3<String, Long, Double>
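As a rough illustration of this triple outside Flink, the sketch below models one observation as a named tuple and regroups a mixed stream of observations into one ordered list of values per series. The names `Point` and `group_by_series` are hypothetical, and the millisecond timestamp unit is an assumption, not something the slides specify.

```python
from collections import defaultdict
from typing import NamedTuple


class Point(NamedTuple):
    """One observation: the <seriesId, timestamp, value> triple."""
    series_id: str
    timestamp: int  # assumed to be epoch milliseconds
    value: float


def group_by_series(points):
    """Return {series_id: [values ordered by timestamp]}."""
    grouped = defaultdict(list)
    for p in points:
        grouped[p.series_id].append(p)
    return {
        sid: [p.value for p in sorted(pts, key=lambda p: p.timestamp)]
        for sid, pts in grouped.items()
    }
```

Grouping by series identifier plays the same role here as keying a stream by `seriesId` before any per-series processing.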
Time series preprocessing / cleaning?
➔ Outliers
➔ Removing abnormal periods (too many missing values...)
➔ Filling gaps (with last value, interpolation...)
➔ Removing seasonality
➔ Subsampling if needed
➔ Transformations (FFT...)
➔ ...
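Two of the steps above, filling gaps with the last value and subsampling, can be sketched in a few lines. This is a minimal illustration under the assumption that samples arrive as `(timestamp, value)` pairs sorted by timestamp; the function names are hypothetical and this is not the preprocessing used in the talk.

```python
def fill_gaps_last_value(samples, step):
    """Resample onto a regular grid, carrying the last known value forward.

    samples: list of (timestamp, value) pairs sorted by timestamp.
    step: grid spacing, in the same unit as the timestamps.
    """
    if not samples:
        return []
    filled = []
    idx = 0
    last_value = samples[0][1]
    t = samples[0][0]
    end = samples[-1][0]
    while t <= end:
        # Consume every sample at or before the current grid time.
        while idx < len(samples) and samples[idx][0] <= t:
            last_value = samples[idx][1]
            idx += 1
        filled.append((t, last_value))
        t += step
    return filled


def subsample(samples, factor):
    """Keep one sample out of every `factor`."""
    return samples[::factor]
```

Filling before subsampling keeps the grid regular, which matters because the pattern search compares sequences point by point.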
http://sites.music.mcgill.ca/orchestration/files/2013/08/Esling_2012_ACMComputSurv.pdf
Time series mining
Pattern search using DTW
Dynamic Time Warping
Pattern search
Which distance to use?
DTW algorithm
Searching and mining trillions of time series subsequences under dynamic time warping – Rakthanmanon et al. SIGKDD 2012
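The slide shows the algorithm as a figure; as a textual companion, here is a minimal sketch of the classic dynamic-programming formulation of DTW, restricted to a band around the diagonal. It assumes two sequences of equal length, a squared point-to-point cost, and a band half-width given in points. The function name is hypothetical, and this is not the optimised UCR implementation.

```python
import math


def dtw_distance(query, candidate, band):
    """DTW between two equal-length sequences with squared point costs.

    band: maximum allowed |i - j| between aligned indices (0 means a
    plain point-by-point comparison).
    """
    n = len(query)
    if n != len(candidate):
        raise ValueError("sequences must have the same length")
    inf = math.inf
    prev = [inf] * (n + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for i in range(1, n + 1):
        curr = [inf] * (n + 1)
        for j in range(max(1, i - band), min(n, i + band) + 1):
            cost = (query[i - 1] - candidate[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor cells.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]
```

Only two rows of the cost matrix are kept in memory, so memory use grows linearly with the query length while the run time grows with the query length times the band width.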
Fast/Parallelized DTW
How to speed it up?
UCR DTW - best KDD paper 2012
Searching and mining trillions of time series subsequences under dynamic time warping
Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon
Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh
KDD '12
https://www.cs.ucr.edu/~eamonn/UCRsuite.html
An influential paper on gesture recognition
on multi-touch screens laments that “DTW
took 128.6 minutes to run the 14,400 tests
for a given subject’s 160 gestures.” However,
we can reproduce the results in under 3
seconds.
Why is it so fast? Early abandoning!
R ⇒ Warping band (path deviation)
n ⇒ Query length
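To make early abandoning concrete, the sketch below computes an LB_Keogh-style lower bound and stops as soon as the running sum exceeds the best distance found so far. It assumes a squared cost and a band expressed in points (for example `band = int(R * n)` when R is given as a fraction of the query length n). The function names are hypothetical and the code is a simplified reading of the idea, not the UCR suite itself.

```python
def envelope(query, band):
    """Lower and upper envelope of the query within +/- band points."""
    n = len(query)
    lower, upper = [], []
    for i in range(n):
        window = query[max(0, i - band):min(n, i + band + 1)]
        lower.append(min(window))
        upper.append(max(window))
    return lower, upper


def lb_keogh_early_abandon(candidate, lower, upper, best_so_far):
    """Accumulate squared excursions outside the envelope.

    Stops as soon as the partial sum exceeds best_so_far; the partial
    sum returned in that case is still a valid lower bound.
    """
    total = 0.0
    for c, lo, hi in zip(candidate, lower, upper):
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (lo - c) ** 2
        if total > best_so_far:
            break  # this candidate cannot beat the current best match
    return total
```

The envelope depends only on the query, so it is computed once and reused for every candidate subsequence.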
Related work
Spark (2015) - large scale
Parallelization of Searching and Mining Time Series Data using Dynamic Time Warping
Shabib, Ahmed & Narang, Anish & Prasad Niddodi, Chaitra & Das, Madhura & Pradeep, Rachita & Shenoy, Varun &
Auradkar, Prafullata & TS, Vignesh & Sitaram, Dinkar.
International Conference on Advances in Computing, Communications and Informatics (ICACCI)
Flink (2019) - fast detection
Time Series Similarity Search for Streaming Data in Distributed Systems
Ziehn, Ariane & Charfuelan Oliva, Marcela & Hemsen, Holmer & Markl, Volker.
Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference. Data Analytics Solutions for Real-Life Applications.
Use case
utilities: grid frequency
In the event of a frequency variation consisting of a downward
ramp of Δf = 50 mHz in 10 s followed by a stabilised regime,
where the programmed Frequency Containment Reserve is
greater than K.Δf, the Generation Unit must release:
- 50% of the expected variation K.Δf in 20 s for Reserve Entities made up of Thermal
Generation Units (in 100 s for Reserve Entities made up of Hydroelectric Generation Units);
- 90% of the expected variation K.Δf in 60 s for Reserve Entities made up of Thermal
Generation Units (in 300 s for Reserve Entities made up of Hydroelectric Generation Units).
https://www.next-kraftwerke.com/energy-blog/who-is-disrupting-the-utility-frequency
http://clients.rte-france.com/htm/an/offre/telecharge/20140101_Regles_SSY_approuvees_an.pdf
http://clients.rte-france.com/htm/fr/offre/telecharge/20181026_Regles_services_systeme_frequence.pdf
https://www.mainsfrequency.com/frequ_info_en.php
Grid frequency: regulation
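The frequency event described in the regulation can be turned into a query pattern for the search. The sketch below builds a 1-second-resolution pattern: a linear downward ramp of 50 mHz over 10 s followed by a stable plateau. The 50 Hz nominal value is the European grid frequency; the 90 s plateau length and the function name are assumptions chosen for illustration, not values taken from the talk.

```python
def frequency_drop_pattern(nominal_hz=50.0, drop_hz=0.050,
                           ramp_s=10, plateau_s=90):
    """One value per second: linear ramp down, then a stabilised regime."""
    ramp = [nominal_hz - drop_hz * t / ramp_s for t in range(ramp_s)]
    plateau = [nominal_hz - drop_hz] * plateau_s  # assumed plateau length
    return ramp + plateau
```

With the default arguments the pattern holds 100 points, which sits at the lower end of the 100 s to 500 s pattern sizes used in the experiments.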
Experiments
Data: open data
https://www.nationalgrideso.com/balancing-services/frequency-response-services/historic-frequency-data
Data size: 168M points (~5 years at 1 s resolution)
Time to read the data: 45 seconds (AVRO)
Pattern search: < 1 minute (pattern size from 100s to 500s)
Search speed: > 3M points/second
Results: 1% warping
Results: 10% warping
Some stats on pruning
We almost never compute the full DTW!
Example:
Pruned by LB_KimFL: 95%
Pruned by LB_Keogh: 5%
Full DTW Calculation: 0.008%
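The pruning statistics come from a cascade: each candidate subsequence is first tested against cheap lower bounds, and the full DTW is computed only when none of them rules it out. The sketch below is a simplified version of that cascade with counters per stage. It assumes a query of at least two points, skips the z-normalisation step used by the UCR suite, and reduces LB_KimFL to the first and last points; all function names are hypothetical.

```python
import math


def envelope(query, band):
    n = len(query)
    lower, upper = [], []
    for i in range(n):
        window = query[max(0, i - band):min(n, i + band + 1)]
        lower.append(min(window))
        upper.append(max(window))
    return lower, upper


def lb_kim_fl(query, candidate):
    # First and last points are always aligned with each other.
    return (query[0] - candidate[0]) ** 2 + (query[-1] - candidate[-1]) ** 2


def lb_keogh(candidate, lower, upper):
    total = 0.0
    for c, lo, hi in zip(candidate, lower, upper):
        if c > hi:
            total += (c - hi) ** 2
        elif c < lo:
            total += (lo - c) ** 2
    return total


def dtw(query, candidate, band):
    n = len(query)
    prev = [math.inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (n + 1)
        for j in range(max(1, i - band), min(n, i + band) + 1):
            cost = (query[i - 1] - candidate[j - 1]) ** 2
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[n]


def search(series, query, band):
    """Return (best position, best distance, per-stage counters)."""
    n = len(query)
    lower, upper = envelope(query, band)
    best, best_pos = math.inf, -1
    stats = {"lb_kim_fl": 0, "lb_keogh": 0, "full_dtw": 0}
    for start in range(len(series) - n + 1):
        candidate = series[start:start + n]
        if lb_kim_fl(query, candidate) >= best:
            stats["lb_kim_fl"] += 1  # pruned by the cheapest bound
            continue
        if lb_keogh(candidate, lower, upper) >= best:
            stats["lb_keogh"] += 1  # pruned by the envelope bound
            continue
        stats["full_dtw"] += 1
        distance = dtw(query, candidate, band)
        if distance < best:
            best, best_pos = distance, start
    return best_pos, best, stats
```

The better the best-so-far distance, the more candidates the cheap bounds reject, which is why the share of full DTW computations can become very small on long series.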
Some issues
We have to handle:
➔ change of partitions in the code
➔ search at partition splits (not to lose any detections)
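One common way to avoid losing detections at partition splits is to let consecutive chunks overlap by the query length minus one point, so that every candidate subsequence lies entirely inside exactly one chunk. The sketch below illustrates this under that assumption; it is not necessarily how the talk's implementation handles splits, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(series, chunk_size, query_len):
    """Return (offset, chunk) pairs sharing query_len - 1 points."""
    if chunk_size < query_len:
        raise ValueError("chunk_size must be at least query_len")
    step = chunk_size - (query_len - 1)
    chunks = []
    offset = 0
    while True:
        chunks.append((offset, series[offset:offset + chunk_size]))
        if offset + chunk_size >= len(series):
            break  # this chunk reaches the end of the series
        offset += step
    return chunks
```

Keeping the offset next to each chunk lets a match position found inside a chunk be translated back to a position in the full series.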
Streaming performance benchmark
Settings
Hardware
Laptop ⇒ i5, 16 GB, SSD…
Tested with 1 thread (unfortunately!)
Some first tests with Kubernetes
Streaming Data Generator
Random walk: ~2M to 3M points/second
(without Flink 20M+/s)
Performance for a jumping window of 10,000
Performance for a jumping window of 1M
Pruning vs window size (R=10%, Query size=100)
[Figure: two panels, Window = 10,000 and Window = 1,000,000]
Performance for a jumping window of 10,000
Performance for a jumping window of 1M
Streaming issues
➔ Jumping windows ⇒ we might miss some detections at the junction
➔ Can be fixed using sliding windows but for large sliding windows,
"evict" on the CountEvictor is slow.
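The junction problem can be shown with plain index arithmetic: a match that straddles two jumping windows is not fully contained in either, while overlapping windows catch it. The sketch below models count-based windows as `(start, end)` index pairs; the helper names are hypothetical and no Flink API is involved.

```python
def count_windows(n_points, size, slide):
    """(start, end) pairs, end exclusive, of all full count windows."""
    return [(s, s + size) for s in range(0, n_points - size + 1, slide)]


def is_covered(match_start, match_len, windows):
    """True if some window contains the whole match."""
    return any(s <= match_start and match_start + match_len <= e
               for s, e in windows)


# A 5-point match starting at index 8 straddles the junction at index 10.
jumping = count_windows(30, size=10, slide=10)
sliding = count_windows(30, size=10, slide=10 - (5 - 1))
```

Here `is_covered(8, 5, jumping)` is false and `is_covered(8, 5, sliding)` is true. A slide of `size - (match_len - 1)` is the largest one that still guarantees containment, and the price is that overlapping points are processed more than once.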
Kubernetes configuration
40 Flink task managers deployed (4GB per TM)
First runs in k8s...
100M
4 Billion
(8M/s)
One VM performance
F4
Conclusion and future work
Conclusion
➔ The original algorithm is really fast! ⇒ easy to use as is and to take
advantage of Flink directly
➔ Can be used on massive historical data very efficiently
➔ Can be used on streaming data, but would need some tweaking for
better performance on small windows
Future work
➔ Dynamically change the patterns using a stream of pattern updates
➔ Use Flink for pre-filtering windows (min/max, CEP...)
➔ Continue testing on a Kubernetes cluster
➔ Optimization for smaller windows?
➔ Use a Fold function instead of Process?
Which Flink function to use?
ProcessWindowFunction
A ProcessWindowFunction gets an Iterable containing all the elements of the window,
and a Context object with access to time and state information, which enables it to
provide more flexibility than other window functions. This comes at the cost of
performance and resource consumption, because elements cannot be incrementally
aggregated but instead need to be buffered internally until the window is considered
ready for processing.
FoldFunction
A FoldFunction specifies how an input element of the window is combined with an
element of the output type. The FoldFunction is incrementally called for each element
that is added to the window and the current output value. The first element is combined
with a pre-defined initial value of the output type.
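The difference between the two styles can be mimicked without Flink. In the sketch below, the first function buffers the whole window before computing a min/max summary, while the second folds each element into a small accumulator as it arrives, starting from a pre-defined initial value. The function names are hypothetical, and the min/max summary is only an example of the kind of window pre-filter mentioned in the future work.

```python
import math
from functools import reduce


def process_window(elements):
    """Buffered style: hold every element, then compute the result."""
    buffered = list(elements)
    return (min(buffered), max(buffered))


def fold_min_max(acc, value):
    """Incremental style: combine one element with the running result."""
    lo, hi = acc
    return (min(lo, value), max(hi, value))


def fold_window(elements):
    """Fold all elements starting from a pre-defined initial value."""
    return reduce(fold_min_max, elements, (math.inf, -math.inf))
```

Both functions return the same summary, but the folded version keeps only two numbers in memory per window. The DTW search itself still needs the points of the window, so folding is mainly relevant for summaries that can be aggregated incrementally.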
That’s all folks!
Flink - Online ML with MOA
Online machine learning with Flink and MOA
Blogpost: https://moa.cms.waikato.ac.nz/moa-with-apache-flink/
GitHub repo: https://github.com/csalperwyck/
- moa.flink.traintest:
- Train a model on a stream, test/deploy it on another one
- Flink takes care of pushing model updates: CoFlatMapFunction
- moa.flink.ozabag:
- Train many models in parallel (Random Forest for example)
- Dynamic scaling should work on this kind of workload!
Connecting the Dots for Information Discovery.pdfNeo4j
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 

Recently uploaded (20)

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 

FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Christophe Salperwyck, Akamai Technologies

  • 1. FlinkDTW Time-series pattern search at scale using Dynamic Time Warping Christophe Salperwyck, Akamai Kraków https://www.linkedin.com/in/christophesalperwyck/
  • 3. A few words about Akamai
  • 4. Akamai is a leader in Content Delivery Network (CDN) services for delivering, optimizing and securing online content and business applications. Founded in 1998 and rooted in MIT technology. Solving Internet congestion with math, not hardware.
  • 6. Akamai Intelligent Edge (HTTP+HTTPS+DNS+MQTT) 240,000+ Servers 1300+ Cities 140+ Countries on 7 Continents 82.4 Tbps Peaks
  • 9. Bio: Software engineer who moved to data mining/science/analytics/... ⇒ PhD in stream mining (2012) A Survey on Supervised Classification on Data Streams Interest in Machine Learning at scale https://www.slideshare.net/Hadoop_Summit/courbospark-decision-tree-for-timeseries-on-spark Used to work on Hadoop/HBase to store plant sensor / time series data (1,000B points - 100TB) https://www.slideshare.net/HadoopSummit/a-data-lake-and-a-data-lab-to-optimize-operations-and-safety-within-a-nuclear-fleet Online learning - combining decision stumps/trees to pick the best ad https://www.slideshare.net/ChristopheSalperwyck/explorationexploitation2011salperwyckurvoycontr01
  • 10. 1. Time series? 2. DTW: Dynamic Time Warping 3. Bibliography on Fast/Parallelize DTW 4. Use case 5. Benchmark 6. Conclusion and future work
  • 15. Many data are time series! ➔ IoT/IIoT data ➔ Sales/Marketing data ➔ Monitoring data: data centers, network... ➔ Science/Medicine: Earthquake, EEG, ECG, DNA... ➔ Social network: likes over time per specific category ➔ ...
  • 16. What is a time series? Wikipedia: "A time series is a series of data points indexed in time order." In the Flink world: <seriesId, timestamp, value> ⇒ Tuple3<String, Long, Double>
  • 17. Time series preprocessing / cleaning? ➔ Outliers ➔ Removing abnormal periods (too many missing values...) ➔ Filling gaps (with last value, interpolation...) ➔ Removing seasonality ➔ Subsampling if needed ➔ Transformations (FFT...) ➔ ...
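The "filling gaps with last value" step on slide 17 can be sketched as follows. This is a minimal illustration in plain Java, not code from the talk: the class name `GapFiller`, the parallel-array layout and the `step` parameter are assumptions made for the example, and timestamps are assumed to be sorted in ascending order.

```java
final class GapFiller {
    /**
     * Resamples a series onto a regular grid starting at ts[0] with the given
     * step (> 0), repeating the last known value wherever a sample is missing.
     * Assumes ts is sorted ascending and has the same length as values.
     */
    static double[] fillWithLastValue(long[] ts, double[] values, long step) {
        if (ts.length == 0) {
            return new double[0];
        }
        int n = (int) ((ts[ts.length - 1] - ts[0]) / step) + 1;
        double[] out = new double[n];
        int j = 0; // index of the latest sample with timestamp <= current grid time
        for (int k = 0; k < n; k++) {
            long t = ts[0] + k * step;
            while (j + 1 < ts.length && ts[j + 1] <= t) {
                j++;
            }
            out[k] = values[j];
        }
        return out;
    }
}
```

For a 1 s grid, samples at seconds 0, 1, 4 and 5 become six points, with seconds 2 and 3 repeating the value seen at second 1.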
  • 19. Pattern search using DTW (Dynamic Time Warping)
  • 21. Which distance to use?
  • 22. DTW algorithm. Source: Searching and mining trillions of time series subsequences under dynamic time warping, Rakthanmanon et al., SIGKDD 2012
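Since the algorithm figure on slide 22 did not survive in the transcript, here is a minimal sketch of the textbook DTW recurrence with a warping band of width `r`. It is an illustration under stated assumptions, not the UCR suite or FlinkDTW code: the class name `Dtw` is hypothetical, the point cost is the squared difference, and no z-normalization is applied.

```java
import java.util.Arrays;

final class Dtw {
    /**
     * Banded DTW: cost[i][j] = (a[i-1] - b[j-1])^2
     *   + min(cost[i-1][j-1], cost[i-1][j], cost[i][j-1]),
     * evaluated only for cells with |i - j| <= r.
     * Returns the summed squared differences along the best warping path,
     * or +Infinity when no path fits inside the band.
     */
    static double distance(double[] a, double[] b, int r) {
        int n = a.length;
        int m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - r);
            int hi = Math.min(m, i + r);
            for (int j = lo; j <= hi; j++) {
                double d = a[i - 1] - b[j - 1];
                double best = Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = d * d + best;
            }
        }
        return cost[n][m];
    }
}
```

With `r = 0` this reduces to the squared Euclidean distance; a larger `r` lets a pattern match a copy of itself that is shifted in time by up to `r` points.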
  • 23. Fast/Parallelize DTW: how to speed it up?
  • 24. UCR DTW - best KDD paper 2012. Searching and mining trillions of time series subsequences under dynamic time warping. Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, and Eamonn Keogh. KDD '12 https://www.cs.ucr.edu/~eamonn/UCRsuite.html An influential paper on gesture recognition on multi-touch screens laments that “DTW took 128.6 minutes to run the 14,400 tests for a given subject’s 160 gestures.” However, we can reproduce the results in under 3 seconds.
  • 25. Why is it so fast? Early abandoning! R ⇒ Warping band (path deviation), n ⇒ Query length
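Early abandoning can be illustrated with a small variant of banded DTW that stops as soon as the result can no longer beat the best match found so far. This is a hedged sketch, not the UCR suite implementation: the class name `EarlyAbandonDtw` and the `bestSoFar` parameter are assumptions, both sequences are assumed to have the same length, and whole rows are reset for readability rather than speed.

```java
import java.util.Arrays;

final class EarlyAbandonDtw {
    /**
     * Banded DTW over two equal-length sequences, keeping only two rows.
     * Every warping path crosses every row and all costs are non-negative,
     * so once the smallest cost in a row exceeds bestSoFar the final
     * distance must exceed it too and the computation can stop.
     * Returns +Infinity when abandoned.
     */
    static double distance(double[] q, double[] c, int r, double bestSoFar) {
        final double inf = Double.POSITIVE_INFINITY;
        int n = q.length;
        double[] prev = new double[n + 1];
        double[] curr = new double[n + 1];
        Arrays.fill(prev, inf);
        prev[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, inf);
            double rowMin = inf;
            int lo = Math.max(1, i - r);
            int hi = Math.min(n, i + r);
            for (int j = lo; j <= hi; j++) {
                double d = q[i - 1] - c[j - 1];
                double best = Math.min(prev[j - 1], Math.min(prev[j], curr[j - 1]));
                curr[j] = d * d + best;
                rowMin = Math.min(rowMin, curr[j]);
            }
            if (rowMin > bestSoFar) {
                return inf; // early abandon: this candidate cannot win
            }
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n];
    }
}
```

A poor candidate is typically rejected after a handful of rows, which is why the tighter `bestSoFar` becomes during a search, the cheaper each remaining candidate gets.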
  • 26. Related work. Spark (2015) - large scale: Parallelization of Searching and Mining Time Series Data using Dynamic Time Warping. Shabib, Ahmed & Narang, Anish & Prasad Niddodi, Chaitra & Das, Madhura & Pradeep, Rachita & Shenoy, Varun & Auradkar, Prafullata & TS, Vignesh & Sitaram, Dinkar. International Conference on Advances in Computing, Communications and Informatics (ICACCI). Flink (2019) - fast detection: Time Series Similarity Search for Streaming Data in Distributed Systems. Ziehn, Ariane & Charfuelan Oliva, Marcela & Hemsen, Holmer & Markl, Volker. Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference. Data Analytics Solutions for Real-Life Applications.
  • 27. Use case utilities: grid frequency
  • 28. Grid frequency: regulation. In the event of a frequency variation consisting of a downward ramp of Δf = 50 mHz in 10 s followed by a stabilised regime, where the programmed Frequency Containment Reserve is greater than K.Δf, the Generation Unit must release: - 50% of the expected variation K.Δf in 20 s for Reserve Entities made up of Thermal Generation Units (in 100 s for Reserve Entities made up of Hydroelectric Generation Units); - 90% of the expected variation K.Δf in 60 s for Reserve Entities made up of Thermal Generation Units (in 300 s for Reserve Entities made up of Hydroelectric Generation Units). https://www.next-kraftwerke.com/energy-blog/who-is-disrupting-the-utility-frequency http://clients.rte-france.com/htm/an/offre/telecharge/20140101_Regles_SSY_approuvees_an.pdf http://clients.rte-france.com/htm/fr/offre/telecharge/20181026_Regles_services_systeme_frequence.pdf https://www.mainsfrequency.com/frequ_info_en.php
  • 29. Experiments. Data: open data https://www.nationalgrideso.com/balancing-services/frequency-response-services/historic-frequency-data Data size: 168M points (~5 years at 1 s) Time to read the data: 45 seconds (AVRO) Pattern search: < 1 minute (pattern size from 100 s to 500 s) Search speed: > 3M points/second
  • 32. Some stats on pruning: we almost never compute the full DTW! Example: Pruned by LB_KimFL: 95% Pruned by LB_Keogh: 5% Full DTW calculation: 0.008%
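The pruning cascade behind these numbers can be sketched as two cheap lower bounds tried in order of cost before any full DTW is run. The code below is a simplified illustration, not the UCR suite or FlinkDTW code: `LowerBounds`, `Stage` and `decide` are hypothetical names, the first/last bound only uses the two end points, sequences are assumed to have equal length of at least 2, and no z-normalization is applied.

```java
final class LowerBounds {
    enum Stage { LB_KIM_FL, LB_KEOGH, FULL_DTW }

    /** O(1) bound: any warping path must align first with first and last with last. */
    static double lbKimFL(double[] q, double[] c) {
        int n = q.length;
        double first = q[0] - c[0];
        double last = q[n - 1] - c[n - 1];
        return first * first + last * last;
    }

    /** Bound from the query envelope: squared distance from c[i] to [min, max] of q[i-r..i+r]. */
    static double lbKeogh(double[] q, double[] c, int r) {
        int n = q.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double upper = Double.NEGATIVE_INFINITY;
            double lower = Double.POSITIVE_INFINITY;
            for (int k = Math.max(0, i - r); k <= Math.min(n - 1, i + r); k++) {
                upper = Math.max(upper, q[k]);
                lower = Math.min(lower, q[k]);
            }
            if (c[i] > upper) {
                sum += (c[i] - upper) * (c[i] - upper);
            } else if (c[i] < lower) {
                sum += (lower - c[i]) * (lower - c[i]);
            }
        }
        return sum;
    }

    /** Reports which stage rejects the candidate, or FULL_DTW if neither bound can. */
    static Stage decide(double[] q, double[] c, int r, double bestSoFar) {
        if (lbKimFL(q, c) > bestSoFar) {
            return Stage.LB_KIM_FL;
        }
        if (lbKeogh(q, c, r) > bestSoFar) {
            return Stage.LB_KEOGH;
        }
        return Stage.FULL_DTW;
    }
}
```

Both bounds never exceed the banded DTW distance, so rejecting a candidate on a bound alone cannot discard a true match.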
  • 33. Some issues. We have to handle: ➔ change of partitions in the code ➔ search at partition splits (so as not to lose any detections)
  • 35. Settings. Hardware: Laptop ⇒ i5, 16 GB, SSD… Tested with 1 thread (unfortunately!) Some first tests with Kubernetes. Streaming Data Generator: Random walk: ~2M to 3M points/second (without Flink 20M+/s)
  • 36. Performance for a jumping window of 10,000
  • 37. Performance for a jumping window of 1M
  • 38. Pruning vs window size (R=10%, Query size=100): Window = 10,000, Window = 1,000,000
  • 39. Performance for a jumping window of 10,000
  • 40. Performance for a jumping window of 1M
  • 41. Streaming issues ➔ Jumping windows ⇒ we might miss some detections at the junction ➔ Can be fixed using sliding windows, but for large sliding windows "evict" on the CountEvictor is slow.
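One way to avoid missed detections at window junctions (and, similarly, at partition splits) without a full sliding window is to prepend the last n - 1 points of the previous window to each jumping window, where n is the query length. The sketch below shows that idea on a plain array; it is an assumption made for illustration, not the approach implemented in the talk, and `WindowOverlap`, `searchBuffers` and `countSubsequences` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WindowOverlap {
    /**
     * Cuts the stream into jumping windows of the given size, each extended
     * backwards by n - 1 points so that every length-n subsequence falls
     * entirely inside exactly one buffer.
     */
    static List<double[]> searchBuffers(double[] stream, int window, int n) {
        List<double[]> buffers = new ArrayList<>();
        for (int start = 0; start < stream.length; start += window) {
            int from = Math.max(0, start - (n - 1));
            int to = Math.min(stream.length, start + window);
            buffers.add(Arrays.copyOfRange(stream, from, to));
        }
        return buffers;
    }

    /** Number of length-n candidate subsequences contained in the buffers. */
    static int countSubsequences(List<double[]> buffers, int n) {
        int count = 0;
        for (double[] buffer : buffers) {
            count += Math.max(0, buffer.length - n + 1);
        }
        return count;
    }
}
```

For 20 points, a window of 10 and a query of length 4, plain jumping windows examine 14 candidates while the carry-over version examines all 17, at the cost of buffering 3 extra points per window.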
  • 42. Kubernetes configuration: 40 Flink task managers deployed (4 GB per TM)
  • 43. First runs in k8s... 100M 4 Billion (8M/s)
  • 46. Conclusion ➔ The original algorithm is really fast! ⇒ easy to use as is and to take advantage of Flink directly ➔ Can be used on massive past data very efficiently ➔ Can be used on streaming data but would need some tweaking for better performance on small windows
  • 47. Future work ➔ Dynamically change the patterns using a stream of updates on the patterns ➔ Use Flink for pre-filtering windows (min/max, CEP...) ➔ Continue testing on a Kubernetes cluster ➔ Optimization for smaller windows? ➔ Use Fold function instead of Process?
  • 48. Which Flink function to use? ProcessWindowFunction: A ProcessWindowFunction gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions. This comes at the cost of performance and resource consumption, because elements cannot be incrementally aggregated but instead need to be buffered internally until the window is considered ready for processing. FoldFunction: A FoldFunction specifies how an input element of the window is combined with an element of the output type. The FoldFunction is incrementally called for each element that is added to the window and the current output value. The first element is combined with a pre-defined initial value of the output type.
  • 50. Flink - Online ML with MOA
  • 51. Online machine learning with Flink and MOA. Blogpost: https://moa.cms.waikato.ac.nz/moa-with-apache-flink/ GitHub repo: https://github.com/csalperwyck/ - moa.flink.traintest: - Train a model on a stream, test/deploy it on another one - Flink takes care of pushing model updates: CoFlatMapFunction - moa.flink.ozabag: - Train many models in parallel (Random Forest for example) - Dynamic scaling should work on this kind of workload!