Baymeetup-FlinkResearch

Apache Flink Research
A look into the future
Paris Carbone - PhD Candidate
KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>

[Figure: timeline of streaming concepts and systems]
’88 Active Databases
’88 HiPac
’95 Materialised Views
’00 Eddies
’01 Complex Event Processing
’02 Aurora
’03 TelegraphCQ
’03 STREAM
’05 Borealis
’05 Decentralised Stream Queries
’05 High Availability on Streaming
’12 Policy-Based Windowing
’12 Twitter Storm
’12 IBM System S
’13 Parallel Recovery
’13 Spark Streaming
’13 Google Millwheel
’13 Discretized Streams
’14 Apache Flink
’15 Advanced Windowing (session, watermarks, user-defined)
’15 Google Dataflow

Research in Flink
• Many ideas behind Flink were research products
• Job plan optimiser
• Efficient joins
• Memory management
• The execution engine was always a streaming engine

Our Focus
• Contributions already in the current release
• Streaming Semantics - Expressive Windowing
• State Management (representation, handling)
• Graph Semantics - Gelly
• Exactly-once-processing (checkpointing)

Ongoing Research
• Advanced State Management & Fault Tolerance
• Pre-aggregate sharing for sliding windows
• Streaming ML Pipelines
• Streaming Graphs
• Experiment Reproducibility

Current Focus
[Figure: Flink stack. Labels: Streaming API, Batch API, Flink Optimiser, Flink Runtime, Table, ML, Gelly, State Management]

Lessons Learned from Batch
[Figure: batch-1, batch-2]
• If a batch computation fails, simply repeat computation
as a transaction (if we have repeatable sources)
• Transaction rate is constant
• Can we apply these principles to a true streaming
execution?

Distributed Snapshots
[Figure: execution snapshots at t1, t2, t3; reset from t2]

Taking Snapshots
[Figure: execution snapshots at t1, t2]
Initial approach (see Naiad)
• Pause execution on t1,t2,..
• Collect state
• Restore execution
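The pause, collect, resume cycle and the "reset from t2" recovery can be sketched on a toy example. This is a minimal single-operator illustration, not Flink's or Naiad's implementation: the class names (`SummingOperator`, `Snapshot`, `SnapshotDemo`), the snapshot interval and the injected failure are all assumptions made for the sketch. It relies on a repeatable source, as the batch lesson above requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy operator: its only state is a running sum.
final class SummingOperator {
    private long sum;

    void process(int record) { sum += record; }
    long snapshotState() { return sum; }
    void restoreState(long saved) { sum = saved; }
    long current() { return sum; }
}

// A snapshot pairs the operator state with the source offset it reflects.
final class Snapshot {
    final int offset;
    final long state;

    Snapshot(int offset, long state) {
        this.offset = offset;
        this.state = state;
    }
}

final class SnapshotDemo {
    // Processes a repeatable source, taking a snapshot every `interval` records.
    // A single failure is injected just before index `failAt` (pass -1 for none).
    static long run(List<Integer> source, int interval, int failAt) {
        SummingOperator op = new SummingOperator();
        Snapshot last = new Snapshot(0, op.snapshotState());
        boolean failed = false;
        int i = 0;
        while (i < source.size()) {
            if (i == failAt && !failed) {
                failed = true;
                // The in-flight state is lost: start a fresh operator,
                // restore the last snapshot and rewind the source.
                op = new SummingOperator();
                op.restoreState(last.state);
                i = last.offset;
                continue;
            }
            op.process(source.get(i));
            i++;
            if (i % interval == 0) {
                // "Pause" (nothing else runs here), collect state, then resume.
                last = new Snapshot(i, op.snapshotState());
            }
        }
        return op.current();
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>();
        for (int v = 1; v <= 10; v++) {
            source.add(v);
        }
        System.out.println(run(source, 3, -1)); // prints 55
        System.out.println(run(source, 3, 8));  // prints 55: replay after reset gives the same result
    }
}
```

Because the snapshot stores the source offset together with the state, records processed after the snapshot are replayed exactly once into the restored state.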

Asynchronous Snapshots
[Figure: snapshots "snap - t1" and "snap - t2" taken at t1 and t2, each labelled "snapshotting"]

Sliding Window Optimisations
[Figure: window operator with Managed Memory, merge tree and Windowing Policies]
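To illustrate what sharing pre-aggregates between overlapping windows means, here is a minimal sketch of pane-based pre-aggregation, a simple and well-known technique swapped in for illustration. It is not the merge-tree design named on the slide. The class name `PaneSlidingSum` and the restriction to count-based windows whose size is a multiple of the slide are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class PaneSlidingSum {
    // Sums over count-based sliding windows of `size` records that advance by `slide` records.
    // Assumes size is a multiple of slide, so every window is exactly size/slide panes.
    static List<Long> windowSums(int[] stream, int size, int slide) {
        if (slide <= 0 || size <= 0 || size % slide != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of slide");
        }
        int panesPerWindow = size / slide;

        // Step 1: aggregate each non-overlapping pane exactly once.
        List<Long> panes = new ArrayList<>();
        for (int start = 0; start + slide <= stream.length; start += slide) {
            long partial = 0;
            for (int i = start; i < start + slide; i++) {
                partial += stream[i];
            }
            panes.add(partial);
        }

        // Step 2: each window result combines consecutive pane aggregates,
        // so records shared by overlapping windows are never re-read.
        List<Long> result = new ArrayList<>();
        for (int p = 0; p + panesPerWindow <= panes.size(); p++) {
            long total = 0;
            for (int j = p; j < p + panesPerWindow; j++) {
                total += panes.get(j);
            }
            result.add(total);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] stream = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(windowSums(stream, 4, 2)); // prints [10, 18, 26]
    }
}
```

Each record is read once to build its pane; a window then costs size/slide additions instead of size.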

ML Pipelines
[Figure: batch and streaming ML pipelines]
Flink ML (training set, test set): ETL, Transformers, Learners, Evaluators
Flink Streaming ML (training stream, test stream): stream ETL, concept drift detection, anomaly detection, online learning, online classification

ML on Unbounded Data
• We are often interested in:
  • Low-latency approximations in a single pass
  • Instant classification on stream ingestion, at the cost of higher error bounds
  • Continuous aggregates over synopses of unbounded data (e.g. stream sampling)
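As one concrete instance of a single-pass synopsis, stream sampling can be done with reservoir sampling (Algorithm R), which keeps a uniform sample of fixed size without knowing the stream length in advance. The sketch below is illustrative only; the class name `ReservoirSampler` and its methods are hypothetical and not part of any Flink library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper: keeps a uniform random sample of at most `capacity` items.
final class ReservoirSampler<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir;
    private long seen;

    ReservoirSampler(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
        this.reservoir = new ArrayList<>(capacity);
    }

    // Called once per stream element; uses O(capacity) memory overall.
    void offer(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
        } else {
            // Pick a slot in [0, seen); the item is kept with probability capacity/seen.
            long slot = (long) (random.nextDouble() * seen);
            if (slot < capacity) {
                reservoir.set((int) slot, item);
            }
        }
    }

    List<T> sample() { return new ArrayList<>(reservoir); }

    long seen() { return seen; }
}

final class ReservoirDemo {
    public static void main(String[] args) {
        ReservoirSampler<Integer> sampler = new ReservoirSampler<>(5, 42L);
        for (int i = 1; i <= 1000; i++) {
            sampler.offer(i);
        }
        // Five elements drawn from 1..1000; the exact values depend on the seed.
        System.out.println(sampler.sample());
    }
}
```

Aggregates such as means or quantiles can then be approximated from the sample at any point in the stream.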

Streaming ML
[Figure: library stack with Table, ML and Gelly; ML and Gelly appear twice]
bounded data: multi-pass algorithms, bulk classification
unbounded data: single-pass algorithms, instant classification

ML Use Cases
Batch ML                                | Streaming ML
SVM                                     | Anomaly Detection
Clustering                              | Concept Drift Detection
Col. Filtering (matrix factorisation)   | Incremental Clustering
Rank Estimation                         | Dec. Tree and Rule Mining
Similarity Matching                     | Approximations (freq itemsets, distinct items, samples etc.)

Stream ML Abstractions
• Reusing the same abstractions from the batch ML library
(e.g. Transformer, Learner, Evaluator)
• plus some more abstractions (e.g. Drift Detector)
https://github.com/senorcarbone/flink/tree/incremental-ml

Example: Vertical Hoeffding Trees
• Building a decision tree on the fly
• Parallelizing attribute metric computation (vertical parallelization)
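For context, Hoeffding trees decide when a leaf has seen enough examples to split by using the Hoeffding bound. The statement below is the standard form of that bound; the symbols are not on the slide and are introduced here only for illustration: R is the range of the split metric, n the number of examples observed at the leaf, and 1 - δ the desired confidence.

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```

With G the attribute metric, a_1 the best attribute and a_2 the second best at a leaf, the leaf is split on a_1 once G(a_1) - G(a_2) > ε. The per-attribute values of G are the metric computations that the vertical scheme spreads across parallel workers.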

VHT Pipeline Definition
[Figure: pipeline from input through VHT Learner to Prequential Evaluator; labels: DataPoints, Instance, Classification]

Modelling complex pipelines
[Figure: Transformer, Learner, Evaluator; signals: change, reset, error]

Or even more complex pipelines
[Figure: Integrating Batch and Streaming ML. Transformer, Learner, Evaluator and a Batch ML Pipeline; signals: change, error, correct, schedule]

Unbounded Graph Analysis
• Graphs are often created by a snapshot of a stream of
events: user interactions, product purchases, clicks, etc.
• Can we process the graph as a stream, immediately when
it arrives in the system?
• We can leverage existing research on one-pass streaming
algorithms and Flink’s streaming engine
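A small example of a one-pass graph algorithm is connectivity tracking over an edge stream with a union-find structure: each edge is seen once and only per-vertex state is kept, never the edge list. The sketch below is a single-machine illustration under that assumption; the class `StreamingConnectivity` is hypothetical and not part of Gelly or any Flink API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-pass connectivity tracker based on union-find.
final class StreamingConnectivity {
    private final Map<Integer, Integer> parent = new HashMap<>();
    private int components;

    // Returns the representative of v's component, registering v if it is new.
    private int find(int v) {
        if (!parent.containsKey(v)) {
            parent.put(v, v);
            components++;
            return v;
        }
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path directly at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    // Consumes one edge of the stream.
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            parent.put(ru, rv);
            components--;
        }
    }

    // True if both vertices have been seen and are in the same component so far.
    boolean connected(int u, int v) {
        return parent.containsKey(u) && parent.containsKey(v) && find(u) == find(v);
    }

    int components() { return components; }
}

final class ConnectivityDemo {
    public static void main(String[] args) {
        StreamingConnectivity graph = new StreamingConnectivity();
        int[][] edges = {{1, 2}, {3, 4}, {2, 3}, {5, 6}};
        for (int[] e : edges) {
            graph.addEdge(e[0], e[1]);
        }
        System.out.println(graph.components());     // prints 2
        System.out.println(graph.connected(1, 4));  // prints true
        System.out.println(graph.connected(1, 5));  // prints false
    }
}
```

The answer is available after every edge, which is what makes this style of algorithm a fit for a streaming engine. Edge deletions are not handled by this sketch.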

Streaming Graphs?
[Figure: library stack with Table, ML and Gelly; ML and Gelly appear twice]
bounded graph data: multi-pass algorithms (BSP), exact computations
unbounded graph data: single-pass algorithms, incremental computations

Graph Use Cases
Batch, Multi-Pass (BSP)   | Streaming, Single-Pass
Graph Traversal           | Degree Estimation
Rank Estimation           | Property Check (Bipartiteness/Connectivity)
Connected Components      | Max Cardinality Matching
Shortest Paths            | Triangle Count

API Preview
https://github.com/vasia/gelly-streaming

Example: Top-k Influential Users
DataStream<UserID> topUsers =
    GraphStream.fromDataStream(new TwitterSource())         // extracts user data and hashtags from the tweet
        .filterOnVertices(new FilterUserByFollowers(1000))  // filter out users with <1000 followers
        .filterOnEdges(new FilterHashTag("#graphs"))        // filter out edges with irrelevant hashtags
        .outDegrees().topK(1, 10);                          // the out-degree will be the number of relevant tweets

Experiments - Reproducibility
• Defining, deploying, orchestrating and collecting results for experiments is a big hassle!
• A single experiment will need
  • devops hours to allocate VMs, fetch the right versions and install system dependencies in the correct order
  • dev hours to write scripts for data processing/collection
• Repeating a benchmark/experiment is impossible without all the low-level configuration details

Introducing Karamel
[Figure labels: standalone web app, karamel file, karamelized cookbooks, yaml]
• Simplifying system dependencies to a bare minimum
• Simple integration for existing cookbooks (Chef) by adding a Karamel file
• Compositional cluster definitions
• Tight integration with GitHub

Flink in Karamel
• Demo
• https://www.youtube.com/watch?v=m_SkhyMV0to
