Apache Flink Research
A look into the future
Paris Carbone - PhD Candidate
KTH Royal Institute of Technology
<parisc@kth.se, senorcarbone@apache.org>
Timeline of streaming concepts and systems
’88  Active Databases; HiPac
’95  Materialised Views
’00  Eddies
’01  Complex Event Processing
’02  Aurora
’03  TelegraphCQ; STREAM
’05  Borealis; Decentralised Stream Queries; High Availability on Streaming
’12  Policy-Based Windowing; Twitter Storm; IBM System S
’13  Spark Streaming; Discretized Streams; Parallel Recovery; Google Millwheel
’14  Apache Flink
’15  Advanced Windowing (session, watermarks, user-defined); Google Dataflow
Research in Flink
• Many ideas behind Flink were research products
• Job plan optimiser
• Efficient joins
• Memory management
• The execution engine has always been a streaming engine
Our Focus
• Contributions already in the current release
• Streaming Semantics - Expressive Windowing
• State Management (representation, handling)
• Graph Semantics - Gelly
• Exactly-once-processing (checkpointing)
Ongoing Research
• Advanced State Management & Fault Tolerance
• Pre-aggregate sharing for sliding windows
• Streaming ML Pipelines
• Streaming Graphs
• Experiment Reproducibility
Current Focus
[Diagram: Flink stack]
Streaming API | Batch API
Flink Optimiser
Flink Runtime
Libraries: Table, ML, Gelly; ML, Gelly
State Management
Lessons Learned from Batch
[Figure: batch-1, batch-2]
• If a batch computation fails, simply repeat computation
as a transaction (if we have repeatable sources)
• Transaction rate is constant
• Can we apply these principles to a true streaming
execution?
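The repeat-on-failure principle above can be sketched in plain Java, with no Flink API involved. `BatchRetry`, `runWithRetry`, and the list-backed source are hypothetical names chosen for illustration; the sketch assumes a repeatable source, meaning that reading it again yields the same records.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: a batch is treated as a transaction over a repeatable source.
final class BatchRetry {
    // Runs `job` over the whole batch. On failure the partial work of that
    // attempt is discarded and the computation is repeated on the same batch.
    static <T, R> R runWithRetry(List<T> repeatableSource,
                                 Function<List<T>, R> job,
                                 int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.apply(repeatableSource); // commit: a result exists only on success
            } catch (RuntimeException e) {
                last = e; // abort: nothing from this attempt is kept
            }
        }
        throw new IllegalStateException("batch failed after " + maxAttempts + " attempts", last);
    }
}
```

The unit of recovery here is the whole batch, which is what makes the approach simple and also what a continuously running streaming job cannot reuse directly.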
Distributed Snapshots
[Figure: execution snapshots at t1, t2, t3; reset from t2]
Taking Snapshots
[Figure: execution snapshots at t1, t2]
Initial approach (see Naiad)
• Pause execution on t1,t2,..
• Collect state
• Restore execution
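The pause, collect, restore cycle and the "reset from t2" recovery can be illustrated with a single stateful operator. This is a minimal sketch under stated assumptions: `CountingOperator` and its `Snapshot` class are hypothetical, the "pause" is modelled by the caller simply not feeding records while `snapshot()` runs, and the input is assumed to be replayable from a stored offset.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a stop-the-world snapshot for one stateful operator.
final class CountingOperator {
    private Map<String, Long> counts = new HashMap<>();
    private long offset = 0; // position in the repeatable input

    void process(String key) {
        counts.merge(key, 1L, Long::sum);
        offset++;
    }

    // Collect state: the caller is assumed to have paused record delivery.
    Snapshot snapshot() {
        return new Snapshot(offset, new HashMap<>(counts));
    }

    // Reset to an earlier snapshot; input must then be replayed from its offset.
    void restore(Snapshot s) {
        counts = new HashMap<>(s.counts);
        offset = s.offset;
    }

    long count(String key) { return counts.getOrDefault(key, 0L); }
    long offset() { return offset; }

    static final class Snapshot {
        final long offset;
        final Map<String, Long> counts;

        Snapshot(long offset, Map<String, Long> counts) {
            this.offset = offset;
            this.counts = counts;
        }
    }
}
```

After a failure, restoring the latest snapshot and replaying the input from `offset()` yields the same counts as an uninterrupted run, provided the source is repeatable.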
Asynchronous Snapshots
[Figure: timeline with t1, t2; snap - t1 and snap - t2, each labelled "snapshotting"]
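One way to picture snapshotting that overlaps with execution is to keep the synchronous part down to a state copy and push the slow write into the background. The sketch below is an illustration only and does not attempt to reproduce Flink's actual snapshotting protocol; `AsyncSnapshotOperator` is a hypothetical class, and the in-memory map stands in for durable storage.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: the operator pauses only to copy its state; the write
// to the snapshot store runs in the background while processing continues.
final class AsyncSnapshotOperator {
    private final Map<String, Long> counts = new HashMap<>();
    // Stands in for durable snapshot storage (assumption for this sketch).
    private final Map<Long, Map<String, Long>> store = new ConcurrentHashMap<>();

    void process(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    CompletableFuture<Void> snapshotAsync(long snapshotId) {
        Map<String, Long> copy = new HashMap<>(counts);                        // short synchronous part
        return CompletableFuture.runAsync(() -> store.put(snapshotId, copy)); // asynchronous part
    }

    Map<String, Long> stored(long snapshotId) {
        return store.get(snapshotId);
    }
}
```

Records processed after `snapshotAsync` returns do not affect the stored snapshot, because the background task only ever sees the private copy.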
Sliding Window Optimisations
[Figure: window operator, merge tree, Managed Memory, Windowing Policies]
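The idea of sharing pre-aggregates between overlapping sliding windows can be sketched with panes: each element is aggregated once into a pane of `slide` elements, and every window result combines the panes it covers. This is a simplified stand-in, not the merge-tree design named on the slide; `PaneSlidingSum` is a hypothetical class and it assumes a count-based window whose size is a multiple of the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of pane-based sharing for a count-based sliding sum.
final class PaneSlidingSum {
    private final int panesPerWindow;
    private final int slide;
    private final Deque<Long> panes = new ArrayDeque<>(); // completed pane sums, oldest first
    private long currentPane = 0;
    private int inCurrentPane = 0;

    PaneSlidingSum(int windowSize, int slide) {
        if (slide <= 0 || windowSize <= 0 || windowSize % slide != 0) {
            throw new IllegalArgumentException("window size must be a positive multiple of the slide");
        }
        this.slide = slide;
        this.panesPerWindow = windowSize / slide;
    }

    // Feeds one value; returns the window sums emitted by this value (zero or one).
    List<Long> add(long value) {
        List<Long> out = new ArrayList<>();
        currentPane += value;
        if (++inCurrentPane == slide) {                 // pane is complete
            panes.addLast(currentPane);
            currentPane = 0;
            inCurrentPane = 0;
            if (panes.size() > panesPerWindow) {
                panes.removeFirst();                    // evict the expired pane
            }
            if (panes.size() == panesPerWindow) {
                long sum = 0;
                for (long p : panes) sum += p;          // combine shared pre-aggregates
                out.add(sum);
            }
        }
        return out;
    }
}
```

With a window of 4 and a slide of 2, each window result costs two pane additions instead of four element additions; a tree of partial aggregates would reduce the combine step further.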
ML Pipelines
[Diagram]
Flink ML (training set, test set): ETL, Transformers, Learners, Evaluators
Flink Streaming ML (training stream, test stream): stream ETL, concept drift detection, anomaly detection, online learning, online classification
ML on Unbounded Data
• We are often interested in:
• Low latency approximations on a single pass
• Instant classification on stream ingestion with higher error bounds
• Continuous aggregates on unbounded data synopses (e.g. stream sampling)
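Stream sampling is one example of a single-pass synopsis. The sketch below uses reservoir sampling (Algorithm R) to keep a uniform sample of fixed size over a stream of unknown length. The class name `ReservoirSample` and the explicit seed parameter are choices made for this illustration and are not taken from Flink.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): uniform sample of size k in one pass.
final class ReservoirSample<T> {
    private final int k;
    private final Random rng;
    private final List<T> sample = new ArrayList<>();
    private long seen = 0;

    ReservoirSample(int k, long seed) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        this.k = k;
        this.rng = new Random(seed);
    }

    void offer(T item) {
        seen++;
        if (sample.size() < k) {
            sample.add(item);                          // fill phase
        } else {
            long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
            if (j < k) {
                sample.set((int) j, item);             // keep the item with probability k / seen
            }
        }
    }

    List<T> sample() { return new ArrayList<>(sample); }
    long seen() { return seen; }
}
```

Memory use stays at k items regardless of how many records have been offered, which is the property that makes such synopses usable on unbounded data.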
Streaming ML
[Diagram]
Streaming API | Batch API
Libraries: Table, ML, Gelly; ML, Gelly
Batch: bounded data, multi-pass algorithms, bulk classification
Streaming: unbounded data, single-pass algorithms, instant classification
ML Use Cases
Batch ML                                 Streaming ML
SVM                                      Anomaly Detection
Clustering                               Concept Drift Detection
Col. Filtering (matrix factorisation)    Incremental Clustering
Rank Estimation                          Dec. Tree and Rule Mining
Similarity Matching                      Approximations (freq itemsets, distinct items, samples etc.)
Stream ML Abstractions
• Reusing the same abstractions from the batch ML library
(e.g. Transformer, Learner, Evaluator)
• plus some more abstractions (e.g. Drift Detector)
https://github.com/senorcarbone/flink/tree/incremental-ml
Example: Vertical Hoeffding Trees
• Building a decision tree on-the-fly
• Parallelizing attribute metric computation (vertical parallelization)
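Hoeffding trees decide when to split a leaf using the Hoeffding bound: after n observations of a metric with range R, the observed mean is within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability 1 − δ. The sketch below computes that bound and the usual split test on the two best attributes. `HoeffdingBound` and `shouldSplit` are hypothetical names, and the vertical parallelisation of the per-attribute statistics is not shown.

```java
// Sketch of the Hoeffding bound and the split decision it drives.
final class HoeffdingBound {
    // range: range R of the split metric; delta: allowed error probability;
    // n: number of instances observed at the leaf.
    static double epsilon(double range, double delta, long n) {
        if (n <= 0 || delta <= 0.0 || delta >= 1.0) {
            throw new IllegalArgumentException("need n > 0 and 0 < delta < 1");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split once the best attribute beats the runner-up by more than epsilon.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }
}
```

Because ε shrinks as n grows, a leaf that cannot yet separate two close attributes simply waits for more instances before splitting.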
VHT Pipeline Definition
[Diagram: input, VHT Learner, Prequential Evaluator; labels: DataPoints, Instance, Classification]
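Prequential evaluation means test-then-train: each instance is first used to test the current model and only afterwards to update it. The sketch below shows that loop with a toy learner. `OnlineLearner`, `MajorityClassLearner`, and `PrequentialEvaluator` are hypothetical names for this illustration and are not the interfaces of the incremental-ml branch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal learner abstraction: predict first, then train.
interface OnlineLearner<X, Y> {
    Y predict(X features);
    void train(X features, Y label);
}

// Toy learner that always predicts the most frequent label seen so far.
final class MajorityClassLearner<X> implements OnlineLearner<X, String> {
    private final Map<String, Integer> counts = new HashMap<>();
    private String majority = null; // null until the first label arrives

    public String predict(X features) {
        return majority;
    }

    public void train(X features, String label) {
        int c = counts.merge(label, 1, Integer::sum);
        if (majority == null || c > counts.get(majority)) {
            majority = label;
        }
    }
}

// Prequential (test-then-train) evaluation over a stream of labelled instances.
final class PrequentialEvaluator<X, Y> {
    private long seen = 0;
    private long errors = 0;

    void testThenTrain(OnlineLearner<X, Y> learner, X features, Y label) {
        Y predicted = learner.predict(features);
        if (!label.equals(predicted)) errors++;
        seen++;
        learner.train(features, label);
    }

    double errorRate() {
        return seen == 0 ? 0.0 : (double) errors / seen;
    }
}
```

Every instance contributes to both the error estimate and the model, so no separate held-out test set is needed on an unbounded stream.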
Modelling complex pipelines
[Diagram: Transformer, Learner, Evaluator; labels: change, reset, error]
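The error, change, reset loop in the diagram can be sketched with a very simple drift detector fed by the evaluator's per-instance errors. `WindowedDriftDetector` is a hypothetical class, and its fixed-window threshold rule is a placeholder chosen for brevity, not a published drift detection method and not necessarily what the Drift Detector abstraction implements.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical drift detector: signals a change when the error rate over the
// last `window` predictions exceeds `threshold`.
final class WindowedDriftDetector {
    private final int window;
    private final double threshold;
    private final Deque<Boolean> recent = new ArrayDeque<>();
    private int errorsInWindow = 0;

    WindowedDriftDetector(int window, double threshold) {
        if (window <= 0) throw new IllegalArgumentException("window must be positive");
        this.window = window;
        this.threshold = threshold;
    }

    // Feeds one prediction outcome; returns true if a change is signalled.
    boolean update(boolean wasError) {
        recent.addLast(wasError);
        if (wasError) errorsInWindow++;
        if (recent.size() > window && recent.removeFirst()) {
            errorsInWindow--; // the evicted outcome was an error
        }
        boolean drift = recent.size() == window
                && (double) errorsInWindow / window > threshold;
        if (drift) {          // start afresh after signalling
            recent.clear();
            errorsInWindow = 0;
        }
        return drift;
    }
}
```

In a pipeline, a `true` result would be the "change" signal on which the learner is reset and retrained on fresh data.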
Or even more complex pipelines
[Diagram: Transformer, Learner, Evaluator, Batch ML Pipeline; labels: change, error, correct, schedule]
Integrating Batch and Streaming ML
Unbounded Graph Analysis
• Graphs are often created by a snapshot of a stream of
events: user interactions, product purchases, clicks, etc.
• Can we process the graph as a stream, immediately when
it arrives in the system?
• We can leverage existing research on one-pass streaming
algorithms and Flink’s streaming engine
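As an illustration of a one-pass streaming graph algorithm, connectivity can be maintained over an edge stream with a union-find summary, so each edge is seen once and never stored. This sketch is plain Java and does not use Flink or gelly-streaming; `StreamingComponents` is a hypothetical class.

```java
import java.util.HashMap;
import java.util.Map;

// Single-pass connectivity over an edge stream using union-find.
final class StreamingComponents {
    private final Map<Long, Long> parent = new HashMap<>();
    private int components = 0;

    // Returns the representative of v, registering v on first sight.
    private long find(long v) {
        Long p = parent.get(v);
        if (p == null) {
            parent.put(v, v);
            components++;
            return v;
        }
        if (p == v) return v;
        long root = find(p);
        parent.put(v, root); // path compression
        return root;
    }

    // Consumes one edge; the edge itself is not retained.
    void addEdge(long src, long dst) {
        long a = find(src);
        long b = find(dst);
        if (a != b) {
            parent.put(a, b);
            components--;
        }
    }

    boolean connected(long u, long v) {
        return parent.containsKey(u) && parent.containsKey(v) && find(u) == find(v);
    }

    int componentCount() {
        return components;
    }
}
```

State grows with the number of vertices rather than the number of edges, which is what makes the summary suitable for an unbounded edge stream.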
Streaming Graphs?
[Diagram]
Streaming API | Batch API
Libraries: Table, ML, Gelly; ML, Gelly
Batch: bounded graph data, multi-pass algorithms (BSP), exact computations
Streaming: unbounded graph data, single-pass algorithms, incremental computations
Graph Use Cases
Batch, Multi-Pass (BSP)     Streaming, Single-Pass
Graph Traversal             Degree Estimation
Rank Estimation             Property Check (Bipartiteness/Connectivity)
Connected Components        Max Cardinality Matching
Shortest Paths              Triangle Count
API Preview
https://github.com/vasia/gelly-streaming
Example: Top-k Influential Users
DataStream<UserID> topUsers =
    // extracts user data and hashtags from the tweet
    GraphStream.fromDataStream(new TwitterSource())
        // filter out users with <1000 followers
        .filterOnVertices(new FilterUserByFollowers(1000))
        // filter out edges with irrelevant hashtags
        .filterOnEdges(new FilterHashTag("#graphs"))
        // the out-degree will be the number of relevant tweets
        .outDegrees().topK(1, 10);
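The out-degree and top-k steps of the preview can be mimicked in plain Java to show what the result means. This sketch does not use the gelly-streaming API; `OutDegreeTopK` is a hypothetical class, edges are assumed to be already filtered, and ties are broken alphabetically only to make the output deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: running out-degree counts over an edge stream, with top-k lookup.
final class OutDegreeTopK {
    private final Map<String, Long> outDegree = new HashMap<>();

    // One edge per relevant tweet: src is the tweeting user.
    void addEdge(String src, String dst) {
        outDegree.merge(src, 1L, Long::sum);
    }

    // Users with the highest out-degree so far, highest first.
    List<String> topK(int k) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(outDegree.entrySet());
        entries.sort((x, y) -> {
            int byDegree = Long.compare(y.getValue(), x.getValue());
            return byDegree != 0 ? byDegree : x.getKey().compareTo(y.getKey());
        });
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            out.add(entries.get(i).getKey());
        }
        return out;
    }
}
```

Sorting on every query is acceptable for a sketch; a streaming implementation would more likely maintain a bounded heap or an approximate heavy-hitters summary.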
Experiments - Reproducibility
• Defining, deploying, orchestrating and collecting results for experiments is a big hassle!
• A single experiment will need
• devops hours to allocate VMs, fetch the right versions and install system dependencies in the correct order
• dev hours to write scripts for data processing/collection
• Repeating a benchmark/experiment is impossible without all the low-level configuration details
Introducing Karamel
[Diagram: standalone web app, Karamel file, karamelized cookbooks, yaml]
• Simplifying system dependencies to a bare minimum
• Simple integration for existing cookbooks (Chef) by adding a Karamel file
• Compositional cluster definitions
• Tight integration with GitHub
Flink in Karamel
• Demo
• https://www.youtube.com/watch?v=m_SkhyMV0to

Baymeetup-FlinkResearch