Architecture of Flink's
Streaming Runtime
Robert Metzger
@rmetzger_
rmetzger@apache.org
Chicago Apache Flink Meetup,
October 29, 2015
What is stream processing?
 Real-world data is unbounded and is pushed to systems
 Right now: people are using the batch paradigm for stream analysis (there was no good stream processor available)
 New systems (Flink, Kafka) embrace the streaming nature of data
2
(Diagram: a web server pushes events to a Kafka topic, which feeds the stream processing pipeline.)
3
Flink is a stream processor with many faces
Streaming dataflow runtime
Flink's streaming runtime
4
Requirements for a stream processor
 Low latency
• Fast results (milliseconds)
 High throughput
• Handle large volumes of data (millions of events per second)
 Exactly-once guarantees
• Correct results, also in failure cases
 Programmability
• Intuitive APIs
5
Pipelining
6
Basic building block to “keep the data moving”:
• Low latency
• Operators push data forward
• Data is shipped as buffers, not tuple-wise
• Natural handling of back-pressure
(A buffer-timeout sketch follows below.)
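To make the buffer-vs-tuple trade-off concrete, here is a minimal sketch (not from the slides; it assumes the Scala DataStream API, where StreamExecutionEnvironment.setBufferTimeout bounds how long a partially filled network buffer may wait before being flushed):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Flush network buffers at least every 10 ms, even if they are not full.
// Smaller values lower latency; larger values favor throughput, because
// records are shipped in fuller buffers instead of tuple-wise.
env.setBufferTimeout(10)

env.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .print()

env.execute("buffer timeout sketch")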
Fault Tolerance in streaming
 At least once: ensure all operators see all events
• Storm: replay the stream in the failure case
 Exactly once: ensure that operators do not perform duplicate updates to their state
• Flink: distributed snapshots
• Spark: micro-batches on a batch runtime
7
Flink’s Distributed Snapshots
 Lightweight approach to storing the state of all operators without pausing the execution
 High throughput, low latency
 Implemented using barriers flowing through the topology
8
(Diagram: a barrier flows with the data stream through the operators. Operator state, e.g. a Kafka consumer offset (162) and an element counter value (152), is captured in the snapshot. Records before the barrier are part of the snapshot; records after the barrier are not, and are backed up until the next snapshot.)
Best of all worlds for streaming
 Low latency
• Thanks to the pipelined engine
 Exactly-once guarantees
• Distributed snapshots
 High throughput
• Controllable checkpointing overhead
 Separates application logic from recovery
• The checkpointing interval is just a configuration parameter (see the sketch below)
13
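As a hedged illustration of that last point (not from the slides), turning on periodic snapshots in the Scala DataStream API is a single configuration call; the 5-second interval below is just an example value:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Take a distributed snapshot (checkpoint) every 5 seconds.
// Only this setting changes when tuning recovery overhead;
// the application logic below stays exactly the same.
env.enableCheckpointing(5000)

env.socketTextStream("localhost", 9999)
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
  .print()

env.execute("checkpointing sketch")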
Throughput of distributed grep
14
Setup: a data generator feeding a “grep” operator, running on 30 machines (120 cores).
(Chart: aggregate throughput for Flink without fault tolerance, Flink with exactly-once (5 s checkpoints), Storm without fault tolerance, and Storm with micro-batches. Flink reaches an aggregate throughput of 175 million elements per second; Storm reaches roughly 9 million elements per second.)
• Flink achieves 20x higher throughput
• Flink throughput is almost the same with and without exactly-once
Aggregate throughput for stream record grouping
15
Setup: 30 machines (120 cores); grouping involves network transfer between machines.
(Chart: aggregate throughput. Flink reaches about 83 million elements per second, with and without exactly-once; Storm reaches about 8.6 million elements per second without fault tolerance and about 309k elements per second with at-least-once.)
 Flink achieves 260x higher throughput with fault tolerance
Latency in stream record grouping
16
Setup: a data generator feeding a receiver that measures throughput and latency.
• Measure the time for a record to travel from source to sink
(Charts: median latency is roughly 25 ms for Flink and 1 ms for Storm with at-least-once; 99th-percentile latency is up to roughly 50 ms.)
Exactly-Once with YARN Chaos Monkey
 Validate exactly-once guarantees with a state machine
18
“Faces” of Flink
19
Faces of a stream processor
20
(Diagram: stream processing, batch processing, machine learning at scale, and graph analysis, all on top of the streaming dataflow runtime.)
The Flink Stack
21
(Diagram: the Flink stack, bottom to top: Deployment, the Flink Core runtime (the streaming dataflow runtime), the Core APIs, and Specialized Abstractions / APIs.)
APIs for stream and batch
22
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ")
                            .map(word => Word(word, 1)) }
     .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
     .groupBy("word").sum("frequency")
     .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ")
                            .map(word => Word(word, 1)) }
     .groupBy("word").sum("frequency")
     .print()
The Flink Stack
23
(Diagram: the DataSet (Java/Scala) and DataStream (Java/Scala) APIs sit on top of the streaming dataflow runtime; an experimental Python API is also available. A batch optimizer and a graph builder translate programs into an API-independent dataflow graph representation, e.g. data sources for orders.tbl and lineitem.tbl, a filter, a map, a hybrid-hash join (build/probe, hash-partitioned on the key), and a sorted group-reduce.)
Batch is a special case of streaming
 Batch: run a bounded stream (a data set) on a stream processor
 Form a global window over the entire data set for join or grouping operations (see the sketch below)
24
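A minimal sketch of the idea (not from the slides, and written against a later DataStream API than the 0.9-era examples above): a bounded collection is fed in as a finite stream, and the running grouped aggregate equals the batch result once the source is exhausted.

import org.apache.flink.streaming.api.scala._

case class Word(word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// A bounded "data set", served as a finite stream.
val lines: DataStream[String] = env.fromElements("to be or not to be")

lines.flatMap(_.split(" "))
     .map(word => Word(word, 1))
     .keyBy("word")      // grouping, as in the DataSet program
     .sum("frequency")   // running count; the final value per key equals
                         // the batch groupBy/sum result
     .print()

env.execute("bounded stream as batch")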
Batch-specific optimizations
 Managed memory, on- and off-heap
• Operators (join, sort, …) with out-of-core support
• Optimized serialization stack for user types
 Cost-based optimizer
• The chosen execution plan depends on the data size
25
The Flink Stack
26
(Diagram: the stack again: Deployment, the Flink Core runtime (the streaming dataflow runtime), and the Core APIs DataSet (Java/Scala) and DataStream, underneath the Specialized Abstractions / APIs layer.)
FlinkML: Machine Learning
 API for ML pipelines inspired by scikit-learn
 Collection of packaged algorithms
• SVM, Multiple Linear Regression, Optimization, ALS, ...
27
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...
val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)
pipeline.fit(trainingData)
val predictions: DataSet[LabeledVector] = pipeline.predict(testingData)
Gelly: Graph Processing
 Graph API and library
 Packaged algorithms
• PageRank, SSSP, Label Propagation, Community Detection, Connected Components
28
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Graph<Long, Long, NullValue> graph = ...
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(
new LabelPropagation<Long>(30)).getVertices();
verticesWithCommunity.print();
env.execute();
Flink Stack += Gelly, ML
29
(Diagram: the Gelly and ML libraries added on top of the DataSet (Java/Scala) API, next to DataStream, on the streaming dataflow runtime.)
Integration with other systems
30
Hadoop M/R
• Use Hadoop Input/Output Formats
• Mapper / Reducer implementations
• Hadoop’s FileSystem implementations
• Built-in serializer support for Hadoop’s Writable type

Google Dataflow
• Run applications implemented against Google’s Dataflow API on premise with Flink

Cascading
• Run Cascading jobs on Flink, with almost no code change
• Benefit from Flink’s vastly better performance than MapReduce

Zeppelin
• Interactive, web-based data exploration

SAMOA
• Machine learning on data streams

Storm
• Compatibility layer for running Storm code
• FlinkTopologyBuilder: one-line replacement for existing jobs
• Wrappers for Storm Spouts and Bolts
• Coming soon: exactly-once with Storm
Deployment options
(Deployment targets in the diagram: Local, Cluster, YARN, Tez, Embedded.)

Local
• Start Flink in your IDE / on your machine
• Local debugging / development using the same code as on the cluster

Cluster
• “Bare metal” standalone installation of Flink on a cluster

YARN
• Flink on Hadoop YARN (Hadoop 2.2.0+)
• Restarts failed containers
• Support for Kerberos-secured YARN/HDFS setups
The full stack
32
(Diagram: the libraries (Gelly, Table, ML, SAMOA, Dataflow (WiP), MRQL, Cascading, Storm (WiP), Zeppelin, and Hadoop M/R) sit on the DataSet (Java/Scala) and DataStream APIs, which run on the streaming dataflow runtime and deploy to Local, Cluster, YARN, Tez, or Embedded.)
Closing
33
tl;dr Summary
Flink is a software stack consisting of:
 A streaming runtime
• low latency
• high throughput
• fault-tolerant, exactly-once data processing
 Rich APIs for batch and stream processing
• a library ecosystem
• integration with many systems
 A great community of devs and users
 Used in production
34
What is currently happening?
 Features for the upcoming 0.10 release:
• Master high availability
• Vastly improved monitoring GUI
• Watermarks / event-time processing / windowing rework
• Stable streaming API (no more “beta” flag)
 Next release: 1.0
35
How do I get started?
36
Mailing Lists: (news | user | dev)@flink.apache.org
Twitter: @ApacheFlink
Blogs: flink.apache.org/blog, data-artisans.com/blog/
IRC channel: irc.freenode.net#flink
Start Flink on YARN in 4 commands:
# get the hadoop2 package from the Flink download page at
# http://flink.apache.org/downloads.html
wget <download url>
tar xvzf flink-0.9.1-bin-hadoop2.tgz
cd flink-0.9.1/
./bin/flink run -m yarn-cluster -yn 4 ./examples/flink-java-examples-0.9.1-WordCount.jar
flink.apache.org 37
• Check out the slides: http://flink-forward.org/?post_type=session
• Video recordings will be available very soon
Appendix
38
Managed (off-heap) memory and out-of-core support
39
(Chart annotation: “Memory runs out”, illustrating out-of-core support once available memory is exhausted.)
Cost-based Optimizer
40
(Diagram: two alternative execution plans for the same program. Both read orders.tbl and lineitem.tbl, filter and map, perform a hybrid-hash join (build the hash table on one side, probe with the other), and finish with a sorted group-reduce. One plan broadcasts one input and adds a combiner; the other hash-partitions both inputs on the join key. The best plan depends on the relative sizes of the input files.)
41
// Transitive closure: repeatedly extend the known paths with the edge set.
case class Path(from: Long, to: Long)

val tc = edges.iterate(10) { paths: DataSet[Path] =>
  val next = paths
    .join(edges)
    .where("to")
    .equalTo("from") { (path, edge) =>
      Path(path.from, edge.to)
    }
    .union(paths)
    .distinct()
  next
}
(Diagram: from program to running dataflow. Pre-flight, on the client: the type extraction stack and the optimizer turn the program into a dataflow graph, e.g. data sources for orders.tbl and lineitem.tbl, filter, map, a hybrid-hash join (build/probe, hash-partitioned on the key), and a sorted group-reduce. The JobManager performs task scheduling, keeps dataflow metadata, deploys operators, and tracks intermediate results; the TaskManagers execute them. Cluster modes: YARN, standalone, local.)
Iterative processing in Flink
Flink offers built-in iterations and delta iterations to execute ML and graph algorithms efficiently.
42
(Diagram: an iterative dataflow combining map, join, and sum operators.)
Example: Matrix Factorization
43
Factorizing a matrix with 28 billion ratings for recommendations
More at: http://data-artisans.com/computing-recommendations-with-flink.html
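As a hedged sketch of what such a job can look like with FlinkML’s ALS (the toy ratings, parameter values, and field layout below are illustrative assumptions, not the setup from the blog post):

import org.apache.flink.api.scala._
import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// Ratings as (userId, itemId, rating) triples; here a tiny in-memory sample.
val ratings: DataSet[(Int, Int, Double)] = env.fromElements(
  (1, 10, 4.0), (1, 11, 2.5), (2, 10, 5.0), (2, 12, 1.0))

// Alternating Least Squares matrix factorization.
val als = ALS()
  .setNumFactors(10)
  .setIterations(10)
  .setLambda(0.1)

als.fit(ratings)

// Predict ratings for unseen (user, item) pairs.
val test: DataSet[(Int, Int)] = env.fromElements((1, 12), (2, 11))
val predictions: DataSet[(Int, Int, Double)] = als.predict(test)
predictions.print()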
44
Batch aggregation
(Diagram: the ExecutionGraph on the JobManager coordinating TaskManager 1 and TaskManager 2: mappers M1/M2 produce “blocked” result partitions RP1/RP2, which receivers R1/R2 only consume once they are fully written; numbered steps 1, 2, 3a/3b, 4a/4b, 5a/5b.)
45
Streaming window aggregation
(Diagram: the same ExecutionGraph, but with “pipelined” result partitions: receivers R1/R2 consume data from RP1/RP2 while mappers M1/M2 are still producing it.)
Editor's Notes

• #15: GCE, 30 instances with 4 cores and 15 GB of memory each; Flink master from July 24th, Storm 0.9.3. All the code used for the evaluation can be found here. Flink: 1.5 million elements per second per core, aggregate throughput in the cluster 182 million elements per second. Storm: 82,000 elements per second per core, aggregate 0.57 million elements per second. Storm with acknowledgements: 4,700 elements per second per core, latency 30-120 milliseconds. Trident: 75,000 elements per second per core.
• #16: Flink: 720,000 events per second per core, 690,000 with checkpointing activated. Storm with at-least-once: 2,600 events per second per core.
• #17: Flink: 720,000 events per second per core, 690,000 with checkpointing activated. Storm with at-least-once: 2,600 events per second per core.
• #18: Flink with buffer timeout 0: median latency 0 msec, 99th percentile 20 msec, 24,500 events per second per core.