Robert Metzger presented on the one-year growth of the Apache Flink community and gave an overview of Flink's capabilities. Flink natively supports streaming, batch, machine learning, and graph processing workloads by executing everything as data streams, allowing some iterative and stateful operations, and operating on managed memory. Key aspects of Flink streaming include pipelined processing, expressive APIs, efficient fault tolerance, and flexible windows and state. Batch pipelines in Flink are likewise executed as streaming programs with some blocking operations. Flink additionally supports SQL-like queries, machine learning algorithms via iterative data flows, and graph analysis via stateful delta iterations.
10. Flink Engine
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some mutable state
4. Operate on managed memory
17. Expressive APIs
case class Word (word: String, frequency: Int)

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap { line => line.split(" ")
  .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap { line => line.split(" ")
  .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
18. Checkpointing / Recovery
Chandy-Lamport algorithm for consistent asynchronous distributed snapshots.
Checkpoint barriers are pushed through the data flow: records before a barrier are part of the snapshot, records after it are not. Each operator starts its checkpoint when the barrier arrives; once the barrier has passed all operators, the checkpoint is done (state is backed up until the next snapshot).
[Figure: checkpoint barriers flowing through a data stream]
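The barrier mechanism can be illustrated with a small plain-Scala simulation (a conceptual sketch, not Flink internals; all names here are made up): an operator keeps a running sum, and when a barrier arrives it snapshots its current state, so records seen before the barrier are covered by the snapshot and records after it are not.

```scala
// Conceptual sketch of barrier-based snapshotting (not Flink code).
sealed trait Event
case class Record(value: Int) extends Event
case object Barrier extends Event

// The operator keeps a running sum; on a barrier it snapshots that sum.
def process(events: Seq[Event]): (Int, Option[Int]) = {
  var state = 0
  var snapshot: Option[Int] = None
  events.foreach {
    case Record(v) => state += v
    case Barrier   => snapshot = Some(state) // everything seen so far is in the snapshot
  }
  (state, snapshot)
}

// Records 1 and 2 arrive before the barrier, record 3 after it:
val (finalState, snap) = process(Seq(Record(1), Record(2), Barrier, Record(3)))
// finalState == 6, but snap == Some(3): only pre-barrier records are in the snapshot
```

On recovery, operators would restore the snapshotted state and replay the records that arrived after the barrier, which is what makes the snapshot consistent without pausing the stream.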
21. Batch on Streaming
Batch programs are a special kind of streaming program.

Streaming Programs        | Batch Programs
Infinite Streams          | Finite Streams
Stream Windows            | Global View
Pipelined Data Exchange   | Pipelined or Blocking Exchange
29. Iterate by looping
A for/while loop in the client submits one job per iteration step.
Data reuse by caching in memory and/or disk.
[Figure: client driving a sequence of step jobs]
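The driver-loop style described above can be sketched in plain Scala (a conceptual stand-in, not Flink's API; the step function is hypothetical): the client runs one step per iteration and reuses the cached result of the previous step instead of recomputing from scratch.

```scala
// Conceptual sketch of client-side iteration with data reuse (not Flink code).
// Each "job" applies one step to the cached intermediate result.
def step(data: Vector[Int]): Vector[Int] = data.map(_ * 2) // hypothetical step function

var cached = Vector(1, 2, 3) // initial data set, kept cached between steps
for (_ <- 1 to 3) {
  cached = step(cached) // one submitted job per iteration step
}
// After 3 steps every element has been doubled three times (x8)
```

The slide's point is the cost model of this style: each loop iteration is a separate job submission, so scheduling overhead is paid per step, which is why caching the intermediate result matters.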
37. Flink Roadmap for 2015
Some examples:
• More flexible state and state backends in streaming
• Master failover
• Improved monitoring
• Integration with other Apache projects (SAMOA, Zeppelin, Ignite)
• More additions to the libraries
38. Flink Forward registration & call for abstracts is open now
• 12 and 13 October 2015
• Kulturbrauerei, Berlin
• With Flink workshops/training!
flink.apache.org
43. Examples of optimization
Task chaining
• Coalesce map/filter/etc. tasks
Join optimizations
• Broadcast/partition, build/probe side, hash or sort-merge
Interesting properties
• Re-use partitioning and sorting for later operations
Automatic caching
• E.g., for iterations
Editor's Notes
Working on Flink since 2012.
Implemented YARN support
Taking a look back:
in only one year, a lot has happened. We were accepted into the ASF incubator and graduated quite fast …
… code-wise, we are quickly adding new features and functionality (while not forgetting to keep existing users happy with fixes ;) )
I checked a few days ago and found that we've doubled the lines of code in one year.
We could never have done this alone, without a very strong and amazing community.
At the very heart, Flink is a streaming dataflow runtime.
This means operators are running at the same time, sending data to each other. This allows exploiting parallelism, utilizing the hardware, etc.
To get something out of that runtime, we offer programming abstractions.
There are DataSet and DataStream for batch and stream processing.
On top of these APIs, users have built more: ….
So how do we turn a simple Java / Scala program into a robust distributed program?
a) type analysis / extraction (= think of it as "schema creation") … creation of serializers
b) optimization (data partitioning (global strategy), execution strategy (local strategy))
c) represented as a dataflow graph (with all the strategies set)
-------- local / remote border ----
d) scheduling & job metadata @ master
e) workers process data
What makes Flink special? It natively supports a very broad range of use cases.
Common use cases are:
- real-time stream processing: you want to process your data as it comes in
- large batch pipelines: reading data from many sources, joining, cleaning, and analyzing
- not only data-intensive use cases but also work-intensive ones (machine learning, graph analysis): how to intelligently distribute work through the cluster?
iterations through loop unrolling:
needed for many use cases, for example graph processing and machine learning
explain approach
slow because rescheduling & state recreation are necessary
streaming through mini-batches:
discretize your stream into "small" sets and process them with your batch system
high latency because you need to collect & start the batches
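The mini-batch approach in these notes can be sketched in plain Scala (a framework-agnostic illustration, not any system's API): incoming records are discretized into fixed-size groups, and each group is handed to a batch-style function.

```scala
// Conceptual sketch of mini-batch discretization (framework-agnostic).
val stream = (1 to 10).toList               // stand-in for incoming records
val miniBatches = stream.grouped(4).toList  // collect records into small batches
// Each mini-batch is then processed as a unit by the batch engine:
val results = miniBatches.map(batch => batch.sum)
// results == List(10, 26, 19)
```

This makes the latency argument concrete: no result for a record is available until its whole batch has been collected and scheduled, which is the overhead the notes contrast with Flink's record-at-a-time pipelining.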
How do we achieve this?
Everything is treated as data streams: multiple processing steps happen at the same time, with no materialization (= storing the result on disk) between processing steps.
We allow streams to have loops (feeding the result of an earlier computation back in). Flink is aware of iterative processing: no need to redeploy, and it can automatically optimize.
Users can keep state between iterations (for example, a model you are training). In streaming, we back up your state for you.
Flink always knows what's going on with its memory (instead of dealing with the "black box" GC).
For batch processing (which is often very data intensive) we need …
… explain …
.. so this is nice, but now all the user data is just a bunch of bytes in an array?
the fruits of our hard work
The last highlight of the batch system: the best of both worlds, SQL-style queries for the simple data lifting, custom functions for the complex / heavy stuff.