Apache Flink
Past, present and future
Gyula Fóra
gyfora@apache.org
What is Apache Flink
2
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Easy and powerful APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
[Dataflow diagram: sources and sinks connected through operators such as Map, Filter, Join, Reduce, and Iterate; a code sketch follows below]
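As a rough illustration (not from the original slides), a few of these operators can be combined with the Scala DataSet API; the input path, record format, and job name below are made up:

import org.apache.flink.api.scala._

object DataflowSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Source: lines of "sensorId,value" records (hypothetical input path)
    val readings = env.readTextFile("hdfs:///path/to/readings")
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1).toDouble)        // Map: parse each line
      }
      .filter(_._2 > 0.0)                      // Filter: drop non-positive values

    // Reduce: total value per sensor id
    val totals = readings
      .groupBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))

    // Sink: write the results and run the dataflow
    totals.writeAsCsv("hdfs:///path/to/output")
    env.execute("dataflow sketch")
  }
}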
What is Flink good at
3
It's a general-purpose data analytics system
▪ Real-time stream processing with flexible windowing
▪ Complex and heavy ETL jobs
▪ Analyzing huge graphs
▪ Machine learning on large data sets and streams
▪ …
The Flink Stack
4
[Stack diagram]
Libraries: Dataflow, Python, Gelly, Table, ML, SAMOA, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Optimizers: Batch Optimizer, Streaming Optimizer
Flink Runtime
Deployment: Local, Remote, YARN, Tez, Embedded
Word count in Flink
5
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
       .map(word => Word(word, 1)) }
  .window(Time.of(1, MINUTES)).every(Time.of(30, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
       .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Table API
6
val orders = env.readCsvFile(…)
.as('oId, 'oDate, 'shipPrio)
.filter('shipPrio === 5)
val items = orders
.join(lineitems).where('oId === 'id)
.select('oId, 'oDate, 'shipPrio,
'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)
val result = items
.groupBy('oId, 'oDate, 'shipPrio)
.select('oId, 'revenue.sum, 'oDate, 'shipPrio)
▪ Execute SQL-like expressions on table data
• Tight integration with Java and Scala APIs
• Available for batch and streaming programs
A trip down memory lane
7
April 16, 2014
8
9
Stratosphere 0.5
[Stack diagram]
APIs: DataSet API (Java), DataSet API (Scala)
Stratosphere Optimizer
Stratosphere Runtime
Deployment: Local, Remote, YARN
Key new features
• New Java API
• Distributed cache
• Collection data sources and sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support
10
Flink 0.7
[Stack diagram]
APIs: DataSet (Java/Scala), DataStream (Java)
Flink Optimizer, Stream Builder, Hadoop M/R
Flink Runtime
Deployment: Local, Remote, YARN, Embedded
Key new features
• Unification of Java and Scala APIs
• Logical keys / POJO support
• MR compatibility
• Collections backend
• Extended filesystem support
11
Flink 0.8
[Stack diagram]
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Flink Optimizer, Stream Builder, Hadoop M/R
Flink Runtime
Deployment: Local, Remote, YARN, Embedded
Key new features
• Improved filesystem support
• DataStream Scala API
• Streaming windows
• Lots of performance and stability improvements
• Kryo as the default serializer
12
Current master (0.9-SNAPSHOT)
[Stack diagram]
Libraries: Dataflow, Python, Gelly, Table, ML, SAMOA, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Optimizers: Batch Optimizer, Stream Optimizer
New Flink Runtime
Deployment: Local, Remote, YARN, Tez, Embedded
Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming fault tolerance
Flink community
13
[Chart: number of unique contributors by git commits (without manual de-duplication)]
Summary
▪ The project has a lot of momentum, with major improvements every release
▪ Healthy community
▪ Project diversification
• Real-time data streaming
• Several frontends (targeting different user profiles and use cases)
• Several backends (targeting different production settings)
▪ Integration with the open source ecosystem
14
Vision for Flink
15
What are we building?
16
A "use-case complete" framework to unify batch & stream processing
[Diagram] Flink consumes both kinds of input:
• Data streams (Kafka, RabbitMQ, ...)
• "Historic" data (HDFS, JDBC, ...)
and serves analytical workloads on top:
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis
The sketch below expresses the same computation against both a "historic" and a streaming source.
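A hedged sketch of that unification (not from the slides): the same per-user count written once against "historic" files and once against a live stream. The paths, port, Click type, and the socket source standing in for a Kafka/RabbitMQ connector are all illustrative, and the streaming calls follow the 0.9-era API used on the word-count slide.

// Batch job over "historic" data
object HistoricClickCounts {
  import org.apache.flink.api.scala._

  case class Click(user: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.readTextFile("hdfs:///logs/clicks")       // assume one user id per line
      .map(user => Click(user, 1))
      .groupBy("user").sum("count")
      .print()
  }
}

// The same logic as a continuous streaming job
object LiveClickCounts {
  import org.apache.flink.streaming.api.scala._

  case class Click(user: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)       // stands in for a Kafka/RabbitMQ source
      .map(user => Click(user, 1))
      .groupBy("user").sum("count")               // 0.9-era rolling aggregation, as on the word-count slide
      .print()
    env.execute("live click counts")
  }
}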
What are we building? (master)
[Diagram] An engine that puts equal emphasis on stream and batch processing:
• Real-time data streams / event logs (Kafka, RabbitMQ, ...) → low-latency windowing, aggregations, ...
• "Historic" data (HDFS, JDBC, ...) → ETL, graphs, machine learning, relational processing, ...
Integrating batch with streaming
18
Why?
▪ Applications need to combine streaming and static data sources
▪ Making the switch from batch to streaming easy will be key to boosting adoption
▪ Companies are making the transition from batch to streaming now
19
What is stream processing?
20
▪ Data stream: an infinite sequence of data arriving in a continuous fashion
▪ Stream processing: analyzing and acting on real-time streaming data, using continuous queries (see the sketch below)
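A minimal hedged sketch of such a continuous query: the socket source, the Reading type, and the threshold are illustrative, not part of the original slides.

import org.apache.flink.streaming.api.scala._

// Hypothetical sensor readings arriving as "sensorId,celsius" lines
case class Reading(sensor: String, celsius: Double)

object ContinuousQuery {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val readings = env.socketTextStream("localhost", 9999)
      .map { line =>
        val fields = line.split(",")
        Reading(fields(0), fields(1).toDouble)
      }

    // The query never "finishes": it keeps emitting results as new data arrives
    readings
      .filter(_.celsius > 90.0)    // act on every over-threshold reading
      .print()

    env.execute("continuous query sketch")
  }
}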
Lambda architecture
▪ "Speed layer" can be a stream processing system
▪ "Picks up" after the batch layer
21
Kappa architecture
▪ The need for separate batch and speed layers is not fundamental, only a practical consequence of current technology
▪ Idea: use a stream processing system for all data processing
▪ They are all dataflows anyway
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
22
Data streaming with Flink
▪ Flink is building a proper stream processing system
• that can execute both batch and stream jobs natively
• batch-only jobs pass through a different optimization code path
▪ Flink is building libraries and DSLs on top of both batch and streaming
• e.g., see the recent Table API
23
Data streaming with Flink
▪ Low-latency stream processor
▪ Expressive APIs in Scala/Java
▪ Stateful operators and flexible windowing (sketched below)
▪ Efficient fault tolerance for exactly-once guarantees
24
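To illustrate the windowing flexibility, here is a hedged sketch in the 0.9-era style of the word-count slide; the PageView type, the socket source, the window sizes, and the windowing-helper import path (assumed for that release line) are not from the deck.

import java.util.concurrent.TimeUnit.{MINUTES, SECONDS}
import org.apache.flink.streaming.api.scala._
// 0.9-era windowing helper (assumed package path for that release line)
import org.apache.flink.streaming.api.windowing.helper.Time

case class PageView(page: String, views: Int)

object WindowingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // One page name per line arrives on the socket
    val views = env.socketTextStream("localhost", 9999)
      .map(page => PageView(page, 1))

    // Sliding window: views per page over the last minute, refreshed every 10 seconds
    views
      .window(Time.of(1, MINUTES)).every(Time.of(10, SECONDS))
      .groupBy("page").sum("views")
      .print()

    // Tumbling window: views per page for each full minute (no .every(...))
    views
      .window(Time.of(1, MINUTES))
      .groupBy("page").sum("views")
      .print()

    env.execute("windowing sketch")
  }
}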
Summary
▪ Flink is a general-purpose data analytics system
▪ Unifies batch and stream processing
▪ Expressive high-level APIs
▪ Robust and fast execution engine
25
flink.apache.org
@ApacheFlink
