Apache Flink
Past, present and future
Gyula Fóra
gyfora@apache.org
What is Apache Flink
2
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Easy and powerful APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
[Dataflow diagram: sources and sinks connected through operators such as Map, Filter, Join, Reduce, and Iterate; a code sketch follows below]
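As a rough illustration (not from the original slides), a few of these operators can be combined with the Scala DataSet API; the input path, record format, and job name below are made up:

import org.apache.flink.api.scala._

object DataflowSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Source: lines of "sensorId,value" records (hypothetical input path)
    val readings = env.readTextFile("hdfs:///path/to/readings")
      .map { line =>
        val fields = line.split(",")
        (fields(0), fields(1).toDouble)        // Map: parse each line
      }
      .filter(_._2 > 0.0)                      // Filter: drop non-positive values

    // Reduce: total value per sensor id
    val totals = readings
      .groupBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))

    // Sink: write the results and run the dataflow
    totals.writeAsCsv("hdfs:///path/to/output")
    env.execute("dataflow sketch")
  }
}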
What is Flink good at
3
It's a general-purpose data analytics system
▪ Real-time stream processing with flexible windowing
▪ Complex and heavy ETL jobs
▪ Analyzing huge graphs
▪ Machine learning on large data sets and streams
▪ …
The Flink Stack
4
[Stack diagram]
Libraries: Dataflow, Python, Gelly, Table, ML, SAMOA, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Optimizers: Batch Optimizer, Streaming Optimizer
Flink Runtime
Deployment: Local, Remote, YARN, Tez, Embedded
Word count in Flink
5
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
       .map(word => Word(word, 1)) }
  .window(Time.of(1, MINUTES)).every(Time.of(30, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
       .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
Table API
6
val orders = env.readCsvFile(…)
.as('oId, 'oDate, 'shipPrio)
.filter('shipPrio === 5)
val items = orders
.join(lineitems).where('oId === 'id)
.select('oId, 'oDate, 'shipPrio,
'extdPrice * (Literal(1.0f) - 'discnt) as 'revenue)
val result = items
.groupBy('oId, 'oDate, 'shipPrio)
.select('oId, 'revenue.sum, 'oDate, 'shipPrio)
▪ Execute SQL-like expressions on table data
• Tight integration with Java and Scala APIs
• Available for batch and streaming programs
A trip down memory lane
7
April 16, 2014
8
9
Stratosphere 0.5
[Stack diagram]
APIs: DataSet API (Java), DataSet API (Scala)
Stratosphere Optimizer
Stratosphere Runtime
Deployment: Local, Remote, YARN
Key new features
• New Java API
• Distributed cache
• Collection data sources and sinks
• JDBC data sources and sinks
• Hadoop I/O format
• Avro support
10
Flink 0.7
[Stack diagram]
APIs: DataSet (Java/Scala), DataStream (Java)
Flink Optimizer, Stream Builder, Hadoop M/R
Flink Runtime
Deployment: Local, Remote, YARN, Embedded
Key new features
• Unification of Java and Scala APIs
• Logical keys / POJO support
• MR compatibility
• Collections backend
• Extended filesystem support
11
Flink 0.8
[Stack diagram]
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Flink Optimizer, Stream Builder, Hadoop M/R
Flink Runtime
Deployment: Local, Remote, YARN, Embedded
Key new features
• Improved filesystem support
• DataStream Scala API
• Streaming windows
• Lots of performance and stability improvements
• Kryo as the default serializer
12
Current master (0.9-SNAPSHOT)
[Stack diagram]
Libraries: Dataflow, Python, Gelly, Table, ML, SAMOA, Hadoop M/R
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Optimizers: Batch Optimizer, Stream Optimizer
New Flink Runtime
Deployment: Local, Remote, YARN, Tez, Embedded
Key new features
• New runtime
• Tez mode
• Python API
• Gelly
• Flinq
• FlinkML
• Streaming fault tolerance
Flink community
13
[Chart: number of unique contributors by git commits (without manual de-duplication)]
Summary
▪ The project has a lot of momentum, with major improvements every release
▪ Healthy community
▪ Project diversification
• Real-time data streaming
• Several frontends (targeting different user profiles and use cases)
• Several backends (targeting different production settings)
▪ Integration with the open source ecosystem
14
Vision for Flink
15
What are we building?
16
A "use-case complete" framework to unify batch & stream processing
[Diagram] Flink consumes both kinds of input:
• Data streams (Kafka, RabbitMQ, ...)
• "Historic" data (HDFS, JDBC, ...)
and serves analytical workloads on top:
• ETL
• Relational processing
• Graph analysis
• Machine learning
• Streaming data analysis
The sketch below expresses the same computation against both a "historic" and a streaming source.
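A hedged sketch of that unification (not from the slides): the same per-user count written once against "historic" files and once against a live stream. The paths, port, Click type, and the socket source standing in for a Kafka/RabbitMQ connector are all illustrative, and the streaming calls follow the 0.9-era API used on the word-count slide.

// Batch job over "historic" data
object HistoricClickCounts {
  import org.apache.flink.api.scala._

  case class Click(user: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    env.readTextFile("hdfs:///logs/clicks")       // assume one user id per line
      .map(user => Click(user, 1))
      .groupBy("user").sum("count")
      .print()
  }
}

// The same logic as a continuous streaming job
object LiveClickCounts {
  import org.apache.flink.streaming.api.scala._

  case class Click(user: String, count: Int)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 9999)       // stands in for a Kafka/RabbitMQ source
      .map(user => Click(user, 1))
      .groupBy("user").sum("count")               // 0.9-era rolling aggregation, as on the word-count slide
      .print()
    env.execute("live click counts")
  }
}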
What are we building? (master)
[Diagram] An engine that puts equal emphasis on stream and batch processing:
• Real-time data streams / event logs (Kafka, RabbitMQ, ...) → low-latency windowing, aggregations, ...
• "Historic" data (HDFS, JDBC, ...) → ETL, graphs, machine learning, relational processing, ...
Integrating batch with streaming
18
Why?
▪ Applications need to combine streaming and static data sources
▪ Making the switch from batch to streaming easy will be key to boosting adoption
▪ Companies are making the transition from batch to streaming now
19
What is stream processing?
20
▪ Data stream: an infinite sequence of data arriving in a continuous fashion
▪ Stream processing: analyzing and acting on real-time streaming data, using continuous queries (see the sketch below)
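A minimal hedged sketch of such a continuous query: the socket source, the Reading type, and the threshold are illustrative, not part of the original slides.

import org.apache.flink.streaming.api.scala._

// Hypothetical sensor readings arriving as "sensorId,celsius" lines
case class Reading(sensor: String, celsius: Double)

object ContinuousQuery {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val readings = env.socketTextStream("localhost", 9999)
      .map { line =>
        val fields = line.split(",")
        Reading(fields(0), fields(1).toDouble)
      }

    // The query never "finishes": it keeps emitting results as new data arrives
    readings
      .filter(_.celsius > 90.0)    // act on every over-threshold reading
      .print()

    env.execute("continuous query sketch")
  }
}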
Lambda architecture
▪ "Speed layer" can be a stream processing system
▪ "Picks up" after the batch layer
21
Kappa architecture
▪ The need for separate batch and speed layers is not fundamental, only a practical consequence of current technology
▪ Idea: use a stream processing system for all data processing
▪ They are all dataflows anyway
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
22
Data streaming with Flink
▪ Flink is building a proper stream processing system
• that can execute both batch and stream jobs natively
• batch-only jobs pass through a different optimization code path
▪ Flink is building libraries and DSLs on top of both batch and streaming
• e.g., see the recent Table API
23
Data streaming with Flink
▪ Low-latency stream processor
▪ Expressive APIs in Scala/Java
▪ Stateful operators and flexible windowing (sketched below)
▪ Efficient fault tolerance for exactly-once guarantees
24
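To illustrate the windowing flexibility, here is a hedged sketch in the 0.9-era style of the word-count slide; the PageView type, the socket source, the window sizes, and the windowing-helper import path (assumed for that release line) are not from the deck.

import java.util.concurrent.TimeUnit.{MINUTES, SECONDS}
import org.apache.flink.streaming.api.scala._
// 0.9-era windowing helper (assumed package path for that release line)
import org.apache.flink.streaming.api.windowing.helper.Time

case class PageView(page: String, views: Int)

object WindowingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // One page name per line arrives on the socket
    val views = env.socketTextStream("localhost", 9999)
      .map(page => PageView(page, 1))

    // Sliding window: views per page over the last minute, refreshed every 10 seconds
    views
      .window(Time.of(1, MINUTES)).every(Time.of(10, SECONDS))
      .groupBy("page").sum("views")
      .print()

    // Tumbling window: views per page for each full minute (no .every(...))
    views
      .window(Time.of(1, MINUTES))
      .groupBy("page").sum("views")
      .print()

    env.execute("windowing sketch")
  }
}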
Summary
▪ Flink is a general-purpose data analytics system
▪ Unifies batch and stream processing
▪ Expressive high-level APIs
▪ Robust and fast execution engine
25
flink.apache.org
@ApacheFlink
