Gyula Fóra
gyfora@apache.org
Flink committer
Swedish ICT
Real-time data processing
with Apache Flink
What is Apache Flink?
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Easy and powerful APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
[Figure: example dataflow graph built from Source, Map, Filter, Join, Reduce, Iterate, and Sink operators]
Flink Stack (0.9.0)
▪ Libraries and compatibility layers: Gelly, Table, ML, SAMOA, Hadoop M/R, Dataflow, MRQL, Cascading (WiP)
▪ APIs: DataSet (Java/Scala/Python), DataStream (Java/Scala)
▪ Runtime: Streaming dataflow runtime
▪ Deployment and execution: Local, Remote, Yarn, Tez, Embedded
Stream processing
▪ Data stream: an infinite sequence of data arriving continuously.
▪ Stream processing: analyzing and acting on real-time streaming data using continuous queries.
3 Parts of a Streaming Infrastructure
[Figure: pipeline of three stages: Gathering, Broker, Analysis. Example inputs: sensors, transaction logs, server logs, …]
Streaming landscape
Apache Storm
• True streaming over distributed dataflow
• Low-level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing emulated on top of a batch system (non-native)
• Functional API (DStreams), restricted by the batch runtime
Apache Samza
• True streaming built on top of Apache Kafka; state is a first-class citizen
• Slightly different stream notion, low-level API
Apache Flink
• True streaming over stateful distributed dataflow
• Rich functional API exploiting the streaming runtime, e.g. rich windowing semantics
Flink Streaming
What is Flink Streaming
▪ Native, low-latency stream processor
▪ Expressive functional API
▪ Flexible operator state, stream windows
▪ Exactly-once processing semantics
Native vs non-native streaming
Non-native streaming: a stream discretizer cuts the stream into small batches and issues one job per batch.
    while (true) {
      // get next few records
      // issue batch computation
    }
Native streaming: long-standing operators process the records one at a time.
    while (true) {
      // process next record
    }
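The two loops can be contrasted in a small, self-contained sketch. It uses plain Scala collections rather than Flink or Spark, and the names `StreamingModels`, `microBatch`, and `native` are illustrative assumptions, not part of any library.

```scala
object StreamingModels {
  // Non-native (micro-batch): a discretizer groups records into small batches,
  // and each batch is handed to a separate batch computation ("job").
  def microBatch[A, B](records: Iterator[A], batchSize: Int)(job: Seq[A] => B): List[B] =
    records.grouped(batchSize).map(batch => job(batch)).toList

  // Native: a long-standing operator is invoked once per record as it arrives.
  def native[A, B](records: Iterator[A])(operator: A => B): List[B] =
    records.map(operator).toList

  def main(args: Array[String]): Unit = {
    val input = List(1, 2, 3, 4, 5)
    println(microBatch(input.iterator, 2)(_.sum)) // List(3, 7, 5)
    println(native(input.iterator)(_ * 10))       // List(10, 20, 30, 40, 50)
  }
}
```

In the micro-batch variant a result appears only once a batch is complete, so latency is bounded below by the batch interval. The native variant can emit a result for every record.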
Computation vs Operator state
▪ Computation state
• Result of a continuous computation (stream)
• Typically produced by combining independent results (update function)
• Transformations cannot interact with it!
▪ Operator state
• Lives inside long-standing operators
• Transformations can interact with the state (e.g. a Map function that depends on internal state)
▪ Examples of stateful operators
• Pattern and anomaly detection
• Multipass machine learning algorithms
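A minimal sketch of the distinction, in plain Scala without any Flink dependency. `StateKinds`, `runningSum`, and `SpikeDetector` are hypothetical names chosen for illustration. The first function models computation state as the output of an update function. The second models operator state as a value that lives inside the operator and that the map function reads and updates.

```scala
object StateKinds {
  // Computation state: the running result itself, produced by an update
  // function that combines the previous result with each new record.
  def runningSum(records: Seq[Int]): Seq[Int] =
    records.scanLeft(0)(_ + _).tail

  // Operator state: a value kept inside a long-standing operator.
  // The map function depends on it and updates it on every record.
  final class SpikeDetector(threshold: Int) extends (Int => Boolean) {
    private var previous: Option[Int] = None // operator state

    def apply(value: Int): Boolean = {
      val spike = previous.exists(p => math.abs(value - p) > threshold)
      previous = Some(value)
      spike
    }
  }

  def main(args: Array[String]): Unit = {
    val readings = Seq(10, 11, 30, 29, 5)
    println(runningSum(readings))                 // List(10, 21, 51, 80, 85)
    println(readings.map(new SpikeDetector(10)))  // List(false, false, true, false, true)
  }
}
```

The detector is a toy stand-in for the anomaly detection case: its output for a record depends on what it has seen before, which is only possible if the operator outlives individual records.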
DataStream API
Overview of the API
▪ Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality
▪ Stream transformations
• Basic transformations: Map, Reduce, Filter, Aggregations…
• Binary stream transformations: CoMap, CoReduce…
• Windowing semantics: policy-based flexible windowing (Time, Count, Delta…)
• Temporal binary stream operators: Joins, Crosses…
• Native support for iterations
▪ Data stream outputs
▪ For details, please refer to the programming guide:
• http://flink.apache.org/docs/latest/streaming_guide.html
[Figure: example streaming dataflow with Src, Map, Filter, Merge, Sum, Reduce, and Sink operators]
Word count in Batch and Streaming
case class Word (word: String, frequency: Int)

DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()

DataStream API (streaming):
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .groupBy("word").sum("frequency")
  .print()
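To make the window in the streaming variant concrete, the following sketch mimics a sliding time window (size 5 seconds, evaluated every 1 second) over timestamped lines using plain Scala collections. It is an approximation for illustration only and not how Flink evaluates windows at runtime. `WindowedWordCount`, `slidingCounts`, and the integer second timestamps are assumptions introduced here.

```scala
object WindowedWordCount {
  case class Word(word: String, frequency: Int)

  // events: (timestamp in seconds, line of text).
  // For every evaluation time t (a multiple of `slide`), count the words of
  // all lines whose timestamp falls into the interval (t - size, t].
  def slidingCounts(events: Seq[(Int, String)], size: Int, slide: Int): Seq[(Int, Set[Word])] =
    if (events.isEmpty) Seq.empty
    else {
      val end = events.map(_._1).max
      (slide to end by slide).map { t =>
        val words = events
          .collect { case (ts, line) if ts > t - size && ts <= t => line }
          .flatMap(_.split(" "))
        val counts = words.groupBy(identity).map { case (w, ws) => Word(w, ws.size) }.toSet
        t -> counts
      }
    }

  def main(args: Array[String]): Unit = {
    val events = Seq((1, "a b"), (2, "a"), (7, "b"))
    slidingCounts(events, size = 5, slide = 1).foreach(println)
  }
}
```

Because the window slides by less than its size, a single line contributes to several consecutive results, which is the behaviour `.window(...).every(...)` expresses in the slide's code.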
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
Performance
▪ Performance optimizations
• Efficient serialization thanks to strongly typed topologies
• Operator chaining (thread sharing, no serialization)
• Various automatic query optimizations
▪ Competitive performance
• ~1.5 million events / sec / core
• For comparison, Storm promises ~1 million tuples / sec / node
Fault tolerance
Overview
▪ Fault tolerance in other systems
• Message tracking/acks (Storm)
• RDD re-computation (Spark)
▪ Fault tolerance in Apache Flink
• Based on consistent global snapshots
• Algorithm inspired by Chandy-Lamport
• Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Asynchronous Barrier Snapshotting for globally consistent checkpoints
▪ Pushes checkpoint barriers through the data flow
▪ Before barrier = part of the snapshot
▪ After barrier = not in snapshot
[Figure: a barrier travelling through the data stream and its operators, with the stages "Operator checkpoint starting", "checkpoint in progress (backup till next snapshot)", and "Checkpoint done"]
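The barrier idea can be illustrated with a deliberately simplified sketch: one operator, one input channel, and a synchronous in-memory copy of the state when a barrier arrives. It leaves out barrier alignment across several inputs, asynchronous writes, and persistent backends, so it is not Flink's implementation. All names (`BarrierSnapshotSketch`, `SummingOperator`, `Record`, `Barrier`) are hypothetical.

```scala
object BarrierSnapshotSketch {
  sealed trait Element
  final case class Record(value: Int) extends Element
  final case class Barrier(checkpointId: Long) extends Element

  // A summing operator whose state is snapshotted whenever a barrier passes.
  final class SummingOperator {
    private var sum: Int = 0
    private var snapshots: Map[Long, Int] = Map.empty

    def process(element: Element): Unit = element match {
      case Record(v)   => sum += v               // records before the barrier are in the snapshot
      case Barrier(id) => snapshots += (id -> sum) // state as of the barrier
    }

    def currentSum: Int = sum
    def snapshot(id: Long): Option[Int] = snapshots.get(id)

    // Recovery: roll the state back to a completed checkpoint, if it exists.
    def restore(id: Long): Unit = snapshots.get(id).foreach(s => sum = s)
  }

  def main(args: Array[String]): Unit = {
    val op = new SummingOperator
    Seq(Record(1), Record(2), Barrier(1L), Record(3)).foreach(op.process)
    println(op.currentSum)   // 6
    println(op.snapshot(1L)) // Some(3)
    op.restore(1L)
    println(op.currentSum)   // 3
  }
}
```

After a restore, the records that followed the barrier (here `Record(3)`) would have to be replayed from the source. That replay is what turns a consistent snapshot into exactly-once state updates.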
State management
▪ State declared in the operators is managed and checkpointed by Flink
▪ Pluggable backends for storing persistent snapshots
• Currently: JobManager, FileSystem (HDFS, Tachyon)
▪ State partitioning and flexible scaling in the future
Closing
Streaming roadmap for 2015
▪ Improved state management
• New backends for state snapshotting
• Support for state partitioning and incremental snapshots
• Master failover
▪ Improved job monitoring
▪ Integration with other Apache projects
• SAMOA (PR ready), Zeppelin (PR ready), Ignite
▪ Streaming machine learning and other new libraries
flink.apache.org
@ApacheFlink

Flink Streaming @BudapestData
