Flink Streaming is the real-time data processing framework of Apache Flink. Flink Streaming provides high-level functional APIs in Scala and Java, backed by a high-performance, true-streaming runtime.
2. What is Flink Streaming
• Part of Apache Flink
• Real-time data processing
• High performance
• Expressive functional APIs
• Programmable in Java or Scala
3. This Talk
• General introduction
• Flink Streaming APIs
• Running Flink programs
• Overview of Flink internals
• Development roadmap
• Summary
• Questions
4. Overview of stream processing trends
Apache Storm
• True streaming, low latency - lower throughput
• Low-level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing on top of a batch system, high throughput - higher latency
• Functional API (DStreams), restricted by the batch runtime
Flink Streaming
• True streaming with an adjustable latency-throughput trade-off
• Rich functional API exploiting the streaming runtime, e.g. rich windowing semantics
5. Programming model
Data abstraction: Data Stream
[Diagram: a program as operators X and Y connected by Data Streams A, B, and C; in parallel execution each stream is partitioned (A(1), A(2), B(1), B(2), C(1), C(2)) across parallel instances of the operators.]
7. Word count – Java
DataStream<String> text = env.socketTextStream(host, port);

DataStream<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .sum(1);
[Diagram: Socket stream → Map → Reduce → Output stream]
8. Word count - Scala
case class Word(word: String, count: Long)

val input = env.socketTextStream(host, port)

val words = input flatMap {
  line => line.split("\\W+").map(Word(_, 1))
}

val counts = words groupBy "word" sum "count"
[Diagram: Socket stream → Map → Reduce → Output stream]
9. Overview of the API
• Data stream sources
  – File system
  – Message queue connectors
  – Arbitrary source functionality
• Stream transformations
  – Basic transformations: Map, Reduce, Filter, Aggregations…
  – Windowing semantics: policy-based flexible windowing (Time, Count, Delta…)
  – Binary stream transformations: CoMap, CoReduce…
  – Temporal binary stream operators: Joins, Crosses…
  – Iterative stream transformations
• Data stream outputs
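To make the overview concrete, here is a minimal end-to-end pipeline sketch in the 2015-era Scala API used throughout this deck, wiring a socket source, two basic transformations, and a print sink. Host and port are placeholders:

import org.apache.flink.streaming.api.scala._

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Source: lines of text from a socket
    val lines = env.socketTextStream("localhost", 9999)

    // Basic transformations: Filter and Map from the list above
    val shouting = lines
      .filter(_.nonEmpty)
      .map(_.toUpperCase)

    // Output: print the resulting stream
    shouting.print()

    env.execute("Pipeline sketch")
  }
}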
10. Data stream sources
• Process data from anywhere
• File-system sources
• Socket stream
• Message queues
  – Kafka
  – RabbitMQ
  – Flume
• Scala/Java collections, streams, sequence generators for development & testing
• Arbitrary source functionality using the SourceFunction interface
  – Only have to implement an invoke(out: Collector) method
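As a hedged sketch of the last bullet: a custom source only needs to implement invoke(out: Collector), per this slide. The import path below is an assumption for the 2015-era API (newer Flink versions use run(SourceContext) instead), and CounterSource is an illustrative name:

import org.apache.flink.streaming.api.function.source.SourceFunction
import org.apache.flink.util.Collector

// Emits the numbers 0 until `limit`, then finishes; handy for testing.
class CounterSource(limit: Long) extends SourceFunction[Long] {
  override def invoke(out: Collector[Long]): Unit = {
    var i = 0L
    while (i < limit) {
      out.collect(i)
      i += 1
    }
  }
}

// Usage: env.addSource(new CounterSource(100)).print()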
11. Basic transformations
• Rich set of functional transformations: Map, FlatMap, Reduce, GroupReduce, Filter, Project…
• Aggregations by field name or position: Sum, Min, Max, MinBy, MaxBy, Count…
[Diagram: a streaming topology chaining Source, Map, FlatMap, Reduce, Merge, and Sum operators into a Sink]
12. Windowing
• Flexible policy-based windowing
• Trigger and eviction policies
• Built-in policies:
  – Time: Time.of(length, TimeUnit/custom timestamp)
  – Count: Count.of(windowSize)
  – Delta: Delta.of(threshold, delta function, start value)
• Window transformations:
  – Reduce
  – ReduceGroup
  – Grouped Reduce/ReduceGroup
• Custom trigger and eviction policies can also be implemented easily
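By analogy with the time-based example on the next slide, a count-based window might look like the following sketch; `numbers` is a placeholder DataStream of integers, and the method names follow the pre-1.0 policies listed above:

// Keep the last 100 elements (eviction), re-evaluate every 10 arrivals (trigger)
val windowedSum = numbers
  .window(Count.of(100))
  .every(Count.of(10))
  .reduce(_ + _)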
13. Windowing example
// Build a new model every minute on the last 5 minutes' worth of data
val model = trainingData
  .window(Time.of(5, MINUTES))
  .every(Time.of(1, MINUTES))
  .reduceGroup(buildModel)

// Predict new data using the most up-to-date model
val prediction = newData
  .connect(model)
  .map(predict)

[Diagram: the Training Data stream feeds a model-building operator M; the New Data stream and M's output connect into a prediction operator P, which emits the Prediction stream.]
14. Temporal operators
• Binary stream operators that work on time windows
• Database-style operators:
  – Join: s1.join(s2).onWindow(…).every(…).where(key1).equalTo(key2)
  – Cross: s1.cross(s2).onWindow(…).every(…)
• UDFs can also be used for custom operator logic on the elements in the windows
15. Window Join example
case class Name(id: Long, name: String)
case class Age(id: Long, age: Int)
case class Person(name: String, age: Int)

val names = ...
val ages = ...

names.join(ages)
  .onWindow(5, SECONDS)
  .where("id")
  .equalTo("id") { (n, a) => Person(n.name, a.age) }
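For comparison, a Cross over the same streams would pair every Name with every Age in the window, with no key condition. This sketch is based only on the syntax from slide 14; the exact result handling in the pre-1.0 API is an assumption:

// All Name/Age pairs that fall into the same 5-second window
val pairs = names.cross(ages)
  .onWindow(5, SECONDS)
  .every(5, SECONDS)
// `pairs` then carries (Name, Age) combinations that a map can turn into Persons.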
16. Iterative stream processing
[Diagram: a step function T emits an output stream R plus a feedback stream that loops back into T.]

def iterate[R](
    stepFunction: DataStream[T] => (DataStream[T], DataStream[R]),
    maxWaitTimeMillis: Long = 0): DataStream[R]
17. Iterative processing example
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.generateSequence(1, 10)
  .iterate(incrementToTen, 1000)
  .print

env.execute("Iterative example")

def incrementToTen(input: DataStream[Long]) = {
  val incremented = input.map { _ + 1 }
  val split = incremented.split {
    x => if (x >= 10) "out" else "feedback"
  }
  (split.select("feedback"), split.select("out"))
}

[Diagram: number stream → Map → iteration with a "feedback" loop and an "out" branch → output stream]
19. Flink programs run everywhere
[Diagram: the same program runs on a cluster (batch), locally for debugging, as Java collection programs, or embedded (e.g., in a web container), all on top of the Flink runtime or Apache Tez.]
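The execution environment decides where a program runs. getExecutionEnvironment, used in the examples above, picks the right one automatically; the explicit variants below are a sketch with placeholder host, port, and jar values:

// Run in the local JVM, e.g. for debugging
val local = StreamExecutionEnvironment.createLocalEnvironment()

// Submit to a running cluster, shipping the program jar
val remote = StreamExecutionEnvironment.createRemoteEnvironment(
  "jobmanager-host", 6123, "/path/to/program.jar")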
20. Little tuning or configuration needed
• Requires no memory thresholds to configure
  – Flink manages its own memory
• Requires no complicated network configs
  – The pipelining engine requires much less memory for data exchange
• Requires no serializers to be configured
  – Flink handles its own type extraction and data representation
• Programs can be adjusted to data automatically
  – Flink’s optimizer can choose execution strategies automatically
22. Distributed runtime
• The master (Job Manager) handles job submission, scheduling, and metadata
• Workers (Task Managers) execute operations
• Data can be streamed between nodes
• Data output is buffered for higher throughput (tunable)
23. Hybrid batch/streaming
• True data streaming on the runtime layer
• Data-flow-based runtime
• No unnecessary synchronization steps
• Batch and stream processing seamlessly integrated
26. Fault tolerance
• At-least-once semantics
  – Currently an alpha version
  – Source-level in-memory replication
  – Record acknowledgments
• Exactly-once semantics
  – Final goal, current research
  – Upstream backup with state checkpointing
27. Lambda architecture
[Diagram: the Lambda architecture as implemented in other systems.]
Source: https://www.mapr.com/developercentral/lambda-architecture
32. Summary
• Flink combines a true streaming runtime with expressive high-level APIs for a next-gen stream processing solution
• Tunable throughput-latency trade-off with competitive performance at both ends
• Iterative processing support opens new horizons in online machine learning
• We are just getting started!
  – Lambda architecture
  – Integrations
33. Where to find us
flink.apache.org
github.com/apache/flink
@ApacheFlink
gyfora@apache.org