Aljoscha Krettek
aljoscha@apache.org
@aljoscha
Apache Flink™
A Next-Generation Stream Processor
2
1. What is streaming?
2. What’s the technological
landscape?
3. Why is Apache Flink
special?
3
This is a stream
4
This is not a stream
5
Infinite data
vs.
finite data
infinite data → streaming
finite data → batch
6
Sometimes it
can feel like
this…
7
Stream Processing in a
Nutshell
 Infinite stream of incoming data
 We want up-to-date results
 Don’t wait for the nightly batch job
8
Some examples…
9
Tracking
user
satisfaction
in web
shops
10
Financial
transactions
Fraud
detection
11
Mobile
carriers
Online
tracking of
gaming
stats
13
14
What do they all have in
common?
Counting things over
certain periods of time.
A (Parallel) Streaming
Architecture
15
16
Why would I need a
parallel stream processor?
(tweet, #hello moe) → (?)
17
Remember this guy?
18
Parallel Stream Processing
(tweet, #hello sue) → (?)
(tweet, #hello poe) → (?)
(tweet, #hello moe) → (?)
19
LOG
Stream
Processor
20
Stream
Processor
Kafka
21
Parallel Stream Processors
22
Stream
Processor
Kafka
23
What does Flink provide?
24
// create a stream from a Kafka source
DataStream<LogEvent> stream =
    env.addSource(new FlinkKafkaConsumer(...));

// group by country (keyBy returns a KeyedStream, which offers timeWindow)
KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");

keyedStream
    .timeWindow(Time.minutes(60))          // window of size 1 hour
    .apply(new CountPerWindowFunction());  // do operations per window
Counting with the Flink API
25
From API to Topology
Job Graph: Kafka Source → “count” Operator → Kafka Sink
26
Master
Worker Worker Worker
A Flink Cluster
27
All Together, Parallel
Master + Workers
28
What Makes Flink Special?
A Next-Generation Stream Processor
29
30
Disclaimer
 Some of this stuff is in Flink right now
 Some will probably make it into the
next release
 We (dataArtisans) don’t control the
Flink Roadmap, the community does
32
A Streaming Pipeline
Kafka → “count” operator → Kafka
count user interactions per 1-hour window
33
Interlude: Stateful vs.
Stateless
(tweet, #hello moe) → operator → (?)
34
Stateless
(tweet, #hello moe) → “ciao” operator → (tweet, #ciao moe)
The operation can only look at one element at a time.
35
Stateful
(tweet, #hello moe) → “count” operator → (moe mentioned 5 times)
The operation can keep information about past elements.
36
Stateful
 Aggregation
 Complex Event Processing (CEP)
 Machine learning models
Stateless
 Ingestion
 Data cleansing
 Stateless transformations
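To make the stateful case concrete, here is a minimal sketch (not from the original slides) of a keyed count kept in Flink managed state; the (user, 1L) tuple input and the class and state names are illustrative.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts how often each user has been mentioned so far; the count lives in per-key state.
public class MentionCounter
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("mention-count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> mention,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        long soFar = count.value() == null ? 0L : count.value();
        soFar += mention.f1;
        count.update(soFar);
        out.collect(Tuple2.of(mention.f0, soFar)); // e.g. ("moe", 5)
    }
}

// usage (illustrative): mentions.keyBy(m -> m.f0).flatMap(new MentionCounter());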
37
Back to the main story…
38
A Streaming Pipeline
(again)
Kafka → “count” operator → Kafka
count user interactions per 1-hour window
39
Problem?
Results only arrive by the
hour.
40
Solution:
Queryable State
41
Internal State
State: moe → 5, sue → 12, poe → 2
timer service
42
Queryable State
Kafka → “count” operator → Kafka
query: count for “moe”? → 5
Queryable State at Twitter: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
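A rough sketch of what exposing and querying that internal state can look like; it assumes the queryable state API as it later shipped in Flink (state made queryable from 1.2 on, client shown as of 1.4), and the state name "hourly-counts", the proxy host, and the job id are placeholders. Exception handling is omitted since this is a fragment.

import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

// --- Job side: mark the keyed count state as queryable under an external name ---
ValueStateDescriptor<Long> countDescriptor =
        new ValueStateDescriptor<>("count", Long.class);
countDescriptor.setQueryable("hourly-counts");   // external name (placeholder)
// inside a rich function on the keyed stream:
// ValueState<Long> count = getRuntimeContext().getState(countDescriptor);

// --- Client side: ask the running job for the current count of "moe" ---
QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);
CompletableFuture<ValueState<Long>> future =
        client.getKvState(jobId, "hourly-counts", "moe",
                BasicTypeInfo.STRING_TYPE_INFO, countDescriptor);
Long countForMoe = future.get().value();         // e.g. 5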
43
Back to our Streaming
Pipeline (…again, really?)
Kafka → “count” operator → Kafka
count user interactions per 1-hour window
44
What happens if you
need/want to…
 change the number of workers
 migrate to a different cluster
 fix a bug in your code
 fix a bug in our code (Flink)
 test different versions of an
algorithm
45
Stateless job → easy
Just stop and restart the
job
46
Stateful job → tricky
State needs to be
re-loaded/re-distributed
47
Savepoints
“create savepoint”
“change program”
“restart from savepoint”
48
Sessionization*
* also called “session windows”
 Based on the timestamp of
events
 Turns out this is tricky to do
right
 Flink supports this out-of-the-box with version 1.1 (see the sketch below)
 ask me about this afterwards☺
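A minimal session-window sketch against the DataStream API; the 30-minute gap, the "userId" key field, and CountPerSessionFunction are illustrative placeholders, not code from the talk.

import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

stream
    .keyBy("userId")                                            // placeholder key field
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))  // a session closes after 30 min of inactivity
    .apply(new CountPerSessionFunction());                      // illustrative window function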
Closing
49
50
There is more cool stuff
 Dynamic rescaling of streaming jobs
 SQL on streams
 Windowing API improvements
 Running on Mesos
51
tl;dl*
 Stream processing is the cool new
thing
 Flink is already very good at it
 There is plenty of interesting stuff
coming up
* too long, didn’t listen
52
 Follow @ApacheFlink, @dataArtisans
 Read flink.apache.org/blog, data-artisans.com/blog
 Subscribe (news | user | dev) @ flink.apache.org
Join the Community!
We are hiring!
data-artisans.com/careers
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
Appendix
Yahoo! Streaming Benchmark
56
57
Performance
• Performance always depends on your own use
cases, so test it yourself!
• We based our experiments on a recent
benchmark published by Yahoo!
• They benchmarked Storm, Spark Streaming
and Flink with a production use-case (counting
ad impressions)
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
58
Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit current value of window aggregates to
Redis every second for query
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
59
Flink and Storm usually at sub-second latencies
Spark latency increases with throughput, at 8 sec
Results (lower is better)
60
• Benchmark stops at Storm’s throughput limits.
Where is Flink’s limit?
• How will Flink’s own window implementation
perform compared to Yahoo’s “state in redis
windowing” approach?
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Extending the benchmark
61
KafkaConsumer → map() → filter() → group → windowing & caching code → realtime queries
Windowing with State in Redis
62
KafkaConsumer → map() → filter() → group → Flink event time windows → realtime queries
Rewrite to use Flink’s own Windowing
63
[Bar chart, throughput in msgs/sec: Storm at 400k msgs/sec vs. Flink at roughly 3 million msgs/sec]
Results after Rewrite
64
KafkaConsumer → map() → filter() → group → Flink event time windows
Network link to the Kafka cluster is the bottleneck! (1 GigE)
Data Generator → map() → filter() → group → Flink event time windows
Solution: Move the data generator into the job (10 GigE)
Can we go further?
65
[Bar chart, throughput in msgs/sec: Storm 400k msgs/sec; Flink 3m msgs/sec; Flink with 10 GigE end-to-end 15m msgs/sec]
Results without Network Bottleneck
66
• Flink achieves throughput of 15 million
messages/second on 10 machines
• 35x higher throughput compared to Storm
(80x compared to Yahoo’s runs)
• Flink ran with exactly once guarantees, Storm
with at least once.
• Read the full report: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Benchmark Summary
Appendix 2
67
Roadmap 2016
68
• SQL / StreamSQL
• CEP Library
• Dynamic Scaling
• Miscellaneous
Miscellaneous
• Support for Apache Mesos
• Security
– Over-the-wire encryption of RPC (akka) and data
transfers (netty)
• More connectors
– Apache Cassandra
– Amazon Kinesis
• Enhance metrics
– Throughput / Latencies
– Backpressure monitoring
– Spilling / Out of Core
69
Fault Tolerance and correctness
70
• How can we ensure the state is always in sync with the events?
[Diagram: per-operator event counters (4, 3, 4, 2) feeding a final operator]
Naïve state checkpointing approach
71
• Process some records:
• Stop everything,
store state:
• Continue processing …
[Diagram: counters advance from (0, 0, 0, 0) to (1, 1, 2, 2); the stored operator state maps a → 1, b → 1, c → 2, d → 2]
Distributed Snapshots
72
[Diagram: initial state (0, 0, 0, 0); processing starts and counters become (1, 1, 0, 0); a checkpoint is triggered; operator state so far: a → 1, b → 1]
Distributed Snapshots
73
[Diagram: the barrier flows with the events — counters (2, 1, 2, 0), operator state a → 1, b → 1, c → 2; checkpoint completed at (2, 1, 2, 2) with operator state a → 1, b → 1, c → 2, d → 2]
• Valid snapshot without stopping the topology
• Multiple checkpoints can be in-flight
Complete, consistent state snapshot
Analysis of naïve approach
 Introduces latency
 Reduces throughput
• Can we create a correct snapshot while
keeping the job running?
• Yes! By creating a distributed snapshot
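From the user’s point of view, turning on these distributed snapshots is a single configuration call. A minimal sketch; the 60-second interval is illustrative.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);                            // draw a distributed snapshot every 60 s
env.getCheckpointConfig()
   .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);   // the default mode, shown explicitly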
74
Handling Backpressure
75
An operator that is not able to process incoming data immediately has to slow down upstream operators.
Backpressure might occur when:
• Operators create checkpoints
• Windows are evaluated
• Operators depend on external resources
• JVMs do Garbage Collection
Handling Backpressure
76
[Diagram: senders and receivers exchanging full and empty network buffers — via network transfer (Netty) or local buffer exchange when sender and receiver are on the same machine]
• If a sender does not have any empty buffers available, it slows down
• Data sources slow down pulling data from their underlying system (Kafka or similar queues)
How do latency and throughput affect each
other?
[Chart: 30 machines, one repartition step]
[Diagram: senders send a buffer to receivers when it is full or when a timeout fires]
• High throughput by batching events in network buffers
• Filling the buffers introduces latency
• Configurable buffer timeout (see the sketch below)
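A sketch of how that buffer timeout is tuned on the execution environment; the values are illustrative (100 ms is the default), and the same knob can also be set on individual operators. It assumes the StreamExecutionEnvironment "env" from the earlier snippets.

env.setBufferTimeout(10);    // flush network buffers after at most 10 ms: lower latency
// env.setBufferTimeout(0);  // flush after every record: minimal latency, lower throughput
// env.setBufferTimeout(-1); // flush only when a buffer is full: maximal throughput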
Aggregate throughput for stream record
grouping
78
[Bar chart: aggregate throughput of stream record grouping on 30 machines, 120 cores (Google Compute) — Flink (no fault tolerance), Flink (exactly once), Storm (no fault tolerance), Storm (at least once); annotations: aggregate throughput of 83 million elements per second, 8.6 million elements/s, 309k elements/s]
 Flink achieves 260x higher throughput with fault tolerance
Performance: Summary
79
Continuous streaming + latency-bound buffering + distributed snapshots
→ High Throughput & Low Latency, with a configurable throughput/latency tradeoff
The building blocks: Summary
80
Low latency / High throughput
• Efficient, pipelined runtime
• No per-record operations
• Tunable latency / throughput tradeoff
• Async checkpoints
Windowing / Out-of-order events
• Tumbling / sliding windows
• Event time / processing time
• Low watermarks for out-of-order events
State handling
• Managed operator state for backup/recovery
• Large state with RocksDB (sketch below)
• Savepoints for operations
Fault tolerance and correctness
• Exactly-once semantics for managed operator state
• Lightweight, asynchronous distributed snapshotting algorithm
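As a concrete example of the “large state with RocksDB” bullet above, a minimal sketch of configuring the RocksDB state backend; the checkpoint URI is a placeholder, the flink-statebackend-rocksdb dependency is assumed, and exception handling is omitted.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;

// keep keyed state in RocksDB on local disk and checkpoint it to a distributed filesystem
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints")); // URI is a placeholder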
Low Watermarks
• We periodically send low-watermarks through
the system to indicate the progression of
event time.
81
For more details: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale” by T. Akidau et al.
[Diagram: a stream of events with timestamps and a low watermark of 5 — a guarantee that no event with time <= 5 will arrive afterwards; the window between 0 and 15 is evaluated when the watermark arrives]
Low Watermarks
82
For more details: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale” by T. Akidau et al.
Operators with multiple inputs always forward the lowest watermark.
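A small sketch of how an application supplies event timestamps and periodic low watermarks in the DataStream API; the 10-second out-of-orderness bound and the LogEvent.getTimestamp() accessor are assumptions, and the extractor class shipped in later 1.x releases.

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<LogEvent> withTimestamps = stream.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<LogEvent>(Time.seconds(10)) {
        @Override
        public long extractTimestamp(LogEvent event) {
            return event.getTimestamp();   // event time in epoch milliseconds (assumed accessor)
        }
    });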
Bouygues Telecom
83
Bouygues Telecom
84
Bouygues Telecom
85
Capital One
86
Fault Tolerance in streaming
• Failure with “at least once”: replay
87
[Diagram — restore from: (4, 3, 4, 2); final result after replay: (7, 5, 9, 7)]
Fault Tolerance in streaming
• Failure with “exactly once”: state restore
88
[Diagram — restore from: (1, 1, 2, 2); final result: (4, 3, 7, 7)]
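The two failure behaviours above correspond to the two checkpointing modes a job can choose; a minimal sketch, with an illustrative interval and the "env" from the earlier snippets.

// cheaper (no barrier alignment), but results may be over-counted after a failure:
env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);
// the default: state is restored consistently, so counts stay correct:
// env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);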
Latency in stream record grouping
89
[Diagram: Data Generator feeding the job, Receiver measuring throughput / latency]
• Measure the time for a record to travel from source to sink
[Bar charts: median latency (annotations: 25 ms, 1 ms) and 99th percentile latency (annotation: 50 ms) for Flink (no fault tolerance), Flink (exactly once), and Storm (at least once)]
Savepoints: Simplifying Operations
• Streaming jobs usually run 24x7 (unlike batch).
• Application bug fixes: Replay your job from a
certain point in time (savepoint)
• Flink bug fixes
• Maintenance and system migration
• What-If simulations: Run different
implementations of your code against a
savepoint
90
Pipelining
91
Basic building block to “keep the data moving”
• Low latency
• Operators push data forward
• Data shipping as buffers, not tuple-wise
• Natural handling of back-pressure
Apache Flink(tm) - A Next-Generation Stream Processor

Editor's Notes

  • #79 Flink: 720,000 events per second per core; 690,000 with checkpointing activated. Storm with at-least-once: 2,600 events per second per core.
  • #80 People previously made the case that high throughput and low latency are mutually exclusive
  • #85 SLA
  • #90 Flink: 720,000 events per second per core; 690,000 with checkpointing activated. Storm with at-least-once: 2,600 events per second per core.