Stream Processing with
Apache Flink
Robert Metzger
@rmetzger_
rmetzger@apache.org
QCon London,
March 7, 2016
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/stream-processing-apache-flink
Presented at QCon London
www.qconlondon.com
Talk overview
- My take on the stream processing space, and how it changes the way we think about data
- Discussion of unique building blocks of Flink
- Benchmarking Flink, by extending a benchmark from Yahoo!
Apache Flink
- Apache Flink is an open source stream processing framework
• Low latency
• High throughput
• Stateful
• Distributed
- Developed at the Apache Software Foundation, 1.0.0 release available soon, used in production
Entering the streaming era
Streaming is the biggest change in
data infrastructure since Hadoop
1. Radically simplified infrastructure
2. Do more with your data, faster
3. Can completely subsume batch
Traditional data processing
- Log analysis example using a batch processor
[Diagram: web servers write logs; periodic (custom) or continuous ingestion (Flume) loads the logs into HDFS / S3; a job scheduler (Oozie) triggers the periodic batch job(s) for log analysis every 2 hrs; the results feed the serving layer]
- Latency from log event to serving layer is usually in the range of hours
Data processing without stream processor
[Diagram: web server logs are loaded into HDFS / S3 and analyzed by batch job(s) for log analysis; batch interval: ~2 hours]
- This architecture is a hand-crafted micro-batch model
[Figure: latency axis (hours, minutes, seconds, milliseconds) with three approaches placed along it: manually triggered periodic batch job, batch processor with micro-batches, stream processor]
Downsides of stream processing with a batch engine
- Very high latency (hours)
- Complex architecture required:
• Periodic job scheduler (e.g. Oozie)
• Data loading into HDFS (e.g. Flume)
• Batch processor
• (When using the “lambda architecture”: a stream processor)
All these components need to be implemented and maintained
- Backpressure: How does the pipeline handle load spikes?
Log event analysis using a stream processor
- Stream processors make it possible to analyze events with sub-second latency.
[Diagram: web servers forward events immediately to a high-throughput publish/subscribe bus; a stream processor processes the events in real time and updates the serving layer]
Publish/subscribe bus options:
• Apache Kafka
• Amazon Kinesis
• MapR Streams
• Google Cloud Pub/Sub
Stream processor options:
• Apache Flink
• Apache Beam
• Apache Samza
Real-world data is produced in a continuous fashion.
New systems like Flink and Kafka embrace the streaming nature of data.
[Diagram: web server → Kafka topic → stream processor]
What do we need for replacing the “batch stack”?
[Diagram: web servers forward events immediately to a high-throughput publish/subscribe bus (options: Apache Kafka, Amazon Kinesis, MapR Streams, Google Cloud Pub/Sub); a stream processor (options: Apache Flink, Google Cloud Dataflow) processes the events in real time and updates the serving layer]
Building blocks needed in the stream processor:
• Low latency
• High throughput
• State handling
• Windowing / Out of order events
• Fault tolerance and correctness
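Apache Flink stack
[Diagram of the Flink stack, from top to bottom:
Libraries and integrations: Gelly, Table, ML, SAMOA, Apache Beam, Hadoop M/R, Cascading, Storm API, Zeppelin, CEP
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Cluster, YARN]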
Needed for the use case
[Diagram: the Apache Flink stack again (Gelly, Table, ML, SAMOA, Apache Beam, Hadoop M/R, Cascading, Storm API, Zeppelin, CEP on top of the DataSet and DataStream APIs, the streaming dataflow runtime, and Local / Cluster / YARN deployment), marking the parts needed for the use case]
Windowing / Out of order
events
Building windows from a stream
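- “Number of visitors in the last 5 minutes per country”
[Diagram: web server → Kafka topic → stream processor]
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// create stream from Kafka source (KafkaConsumer is the slide's shorthand for a configured consumer)
DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
// group by country; keyBy returns a KeyedStream, not a plain DataStream
KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");
// window of size 5 minutes
keyedStream.timeWindow(Time.minutes(5))
    // do operations per window
    .apply(new CountPerWindowFunction());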
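The grouping and windowing steps of this example can be illustrated without a Flink dependency. The following is a minimal plain-Java sketch, not Flink's implementation: the class name `TumblingWindowCount`, its methods, and the millisecond timestamps are assumptions made for illustration. It assigns each event to the 5-minute window that contains its timestamp and counts events per country and window.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (hypothetical names, no Flink dependency):
// count visitors per country in tumbling 5-minute windows.
final class TumblingWindowCount {
    static final long WINDOW_SIZE_MS = 5 * 60 * 1000L;

    // Key is "country@windowStart"; value is the number of events seen.
    private final Map<String, Long> counts = new TreeMap<>();

    // Start of the tumbling window that contains the timestamp.
    // Assumes non-negative timestamps.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // "keyBy country" plus window assignment, then increment the count.
    void onEvent(String country, long timestampMs) {
        String key = country + "@" + windowStart(timestampMs);
        counts.merge(key, 1L, Long::sum);
    }

    long count(String country, long windowStartMs) {
        return counts.getOrDefault(country + "@" + windowStartMs, 0L);
    }

    public static void main(String[] args) {
        TumblingWindowCount job = new TumblingWindowCount();
        job.onEvent("UK", 10_000L);   // window starting at 0
        job.onEvent("UK", 299_999L);  // still the window starting at 0
        job.onEvent("UK", 300_000L);  // next window, starting at 300000
        job.onEvent("DE", 20_000L);
        System.out.println("UK, first window: " + job.count("UK", 0L));
        System.out.println("UK, second window: " + job.count("UK", 300_000L));
    }
}
```

In a real job the counts would be emitted when a window closes instead of being kept in one map; the sketch only shows how events map to keys and windows.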
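Building windows: Execution
// window of size 5 minutes
keyedStream.timeWindow(Time.minutes(5));
[Diagram: the job plan (Kafka source → group by country → window operator) next to its parallel execution on the cluster, where parallel source instances (S) send events to parallel window instances (W) over time]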
Window types in Flink
- Tumbling windows
- Sliding windows
- Custom windows with window assigners, triggers and evictors
Further reading: http://flink.apache.org/news/2015/12/04/Introducing-windows.html
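The difference between tumbling and sliding windows comes down to how many windows one element belongs to. The sketch below is a plain-Java illustration under stated assumptions: the class `SlidingWindowAssigner` and its method are hypothetical names, windows are half-open intervals [start, start + size), and window starts are aligned to multiples of the slide. With size equal to slide it degenerates to a tumbling window.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (hypothetical names): which sliding windows contain an element?
final class SlidingWindowAssigner {

    // Returns the start of every window [start, start + size) that contains
    // the timestamp, newest window first. Assumes timestamp >= 0,
    // size > 0 and slide > 0.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestamp - (timestamp % slide);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Sliding: size 10, slide 5. Timestamp 7 falls into [5, 15) and [0, 10).
        System.out.println(windowStarts(7, 10, 5));
        // Tumbling: size 5, slide 5. Timestamp 7 falls only into [5, 10).
        System.out.println(windowStarts(7, 5, 5));
    }
}
```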
Time-based windows
[Diagram: a stream of events, each carrying event data and the time of the event]
{
"accessTime": "1457002134",
"userId": "1337",
"userLocation": "UK"
}
- Windows are created based on the real-world time when the event occurred
A look at the reality of time
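- Events arrive out of order in the system
- Use-case specific low watermarks for time tracking
[Diagram: events with timestamps 33, 11, 21, 15, 9 travel from Kafka to the stream processor, out of order because of network delays and out-of-sync clocks. A watermark with value 15 is the guarantee that no event with time <= 15 will arrive afterwards; it relates to the window between 0 and 15]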
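How a watermark lets an operator close a window despite out-of-order events can be sketched in a few lines. This is a simplified plain-Java illustration, not Flink's window operator: the class `WatermarkWindowSketch` is a hypothetical name, windows are assumed to be half-open intervals [start, start + size), and a window is considered complete once the watermark has reached its last possible timestamp (end - 1). Events that arrive after their window has fired are not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (hypothetical names): buffer events per window and
// evaluate a window only when a watermark says it is complete.
final class WatermarkWindowSketch {
    private final long windowSize;
    // Window start -> timestamps buffered for that window, ordered by start.
    private final TreeMap<Long, List<Long>> buffered = new TreeMap<>();
    // One entry "windowStart:count" per evaluated window.
    final List<String> fired = new ArrayList<>();

    WatermarkWindowSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // Events may arrive in any order; they are only buffered here.
    void onEvent(long timestamp) {
        long start = timestamp - (timestamp % windowSize);
        buffered.computeIfAbsent(start, k -> new ArrayList<>()).add(timestamp);
    }

    // A watermark w promises that no event with time <= w arrives afterwards,
    // so every window whose last timestamp (end - 1) is <= w can be evaluated.
    void onWatermark(long watermark) {
        while (!buffered.isEmpty()
                && buffered.firstKey() + windowSize - 1 <= watermark) {
            Map.Entry<Long, List<Long>> window = buffered.pollFirstEntry();
            fired.add(window.getKey() + ":" + window.getValue().size());
        }
    }

    public static void main(String[] args) {
        WatermarkWindowSketch op = new WatermarkWindowSketch(15);
        op.onEvent(9);
        op.onEvent(15);
        op.onEvent(21);
        op.onEvent(11);      // out of order, still lands in window [0, 15)
        op.onWatermark(15);  // window [0, 15) is complete and fires
        op.onEvent(33);
        System.out.println(op.fired);
    }
}
```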
Time characteristics in Apache Flink
- Event Time
• Users have to specify an event-time extractor + watermark emitter
• Results are deterministic, but with latency
- Processing Time
• System time is used when evaluating windows
• Low latency
- Ingestion Time
• Flink assigns current system time at the sources
- Pluggable, without window code changes
State handling
State in streaming
- Where do we store the elements from our windows?
- In stateless systems, an external state store (e.g. Redis) is needed.
[Diagram: parallel sources (S) feed parallel window operators (W) over time; the elements in windows are state]
Managed state in Flink
- Flink automatically backs up and restores state
- State can be larger than the available memory
- State backends: (embedded) RocksDB, heap memory
[Diagram: web server → Kafka → stream processor (Flink). An operator with windows (large state) keeps its state in a local state backend, with periodic backup to and recovery from a distributed file system]
Managing the state
- How can we operate such a pipeline 24x7?
- Losing state (by stopping the system) would require a replay of past events
- We need a way to store the state somewhere!
[Diagram: web server → Kafka topic → stream processor]
Savepoints: Versioning state
- Savepoint: Create an addressable copy of a job’s current state.
- Restart a job from any savepoint.
> flink savepoint <JobID>
> hdfs:///flink-savepoints/2
> flink run -s hdfs:///flink-savepoints/2 <jar>
[Diagram: the savepoint is written to HDFS, and the restarted job reads it back from HDFS]
Further reading: http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/
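The idea of a savepoint as an addressable, versioned copy of state can be sketched in plain Java. Everything below is an illustrative assumption rather than Flink's mechanism: the class `SavepointSketch`, the in-memory map standing in for HDFS, and the generated paths are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (hypothetical names): savepoints as addressable state copies.
final class SavepointSketch {
    // Operator state of the running "job": visitor count per country.
    private Map<String, Long> state = new HashMap<>();
    // Stand-in for durable storage such as HDFS: path -> copy of the state.
    private final Map<String, Map<String, Long>> store = new HashMap<>();
    private int nextId = 1;

    void onEvent(String country) {
        state.merge(country, 1L, Long::sum);
    }

    // Create an addressable copy of the current state and return its path.
    String savepoint() {
        String path = "savepoints/" + nextId++;
        store.put(path, new HashMap<>(state));
        return path;
    }

    // Restart from any savepoint: the current state is replaced by the copy.
    void restoreFrom(String path) {
        Map<String, Long> saved = store.get(path);
        if (saved == null) {
            throw new IllegalArgumentException("unknown savepoint: " + path);
        }
        state = new HashMap<>(saved);
    }

    long count(String country) {
        return state.getOrDefault(country, 0L);
    }

    public static void main(String[] args) {
        SavepointSketch job = new SavepointSketch();
        job.onEvent("UK");
        job.onEvent("UK");
        String first = job.savepoint();
        job.onEvent("UK");
        job.restoreFrom(first); // go back to the earlier version of the state
        System.out.println(first + " -> UK count " + job.count("UK"));
    }
}
```

Because every savepoint keeps its own copy, a job can be restarted from an older one even after newer savepoints exist.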
Fault tolerance and
correctness
Fault tolerance in streaming
- How do we ensure the results (number of visitors) are always correct?
- Failures should not lead to data loss or incorrect results
[Diagram: web server → Kafka topic → stream processor]
Fault tolerance in streaming
- At least once: ensure all operators see all events
• Storm: Replay stream in failure case (acking of individual records)
- Exactly once: ensure that operators do not perform duplicate updates to their state
• Flink: Distributed Snapshots
• Spark: Micro-batches on batch runtime
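The practical difference between the two guarantees shows up in a simple counter after a failure. The following plain-Java sketch is a toy model under stated assumptions, not how Storm or Flink are implemented: the class `DeliveryGuaranteeSketch`, its parameters, and the single counting operator are hypothetical. With replay only, events between the replay position and the failure are counted twice; with a state snapshot that is restored together with the input position, each event updates the state once.

```java
// Plain-Java sketch (hypothetical names): a counter under two recovery strategies.
final class DeliveryGuaranteeSketch {

    // At least once: the operator fails after `failAfter` events, the source
    // replays from `replayFrom`, and the counter keeps what it already counted.
    static long atLeastOnce(int totalEvents, int failAfter, int replayFrom) {
        long count = 0;
        for (int i = 0; i < failAfter; i++) {
            count++; // processed before the failure
        }
        for (int i = replayFrom; i < totalEvents; i++) {
            count++; // replayed events; some were already counted above
        }
        return count;
    }

    // Exactly once: a snapshot stores the count together with the input
    // position after `checkpointAt` events; both are restored on failure.
    static long exactlyOnce(int totalEvents, int failAfter, int checkpointAt) {
        long count = 0;
        long snapshotCount = 0;
        int snapshotPosition = 0;
        for (int i = 0; i < failAfter; i++) {
            count++;
            if (i + 1 == checkpointAt) {
                snapshotCount = count;
                snapshotPosition = i + 1;
            }
        }
        count = snapshotCount; // failure: discard state, restore the snapshot
        for (int i = snapshotPosition; i < totalEvents; i++) {
            count++; // replay from the position stored with the snapshot
        }
        return count;
    }

    public static void main(String[] args) {
        // 10 events, failure after 7, recovery point after event 4.
        System.out.println("at least once: " + atLeastOnce(10, 7, 4));
        System.out.println("exactly once:  " + exactlyOnce(10, 7, 4));
    }
}
```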
Flink’s Distributed Snapshots
- Lightweight approach of storing the state of all operators without pausing the execution
- Implemented using barriers flowing through the topology
[Diagram: a barrier travels inside the data stream; records before the barrier are part of the snapshot, records after the barrier are not in the snapshot]
Further reading: http://blog.acolyer.org/2015/08/19/asynchronous-distributed-snapshots-for-distributed-dataflows/
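The barrier idea can be sketched for the simplest possible case: operators with a single input, chained one after another. This is a deliberately reduced plain-Java illustration with hypothetical names (`BarrierSnapshotSketch`, the `BARRIER` marker); it leaves out what a real implementation has to deal with, such as aligning barriers across several input channels and writing snapshots asynchronously.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (hypothetical names): a counting operator that snapshots
// its state when a barrier arrives, without stopping processing.
final class BarrierSnapshotSketch {
    static final String BARRIER = "<barrier>"; // hypothetical marker element

    long count = 0;                                  // operator state
    final List<Long> snapshots = new ArrayList<>();  // one entry per barrier
    final List<String> output = new ArrayList<>();   // what flows downstream

    void process(String element) {
        if (BARRIER.equals(element)) {
            // Everything counted so far arrived before the barrier and is
            // therefore part of this snapshot.
            snapshots.add(count);
        } else {
            count++;
        }
        // Records and barriers both keep flowing to the next operator.
        output.add(element);
    }

    public static void main(String[] args) {
        BarrierSnapshotSketch first = new BarrierSnapshotSketch();
        BarrierSnapshotSketch second = new BarrierSnapshotSketch();
        for (String e : new String[] {"a", "b", BARRIER, "c"}) {
            first.process(e);
        }
        for (String e : first.output) {
            second.process(e);
        }
        // Both operators snapshot the state as of the same point in the stream.
        System.out.println(first.snapshots + " " + second.snapshots);
    }
}
```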
Wrap-up: Log processing example
- How to do something with the data? Windowing
- How does the system handle large windows? Managed state
- How do we operate such a system 24x7? Savepoints
- How to ensure correct results across failures? Checkpoints, Master HA
[Diagram: web server → Kafka topic → stream processor]
Performance:
Low Latency & High Throughput
Performance: Introduction
- Performance always depends on your own use cases, so test it yourself!
- We based our experiments on a recent benchmark published by Yahoo!
- They benchmarked Storm, Spark Streaming and Flink with a production use case (counting ad impressions)
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
Yahoo! Benchmark
- Count ad impressions grouped by campaign
- Compute aggregates over a 10 second window
- Emit current value of window aggregates to Redis every second for query
Yahoo’s Results
“Storm […] and Flink […] show sub-second latencies at relatively high throughputs with Storm having the lowest 99th percentile latency. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency.”
(Quote from the blog post’s executive summary)
Extending the benchmark
- Benchmark stops at Storm’s throughput limits. Where is Flink’s limit?
- How will Flink’s own window implementation perform compared to Yahoo’s “state in Redis windowing” approach?
Windowing with state in Redis
[Diagram: KafkaConsumer → map() → filter() → group → windowing & caching code → realtime queries]
Rewrite to use Flink’s own window
[Diagram: KafkaConsumer → map() → filter() → group → Flink event time windows → realtime queries]
Results after rewrite
[Bar chart: throughput in msgs/sec (axis 0 to 3.750.000) for Storm and Flink; one bar is labeled 400k msgs/sec]
Can we even go further?
[Diagram, before: KafkaConsumer → map() → filter() → group → Flink event time windows. The network link to the Kafka cluster is the bottleneck (1 GigE)]
[Diagram, after: Data Generator → map() → filter() → group → Flink event time windows. Solution: move the data generator into the job (10 GigE)]
Results without network bottleneck
[Bar chart: throughput in msgs/sec (axis 0 to 16.000.000) for Storm, Flink and Flink (10 GigE). Labels: 400k msgs/sec, 3m msgs/sec, and 15m msgs/sec for the 10 GigE end-to-end setup]
Benchmark summary
- Flink achieves throughput of 15 million messages/second on 10 machines
- 35x higher throughput compared to Storm (80x compared to Yahoo’s runs)
- Flink ran with exactly-once guarantees, Storm with at-least-once.
- Read the full report: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Closing
Other notable features
- Expressive DataStream API (similar to high-level APIs known from the batch world)
- Flink is a full-fledged batch processor with an optimizer, managed memory, memory-aware algorithms, built-in iterations
- Many libraries: Complex Event Processing (CEP), Graph Processing, Machine Learning
- Integration with YARN, HBase, ElasticSearch, Kafka, MapReduce, …
Questions?
- Ask now!
- Email: rmetzger@apache.org
- Twitter: @rmetzger_
- Follow: @ApacheFlink
- Read: flink.apache.org/blog, data-artisans.com/blog/
- Mailing lists: (news | user | dev)@flink.apache.org
Appendix
Roadmap 2016
- SQL / StreamSQL
- CEP Library
- Managed Operator State
- Dynamic Scaling
- Miscellaneous
Miscellaneous
- Support for Apache Mesos
- Security
• Over-the-wire encryption of RPC (Akka) and data transfers (Netty)
- More connectors
• Apache Cassandra
• Amazon Kinesis
- Enhance metrics
• Throughput / Latencies
• Backpressure monitoring
• Spilling / Out of Core
Fault Tolerance and correctness
- How can we ensure the state is always in sync with the events?
[Diagram: a dataflow of event counter operators leading to a final operator; the counters show 4, 3, 4 and 2]
Naïve state checkpointing approach
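- Process some records
- Stop everything, store state
- Continue processing …
[Diagram: the counters of four operators a, b, c, d start at 0, 0, 0, 0. After processing some records they show 1, 1, 2, 2; everything is stopped and the operator state is stored as a = 1, b = 1, c = 2, d = 2]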
Distributed Snapshots
[Diagram, step by step, for four operators a, b, c, d:
1. Initial state: counters 0, 0, 0, 0
2. Start processing: counters 1, 1, 0, 0
3. Trigger checkpoint: counters 1, 1, 0, 0; operator state stored so far: a = 1, b = 1
4. Barrier flows with events: counters 2, 1, 2, 0; operator state stored so far: a = 1, b = 1, c = 2
5. Checkpoint completed: counters 2, 1, 2, 2; operator state a = 1, b = 1, c = 2, d = 2 is a complete, consistent state snapshot]
- Valid snapshot without stopping the topology
- Multiple checkpoints can be in-flight
Analysis of naïve approach
- Introduces latency
- Reduces throughput
- Can we create a correct snapshot while keeping the job running?
- Yes! By creating a distributed snapshot
Handling Backpressure
[Diagram: an operator that is not able to process incoming data immediately slows down its upstream operators]
Backpressure might occur when:
• Operators create checkpoints
• Windows are evaluated
• Operators depend on external resources
• JVMs do garbage collection
Handling Backpressure
[Diagram: senders and receivers exchange full and empty buffers, either by network transfer (Netty) or by local buffer exchange (when sender and receiver are on the same machine). When the sender does not have any empty buffers available, it slows down]
• Data sources slow down pulling data from their underlying system (Kafka or similar queues)
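The buffer-based slowdown can be modeled with a bounded queue. This is a minimal plain-Java sketch under stated assumptions, not Flink's network stack: the class `BackpressureSketch` and its methods are hypothetical, and one queue slot stands in for one network buffer. When all buffers are in use, `trySend` fails, which is the signal for the sender to slow down.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Plain-Java sketch (hypothetical names): a fixed pool of buffers between a
// sender and a receiver creates backpressure when the receiver is slow.
final class BackpressureSketch {
    private final ArrayBlockingQueue<String> buffers;

    BackpressureSketch(int numBuffers) {
        buffers = new ArrayBlockingQueue<>(numBuffers);
    }

    // Returns false when no empty buffer is available: the sender has to
    // slow down and retry later instead of producing more data.
    boolean trySend(String record) {
        return buffers.offer(record);
    }

    // The receiver consumes one record and thereby frees one buffer.
    // Returns null when nothing is waiting.
    String receive() {
        return buffers.poll();
    }

    public static void main(String[] args) {
        BackpressureSketch channel = new BackpressureSketch(2);
        System.out.println(channel.trySend("a")); // buffer available
        System.out.println(channel.trySend("b")); // buffer available
        System.out.println(channel.trySend("c")); // all buffers full
        channel.receive();                        // receiver frees a buffer
        System.out.println(channel.trySend("c")); // sender can continue
    }
}
```

A blocking `put` instead of `offer` would model a sender thread that simply waits; the non-blocking variant is used here so the example runs without threads.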
How do latency and throughput affect
each other?
30 machines, one repartition step
[Diagram: senders ship buffers to receivers; a buffer is sent when it is full or when a timeout expires]
• High throughput by batching events in network buffers
• Filling the buffers introduces latency
• Configurable buffer timeout
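The "send buffer when full or timeout" rule is what makes the latency/throughput tradeoff tunable. The sketch below is a plain-Java illustration with hypothetical names (`BufferTimeoutSketch`, `add`, `tick`); time is passed in explicitly so the behavior is deterministic, whereas a real system would use a timer.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (hypothetical names): ship a buffer when it is full or
// when its oldest record has waited longer than the timeout.
final class BufferTimeoutSketch {
    private final int capacity;
    private final long timeoutMs;
    private final List<String> buffer = new ArrayList<>();
    private long firstRecordTimeMs;
    final List<List<String>> shipped = new ArrayList<>();

    BufferTimeoutSketch(int capacity, long timeoutMs) {
        this.capacity = capacity;
        this.timeoutMs = timeoutMs;
    }

    // Larger capacity: more batching, higher throughput, more latency.
    void add(String record, long nowMs) {
        if (buffer.isEmpty()) {
            firstRecordTimeMs = nowMs;
        }
        buffer.add(record);
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    // Called periodically; a smaller timeout bounds the added latency.
    void tick(long nowMs) {
        if (!buffer.isEmpty() && nowMs - firstRecordTimeMs >= timeoutMs) {
            flush();
        }
    }

    private void flush() {
        shipped.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferTimeoutSketch sender = new BufferTimeoutSketch(3, 100);
        sender.add("a", 0);
        sender.add("b", 10);
        sender.add("c", 20);   // buffer full: shipped immediately
        sender.add("d", 200);
        sender.tick(300);      // timeout reached: partial buffer shipped
        System.out.println(sender.shipped);
    }
}
```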
Aggregate throughput for stream record grouping
[Bar chart: aggregate throughput in elements per second (axis 0 to 100.000.000) for four configurations: Flink without fault tolerance, Flink with exactly once, Storm without fault tolerance, Storm with at least once. Labels: aggregate throughput of 83 million elements per second; 8,6 million elements/s; 309k elements/s. Setup: 30 machines, 120 cores, Google Compute]
- Flink achieves 260x higher throughput with fault tolerance
Performance: Summary
[Diagram: continuous streaming, latency-bound buffering and distributed snapshots together give high throughput & low latency, with a configurable throughput/latency tradeoff]
The building blocks: Summary
Windowing / Out of order events
• Tumbling / sliding windows
• Event time / processing time
• Low watermarks for out of order events
State handling
• Managed operator state for backup/recovery
• Large state with RocksDB
• Savepoints for operations
Fault tolerance and correctness
• Exactly-once semantics for managed operator state
• Lightweight, asynchronous distributed snapshotting algorithm
Low latency, high throughput
• Efficient, pipelined runtime
• No per-record operations
• Tunable latency / throughput tradeoff
• Async checkpoints
Low Watermarks
- We periodically send low watermarks through the system to indicate the progression of event time.
[Diagram: a stream of out-of-order events with timestamps 33, 11, 28, 21, 15, 9, interleaved with watermarks (5 and 8). A watermark with value 5 guarantees that no event with time <= 5 will arrive afterwards. The window between 0 and 15 is evaluated when watermarks arrive]
For more details: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale” by T. Akidau et al.
Low Watermarks
[Diagram: an operator with multiple inputs, shown with the watermark values 3 and 5]
- Operators with multiple inputs always forward the lowest watermark
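Forwarding the lowest watermark is a small piece of bookkeeping per operator. The following plain-Java sketch shows one way to do it; the class `MinWatermarkSketch` and its method are hypothetical names, and the sketch assumes at least one input and watermarks that only move forward per input.

```java
import java.util.Arrays;

// Plain-Java sketch (hypothetical names): an operator with several inputs
// tracks the latest watermark per input and forwards the minimum.
final class MinWatermarkSketch {
    private final long[] inputWatermarks;
    private long lastEmitted = Long.MIN_VALUE;

    // Assumes numInputs >= 1.
    MinWatermarkSketch(int numInputs) {
        inputWatermarks = new long[numInputs];
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    // Returns the watermark to forward downstream, or null if the operator's
    // own event time did not advance.
    Long onWatermark(int input, long watermark) {
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermark);
        // The operator can only be as far along as its slowest input.
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        if (min > lastEmitted) {
            lastEmitted = min;
            return min;
        }
        return null;
    }

    public static void main(String[] args) {
        MinWatermarkSketch op = new MinWatermarkSketch(2);
        System.out.println(op.onWatermark(0, 5)); // second input not seen yet
        System.out.println(op.onWatermark(1, 3)); // lowest of 5 and 3
        System.out.println(op.onWatermark(1, 8)); // lowest of 5 and 8
    }
}
```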
[Image-only slides: Bouygues Telecom (three slides), Capital One (one slide)]
Fault Tolerance in streaming
- Failure with “at least once”: replay
[Diagram: restore from counters 4, 3, 4, 2; final result 7, 5, 9, 7]
- Failure with “exactly once”: state restore
[Diagram: restore from counters 1, 1, 2, 2; final result 4, 3, 7, 7]
Latency in stream record grouping
[Diagram: data generator → receiver, which measures throughput and latency]
• Measure time for a record to travel from source to sink
[Bar chart: median latency (axis 0,00 to 30,00) for Flink without fault tolerance, Flink with exactly once and Storm with at least once; labels: 25 ms, 1 ms]
[Bar chart: 99th percentile latency (axis 0,00 to 60,00) for the same three configurations; label: 50 ms]
Savepoints: Simplifying Operations
- Streaming jobs usually run 24x7 (unlike batch).
- Application bug fixes: Replay your job from a certain point in time (savepoint)
- Flink bug fixes
- Maintenance and system migration
- What-If simulations: Run different implementations of your code against a savepoint
Pipelining
Basic building block to “keep the data moving”
• Low latency
• Operators push data forward
• Data shipping as buffers, not tuple-wise
• Natural handling of back-pressure