Slides for my talk at Hadoop Summit Dublin, April 2016.
The talk shows how streaming can subsume batch use cases, using continuous counting as the example.
2. What is Apache Flink?
Apache Flink is an open source stream
processing framework.
• Event Time Handling
• State & Fault Tolerance
• Low Latency
• High Throughput
Developed at the Apache Software Foundation.
3. Recent History
April ‘14: Project Incubation
December ‘14: Top Level Project
March ‘16: Release 1.0
Earlier releases: v0.5, v0.6, v0.7, v0.8, v0.10
4. Flink Stack
DataStream API: Stream Processing
DataSet API: Batch Processing
Runtime: Distributed Streaming Data Flow
Libraries
Streaming and batch as first-class citizens.
14. Streaming
Until now, stream processors were less mature
than their batch counterparts. This led to:
• in-house solutions,
• abuse of batch processors,
• Lambda architectures
These workarounds are no longer needed with new-generation
stream processors like Flink.
15. Streaming All the Way
Message Queue (e.g. Apache Kafka): Durability and Replay
Stream Processor (e.g. Apache Flink): Consistent Processing
Streaming Job
Serving Layer
16. Building Blocks of Flink
• Explicit Handling of Time
• State & Fault Tolerance
• Performance
18. Tumbling Windows (No Overlap)
Time
e.g. “Count over the last 5 minutes”,
“Average over the last 100 records”
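A minimal sketch of how a tumbling time window can be computed, in plain Java rather than Flink's API. The class and method names (`TumblingWindowSketch`, `windowStart`, `countPerWindow`) are hypothetical, and timestamps are assumed to be non-negative milliseconds. Each timestamp falls into exactly one window, which is what "no overlap" means.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper, not part of Flink.
class TumblingWindowSketch {

    // Start of the single tumbling window [start, start + size) that contains the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Counts events per window; keys are window start timestamps.
    static Map<Long, Integer> countPerWindow(long[] timestampsMillis, long sizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMillis) {
            counts.merge(windowStart(ts, sizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        long[] events = {10_000L, 250_000L, 310_000L, 599_999L, 600_000L};
        System.out.println(countPerWindow(events, fiveMinutes));
    }
}
```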
19. Sliding Windows (with Overlap)
Time
e.g. “Count over the last 5 minutes,
updated each minute.”,
“Average over the last 100 elements,
updated every 10 elements”
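With overlap, one event belongs to several windows at once. The sketch below, again plain Java and not Flink's API, lists the start of every sliding window that contains a given timestamp. The names `SlidingWindowSketch` and `windowStarts` are hypothetical; timestamps are assumed non-negative, and windows near time zero may have negative start values.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Flink.
class SlidingWindowSketch {

    // Starts of all windows [start, start + size) that contain the timestamp,
    // where a new window begins every `slide` time units.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // "Count over the last 5 minutes, updated each minute."
        System.out.println(windowStarts(7 * minute, 5 * minute, minute));
    }
}
```

With a 5-minute size and a 1-minute slide, every event contributes to five windows, so the per-event work grows with size divided by slide.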
20. Explicit Handling of Time
DataStream<ColorEvent> counts = env
    .addSource(new KafkaConsumer(…))
    .keyBy("color")
    .timeWindow(Time.minutes(60))
    .apply(new CountPerWindow());
Time is explicit
in your program
23. Notions of Time
Event Time (12:23 am): Time when event happened.
Processing Time (1:37 pm): Time measured by system clock.
24. Out of Order Events
Processing Time (release year): 1977 Episode IV, 1980 Episode V, 1983 Episode VI, 1999 Episode I, 2002 Episode II, 2005 Episode III, 2015 Episode VII
Event Time (story order): Episode I, II, III, IV, V, VI, VII
25. Out of Order Events
1st burst of events
2nd burst of events
Event Time Windows
Processing Time Windows
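To make the difference concrete, here is a small plain-Java sketch that counts the same events once by event time and once by processing time. Everything in it (`TimeNotionsSketch`, `Event`, the timestamps) is made up for illustration. It also ignores the question of how long a real system waits for late events, which Flink handles separately.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToLongFunction;

// Illustrative only: groups the same events by two different clocks.
class TimeNotionsSketch {

    static final class Event {
        final long eventTime;       // when the event happened
        final long processingTime;  // when the processor saw it
        Event(long eventTime, long processingTime) {
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }
    }

    // Counts events per tumbling window, using the given clock to pick the window.
    static Map<Long, Integer> countPerWindow(Event[] events, long size, ToLongFunction<Event> clock) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long t = clock.applyAsLong(e);
            counts.merge(t - (t % size), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two events from the first burst (event times 8 and 9) arrive late.
        Event[] events = {
            new Event(1, 2), new Event(3, 4), new Event(8, 12),
            new Event(11, 13), new Event(15, 16), new Event(9, 21)
        };
        System.out.println("event time:      " + countPerWindow(events, 10, e -> e.eventTime));
        System.out.println("processing time: " + countPerWindow(events, 10, e -> e.processingTime));
    }
}
```

The event-time counts depend only on when things happened, so replaying the same data later gives the same windows. The processing-time counts shift with arrival delays.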
29. Processing Semantics
• At-least-once: may over-count after failure
• Exactly once: correct counts after failures
• End-to-end exactly once: correct counts in external system
(e.g. DB, file system) after failure
30. Processing Semantics
• Flink guarantees exactly once (can be configured
for at-least once if desired)
• End-to-end exactly once with specific sources
and sinks (e.g. Kafka -> Flink -> HDFS)
• Internally, Flink periodically takes consistent
snapshots of the state without ever stopping
computation
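The effect of snapshots on counts can be shown with a toy, single-threaded model. This is not Flink's snapshot algorithm, which is distributed and does not pause processing. The names (`CheckpointSketch`, `count`, `restoreState`) are hypothetical. A snapshot stores the input offset together with the counter; after a failure the input is replayed from the snapshot offset, and the counter is either rolled back with it or left as it was.

```java
// Toy model of counting with periodic snapshots and one simulated failure.
class CheckpointSketch {

    // Counts `totalRecords` records. A snapshot of (offset, count) is taken every
    // `checkpointInterval` records. One failure occurs just before the record at
    // offset `failAt`; the input is then replayed from the last snapshot offset.
    // If `restoreState` is true the counter is rolled back too (exactly once);
    // otherwise it keeps its value and replayed records are counted twice.
    static long count(int totalRecords, int checkpointInterval, int failAt, boolean restoreState) {
        long count = 0;
        int offset = 0;
        int snapshotOffset = 0;
        long snapshotCount = 0;
        boolean failed = false;
        while (offset < totalRecords) {
            if (!failed && offset == failAt) {
                failed = true;
                offset = snapshotOffset;      // replay from the snapshot position
                if (restoreState) {
                    count = snapshotCount;    // roll the state back as well
                }
                continue;
            }
            count++;
            offset++;
            if (offset % checkpointInterval == 0) {
                snapshotOffset = offset;
                snapshotCount = count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("state restored:     " + count(10, 4, 6, true));
        System.out.println("state not restored: " + count(10, 4, 6, false));
    }
}
```

The over-count in the second case equals the number of records processed between the last snapshot and the failure, which is why state and input position have to be restored together.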
31. Yahoo! Benchmark
• Storm 0.10, Spark Streaming 1.5, and Flink 0.10
benchmark by the Storm team at Yahoo!
• Focus on measuring end-to-end latency
at low throughputs (~ 200k events/sec)
• First benchmark modelled after a real application
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
32. Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over last 10 seconds
• Make aggregates available for queries in Redis
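The shape of this job can be sketched in plain Java. This is not the benchmark's code: `CampaignCountSketch` and `Impression` are hypothetical names, the 10-second aggregate is assumed here to be a tumbling window, and an in-memory map stands in for Redis.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: counts ad impressions per campaign and 10-second window.
class CampaignCountSketch {

    static final long WINDOW_MILLIS = 10_000L;

    static final class Impression {
        final String campaign;
        final long eventTimeMillis;
        Impression(String campaign, long eventTimeMillis) {
            this.campaign = campaign;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    // Keys have the form "<campaign>@<windowStartMillis>"; the map stands in
    // for the key-value store that serves queries.
    static Map<String, Long> countPerCampaignAndWindow(Impression[] impressions) {
        Map<String, Long> counts = new TreeMap<>();
        for (Impression i : impressions) {
            long windowStart = i.eventTimeMillis - (i.eventTimeMillis % WINDOW_MILLIS);
            counts.merge(i.campaign + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Impression[] impressions = {
            new Impression("a", 1_000L), new Impression("a", 9_999L),
            new Impression("b", 5_000L), new Impression("a", 10_000L)
        };
        System.out.println(countPerCampaignAndWindow(impressions));
    }
}
```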
34. Extending the Benchmark
• Great starting point, but the benchmark stops at
low write throughput and the programs are not
fault-tolerant
• Extend the benchmark to high volumes and to use
Flink’s built-in fault-tolerant state
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
36. Throughput (Higher is Better)
Bar chart of Maximum Throughput (events/sec), axis from 0 to 15,000,000.
Bars: Flink w/o Kafka, Flink w/ Kafka, Storm w/ Kafka.
Annotation: Limited by bandwidth between Kafka and Flink cluster.
37. Summary
• Stream processing is gaining momentum as the right
paradigm for continuous data applications
• Choice of framework is crucial: even seemingly
simple applications become complex at scale and
in production
• Flink offers a unique combination of efficiency,
consistency and event time handling
40. Upcoming Features
• SQL: ongoing work in collaboration with Apache
Calcite
• Dynamic Scaling: adapt resources to stream volume,
scale up for historical stream processing
• Queryable State: query the state inside the stream
processor