In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that analyzes the Wikipedia edit stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
2. Recent History
April '14: Project Incubation
December '14: Top Level Project
April '15
Releases: v0.5, v0.6, v0.7, v0.8, v0.9
Currently moving towards the 0.10 and 1.0 releases.
3. What is Flink?
Streaming Topologies: Low Latency (figure: time and count windows over a stream)
Long Batch Pipelines: Resource Utilization
Machine Learning: Iterative Algorithms (figure: Rating Matrix = User Matrix x Item Matrix)
Graph Analysis: Mutable State (figure: graph)
4. Overview
Deployment
Local (Single JVM) · Cluster (Standalone, YARN)
DataStream API
Unbounded Data
DataSet API
Bounded Data
Runtime
Distributed Streaming Data Flow
Libraries
Machine Learning · Graph Processing · SQL-like API
7. Cornerstones of Flink
Low Latency for fast results.
High Throughput to handle many events per second.
Exactly-once guarantees for correct results.
Intuitive APIs for productivity.
8. DataStream API
StreamExecutionEnvironment env = StreamExecutionEnvironment
    .getExecutionEnvironment();
DataStream<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataStream Windowed WordCount
DataStream<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace()) // (word, 1)
    .keyBy(0) // [word, [1, 1, …]] for 10 seconds
    .timeWindow(Time.of(10, TimeUnit.SECONDS))
    .sum(1); // sum per word per 10 second window
counts.print();
env.execute();
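To make the comments on the slide concrete, here is a minimal plain-Java sketch of what a keyed, 10-second tumbling-window word count computes. It does not use Flink at all: the class `WindowedWordCountSketch`, the `TimedLine` type, and the explicit timestamps are assumptions made for illustration, whereas the job on the slide lets the runtime assign elements to windows.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; all names here are hypothetical, not Flink API.
class WindowedWordCountSketch {

    // A line of text plus the (non-negative) time in milliseconds at which it arrived.
    static final class TimedLine {
        final long timestampMillis;
        final String text;

        TimedLine(long timestampMillis, String text) {
            this.timestampMillis = timestampMillis;
            this.text = text;
        }
    }

    // Returns windowStart -> (word -> count), one entry per non-empty window.
    static Map<Long, Map<String, Integer>> count(List<TimedLine> lines, long windowMillis) {
        Map<Long, Map<String, Integer>> windows = new TreeMap<>();
        for (TimedLine line : lines) {
            // Tumbling windows: every timestamp falls into exactly one window.
            long windowStart = line.timestampMillis - (line.timestampMillis % windowMillis);
            Map<String, Integer> counts =
                windows.computeIfAbsent(windowStart, start -> new TreeMap<>());
            // flatMap step: split on whitespace, each token contributes (word, 1).
            for (String word : line.text.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // keyBy(word) followed by sum
                }
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        List<TimedLine> lines = Arrays.asList(
            new TimedLine(1_000L, "O Romeo, Romeo! wherefore art thou Romeo?"),
            new TimedLine(12_000L, "Romeo"));
        System.out.println(count(lines, 10_000L));
    }
}
```

Splitting on whitespace alone leaves punctuation attached, so "Romeo," and "Romeo?" count as different words; a real tokenizer would likely normalize them.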
20. Streaming Fault Tolerance
At Least Once
• Ensure that all operators see all events.
Exactly Once
• Ensure that all operators see all events.
• Do not perform duplicate updates to operator state.
Flink guarantees exactly once processing.
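The difference between the two guarantees can be illustrated with a small simulation. This is a sketch under stated assumptions, not Flink code: the operator state is a running sum, a checkpoint records the state as of a source offset, and a failure forces a replay from that offset. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of recovery after a failure; not Flink API.
class DeliveryGuaranteeSketch {

    // Processes events[0..failAt), fails, then recovers by replaying from checkpointAt.
    // restoreState = true  models exactly-once: state and source offset roll back together.
    // restoreState = false models at-least-once: only the source offset rolls back,
    // so replayed events update the state a second time.
    static long runWithFailure(List<Integer> events, int checkpointAt, int failAt,
                               boolean restoreState) {
        if (checkpointAt < 0 || checkpointAt >= failAt || failAt > events.size()) {
            throw new IllegalArgumentException("need 0 <= checkpointAt < failAt <= events.size()");
        }
        long sum = 0;
        long snapshot = 0;
        for (int i = 0; i < failAt; i++) {
            if (i == checkpointAt) {
                snapshot = sum; // state after exactly the first checkpointAt events
            }
            sum += events.get(i);
        }
        // The failure happens here. Recovery:
        if (restoreState) {
            sum = snapshot;
        }
        for (int i = checkpointAt; i < events.size(); i++) {
            sum += events.get(i); // replay from the checkpointed offset
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(runWithFailure(events, 2, 4, true));  // 15, the correct sum
        System.out.println(runWithFailure(events, 2, 4, false)); // 22, events 3 and 4 counted twice
    }
}
```

In both runs every event is seen at least once; only the run that restores the state together with the offset avoids the duplicate updates.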
22. Distributed Snapshots
Barriers flow through the topology in line with data.
Part of snapshot
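One way to picture how in-line barriers yield a consistent snapshot is an operator with several inputs that stops reading from an input once that input has delivered its barrier, and snapshots its state only when the barrier has arrived on every input. The following is a simplified single-threaded sketch under those assumptions; the `BARRIER` sentinel, the round-robin reading order, and all names are hypothetical and not taken from Flink.

```java
import java.util.Arrays;

// Hypothetical simulation of barrier alignment at a summing operator; not Flink API.
class BarrierAlignmentSketch {

    static final int BARRIER = Integer.MIN_VALUE; // sentinel marking a checkpoint barrier

    // Reads the inputs round-robin and sums all records.
    // Returns {state at the most recent snapshot (or -1 if none), final state}.
    static long[] run(int[][] inputs) {
        int n = inputs.length;
        int[] position = new int[n];
        boolean[] blocked = new boolean[n]; // inputs that already delivered the barrier
        int barriersSeen = 0;
        long sum = 0;
        long snapshot = -1;
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int c = 0; c < n; c++) {
                if (blocked[c] || position[c] >= inputs[c].length) {
                    continue; // do not read past a barrier until all inputs are aligned
                }
                int item = inputs[c][position[c]++];
                progress = true;
                if (item == BARRIER) {
                    blocked[c] = true;
                    barriersSeen++;
                    if (barriersSeen == n) {
                        snapshot = sum;              // only pre-barrier records are included
                        Arrays.fill(blocked, false); // resume reading from all inputs
                        barriersSeen = 0;
                    }
                } else {
                    sum += item;
                }
            }
        }
        return new long[] {snapshot, sum};
    }

    public static void main(String[] args) {
        long[] result = run(new int[][] {{BARRIER, 100}, {1, 2, 3, BARRIER}});
        System.out.println(result[0] + " " + result[1]); // 6 106
    }
}
```

In the usage example the record 100 is available early on the first input, but because it follows the barrier it is held back and does not become part of the snapshot.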
32. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: State 1:
Source 2: State 2:
Source 3: Sink 1:
Source 4: Sink 2:
Source offsets: 6791, 7252, 5589, 6843
33. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: State 1:
Source 2: State 2:
Source 3: Sink 1:
Source 4: Sink 2:
Source offsets: 6791, 7252, 5589, 6843
Start Checkpoint Message
34. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1:
Source 2: 7252 State 2:
Source 3: 5589 Sink 1:
Source 4: 6843 Sink 2:
Emit barriers
Acknowledge with position
35. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1:
Source 2: 7252 State 2:
Source 3: 5589 Sink 1:
Source 4: 6843 Sink 2:
Received barrier at each input
36. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1:
Source 2: 7252 State 2:
Source 3: 5589 Sink 1:
Source 4: 6843 Sink 2:
Received barrier at each input
Write snapshot of its state (s1)
37. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1: PTR1
Source 2: 7252 State 2: PTR2
Source 3: 5589 Sink 1:
Source 4: 6843 Sink 2:
Snapshots: s1, s2
Acknowledge with pointer to state
38. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1: PTR1
Source 2: 7252 State 2: PTR2
Source 3: 5589 Sink 1: ACK
Source 4: 6843 Sink 2: ACK
Snapshots: s1, s2
Received barrier at each input
Acknowledge checkpoint
39. Distributed Snapshots
JobManager (Master) · State Backend
Checkpoint Data
Source 1: 6791 State 1: PTR1
Source 2: 7252 State 2: PTR2
Source 3: 5589 Sink 1: ACK
Source 4: 6843 Sink 2: ACK
Snapshots: s1, s2
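The bookkeeping on the master side of this sequence can be sketched as a table of pending acknowledgements: sources acknowledge with their position, stateful operators with a pointer to their stored state, and sinks with a plain ACK. The sketch below assumes the checkpoint counts as complete once every task has acknowledged; the class `PendingCheckpointSketch` and its methods are hypothetical and only mirror the "Checkpoint Data" table on the slides.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the master's checkpoint table; not Flink API.
class PendingCheckpointSketch {

    private final Set<String> awaiting = new LinkedHashSet<>();
    private final Map<String, String> acknowledged = new LinkedHashMap<>();

    PendingCheckpointSketch(String... taskNames) {
        for (String task : taskNames) {
            awaiting.add(task);
        }
    }

    // Records one acknowledgement. The payload is a source offset, a state pointer,
    // or "ACK". Returns true once every task has acknowledged.
    boolean acknowledge(String task, String payload) {
        if (!awaiting.remove(task)) {
            throw new IllegalArgumentException("Unexpected acknowledgement from " + task);
        }
        acknowledged.put(task, payload);
        return awaiting.isEmpty();
    }

    boolean isComplete() {
        return awaiting.isEmpty();
    }

    Map<String, String> data() {
        return acknowledged;
    }

    public static void main(String[] args) {
        PendingCheckpointSketch checkpoint = new PendingCheckpointSketch(
            "Source 1", "Source 2", "State 1", "Sink 1");
        checkpoint.acknowledge("Source 1", "6791"); // position at which the barrier was emitted
        checkpoint.acknowledge("Source 2", "7252");
        checkpoint.acknowledge("State 1", "PTR1");  // pointer into the state backend
        boolean done = checkpoint.acknowledge("Sink 1", "ACK");
        System.out.println(done + " " + checkpoint.data());
    }
}
```

On recovery, a table like this holds what is needed to restart: the offsets to rewind the sources to and the pointers from which to reload operator state.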
40. Operator State
User-defined state
• Flink's transformations are long-running operators
• Feel free to keep objects around
• Hooks to include them in the system's checkpoints
Windowed streams
• Time, count, and data-driven windows
• Managed by the system
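As an illustration of user-defined state with checkpoint hooks, here is a minimal sketch of a long-running operator that keeps an ordinary Java map around and exposes snapshot and restore methods. The `SnapshotHooks` interface is a hypothetical stand-in defined for this example; it is not Flink's actual interface, and the method names are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of user-defined operator state with checkpoint hooks; not Flink API.
class StatefulCounterSketch {

    // Stand-in for the hooks a system could call when it takes or restores a checkpoint.
    interface SnapshotHooks<S> {
        S snapshotState();

        void restoreState(S state);
    }

    // A long-running operator that keeps a per-key count in a plain Java object.
    static class CountPerKey implements SnapshotHooks<Map<String, Long>> {
        private Map<String, Long> counts = new HashMap<>();

        // Called once per incoming element; returns the updated count for its key.
        long process(String key) {
            return counts.merge(key, 1L, Long::sum);
        }

        @Override
        public Map<String, Long> snapshotState() {
            return new HashMap<>(counts); // copy, so later updates do not leak into the snapshot
        }

        @Override
        public void restoreState(Map<String, Long> state) {
            counts = new HashMap<>(state);
        }
    }

    public static void main(String[] args) {
        CountPerKey operator = new CountPerKey();
        operator.process("a");
        operator.process("a");
        Map<String, Long> snapshot = operator.snapshotState();
        operator.process("a");                     // update made after the snapshot
        operator.restoreState(snapshot);           // roll back, as after a failure
        System.out.println(operator.process("a")); // 3
    }
}
```

Copying the map in `snapshotState` is the important design choice here: a snapshot that shares the live object would silently change as processing continues.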
41. Batch on Streaming
DataStream API
Unbounded Data
DataSet API
Bounded Data
Runtime
Distributed Streaming Data Flow
Libraries
Machine Learning · Graph Processing · SQL-like API
42. Batch on Streaming
Run a bounded stream (data set) on a stream processor.
Bounded data set
Unbounded data stream
43. Batch on Streaming
Infinite Streams: Stream Windows · Pipelined Data Exchange
Finite Streams: Global View · Pipelined or Blocking Data Exchange
Run a bounded stream (data set) on a stream processor.
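The idea that a data set is just a stream that ends can be sketched with a single element-at-a-time loop. In this hypothetical plain-Java example the same method produces per-window results for a stream (count windows of a fixed size) and a single global result for a bounded input, simply by making the window large enough to cover everything. All names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of running bounded input through streaming-style code.
class BatchOnStreamingSketch {

    // Consumes one element at a time and emits a word count per count window.
    // A final, possibly partial, window is emitted when the input ends.
    static List<Map<String, Integer>> countPerWindow(Iterator<String> words, long windowSize) {
        List<Map<String, Integer>> results = new ArrayList<>();
        Map<String, Integer> current = new TreeMap<>();
        long seen = 0;
        while (words.hasNext()) {
            current.merge(words.next(), 1, Integer::sum);
            if (++seen == windowSize) {
                results.add(current);
                current = new TreeMap<>();
                seen = 0;
            }
        }
        if (!current.isEmpty()) {
            results.add(current);
        }
        return results;
    }

    // Bounded input: one window spanning the whole input gives the global view.
    static Map<String, Integer> countAll(Iterator<String> words) {
        List<Map<String, Integer>> results = countPerWindow(words, Long.MAX_VALUE);
        return results.isEmpty() ? new TreeMap<String, Integer>() : results.get(0);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c");
        System.out.println(countPerWindow(words.iterator(), 2)); // [{a=1, b=1}, {a=1, c=1}]
        System.out.println(countAll(words.iterator()));          // {a=2, b=1, c=1}
    }
}
```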
47. DataSet API
ExecutionEnvironment env = ExecutionEnvironment
    .getExecutionEnvironment();
DataSet<String> data = env.fromElements(
    "O Romeo, Romeo! wherefore art thou Romeo?", ...);
// DataSet WordCount
DataSet<Tuple2<String, Integer>> counts = data
    .flatMap(new SplitByWhitespace()) // (word, 1)
    .groupBy(0) // [word, [1, 1, …]]
    .sum(1); // sum per word for all occurrences
counts.print();
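For readers without a Flink setup, the same flatMap, group, and sum pipeline can be approximated with the JDK's own stream library. This is only an analogy for the shape of the program, written with `java.util.stream`; the class and method names are hypothetical, and nothing here runs on Flink.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// JDK-only analogy of the DataSet WordCount; names are hypothetical.
class BatchWordCountSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+"))) // one element per token
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingBy(
                word -> word,                                   // group by the word itself
                TreeMap::new,
                Collectors.summingInt(word -> 1)));             // sum a 1 for each occurrence
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

Unlike the windowed variant, the result here is available only after the whole input has been consumed, which matches the "sum per word for all occurrences" comment on the slide.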
54. Batch-specific optimizations
Managed memory
• On- and off-heap memory
• Internal operators (e.g. join or sort) with out-of-core support
• Serialization stack for user-types
Cost-based optimizer
• Program adapts to changing data size