This document provides an overview of Apache Flink and streaming analytics. It discusses key concepts in streaming such as event time vs processing time, watermarks, windows, and fault tolerance using checkpoints and savepoints. It provides examples of time-windowed and session-windowed aggregations as well as pattern detection using state. The document also covers mixing event time and processing time, window triggers, and reprocessing data from savepoints in streaming jobs.
2. Apache Flink Stack
Libraries
DataStream API (stream processing) and DataSet API (batch processing)
Runtime: distributed streaming data flow
Streaming and batch as first class citizens.
3. Today
Streaming and batch as first class citizens.
4. Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams.
5. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
6. Continuous Processing with Batch
• Continuous ingestion
• Periodic (e.g., hourly) files
• Periodic batch jobs
7. λ Architecture
• "Batch layer": what we had before
• "Stream layer": approximate early results
9. A brief History of Flink
• January ’10: Project Stratosphere (Flink precursor)
• April ’14: Flink project enters incubation (v0.5, v0.6, v0.7)
• December ’14: top-level project (v0.8, v0.9, v0.10)
• March ’16: release 1.0
10. A brief History of Flink
The academia gap: reading/writing papers, teaching, worrying about thesis.
Realizing this might be interesting to people beyond academia (even more so, actually).
12. What makes Flink flink?
True streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)
Event time
• Make more sense of data
• Works on real-time and historic data
Stateful streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state
APIs and libraries
• Flexible windows (time, count, session, roll-your own)
• Complex Event Processing
14. Time-Windowed Aggregations
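import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
// Tumbling window: sum the measure per sensor over each 5-second window
stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .sum("measure")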
15. Time-Windowed Aggregations
case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
// Sliding window: 60 seconds long, evaluated every 5 seconds
stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(60), Time.seconds(5))
  .sum("measure")
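To make the sliding-window semantics concrete, the following plain-Java sketch computes which windows contain a given timestamp. It does not use Flink: the class and method names are hypothetical, and windows are assumed to be aligned to the epoch with no offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of sliding-window assignment; not Flink's implementation. */
final class SlidingWindowSketch {

    /** Returns the start (ms) of every window [start, start + size) containing the timestamp. */
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that began at or before the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 60-second windows sliding every 5 seconds: an event belongs to 12 windows.
        System.out.println(windowStarts(62_000L, 60_000L, 5_000L).size());
        // A tumbling window is the special case size == slide: exactly one window.
        System.out.println(windowStarts(62_000L, 5_000L, 5_000L));
    }
}
```

Each event is counted in size / slide windows, which is why a 60-second window with a 5-second slide emits an updated sum every 5 seconds.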
16. Session-Windowed Aggregations
case class Event(sensor: String, measure: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
// Session window: closes after 60 seconds without events for a sensor
stream
  .keyBy("sensor")
  .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
  .max("measure")
Flink 1.1 syntax
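The idea behind session windows can be sketched without Flink: events of one key are grouped into sessions, and a new session begins after a period of inactivity. The plain-Java class below is a hypothetical illustration. It assumes a new session starts when the gap to the previous event is strictly greater than the configured gap; the exact boundary handling in Flink may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of gap-based sessionization for a single key. */
final class SessionWindowSketch {

    /** Groups timestamps (ms) into sessions separated by more than gapMs of inactivity. */
    static List<List<Long>> sessions(long[] timestampsMs, long gapMs) {
        long[] sorted = timestampsMs.clone();
        Arrays.sort(sorted); // event-time order, regardless of arrival order
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > gapMs;
            if (gapExceeded) {
                result.add(current);          // close the running session
                current = new ArrayList<>();  // and open a new one
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {0L, 10_000L, 30_000L, 100_000L, 120_000L, 200_000L};
        // With a 60-second gap these six events form three sessions.
        System.out.println(sessions(timestamps, 60_000L));
    }
}
```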
18. Pattern Detection
case class Event(producer: String, evtType: Int, msg: String)
// Alert carries the message and the state in which the pattern was violated
case class Alert(msg: String, state: Int)
val stream: DataStream[Event] = env.addSource(…)
stream
  .keyBy("producer")
  .flatMap(new RichFlatMapFunction[Event, Alert]() {
    // Per-producer state of the pattern state machine
    lazy val state: ValueState[Int] = getRuntimeContext.getState(…)
    def flatMap(event: Event, out: Collector[Alert]) = {
      val newState = state.value() match {
        case 0 if (event.evtType == 0) => 1
        case 1 if (event.evtType == 1) => 0
        // Any other transition breaks the pattern: emit an alert and reset
        case x => out.collect(Alert(event.msg, x)); 0
      }
      state.update(newState)
    }
  })
State is kept in an embedded key/value state store.
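The state machine in the pattern-detection slide can be mirrored in plain Java to show how the per-key state evolves. This is a sketch only: a `HashMap` stands in for the embedded key/value state store, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: expects event types to alternate 0, 1, 0, 1, ... per producer. */
final class PatternDetectionSketch {
    // Stand-in for keyed state: one integer per producer, defaulting to 0.
    private final Map<String, Integer> stateByProducer = new HashMap<>();
    private final List<String> alerts = new ArrayList<>();

    void onEvent(String producer, int evtType, String msg) {
        int state = stateByProducer.getOrDefault(producer, 0);
        int next;
        if (state == 0 && evtType == 0) {
            next = 1;            // saw the first half of the pattern
        } else if (state == 1 && evtType == 1) {
            next = 0;            // pattern completed, start over
        } else {
            alerts.add(msg);     // unexpected event: raise an alert and reset
            next = 0;
        }
        stateByProducer.put(producer, next);
    }

    List<String> alerts() {
        return alerts;
    }

    public static void main(String[] args) {
        PatternDetectionSketch detector = new PatternDetectionSketch();
        detector.onEvent("a", 0, "a1");
        detector.onEvent("a", 1, "a2");
        detector.onEvent("a", 1, "a3"); // type 1 while in state 0: alert
        System.out.println(detector.alerts());
    }
}
```

Because the state is scoped per key, events from different producers never interfere with each other's pattern.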
20. Many more
• Joining streams (e.g. combining readings from sensors)
• Detecting patterns (CEP)
• Applying (changing) rules or models to events
• Training and applying online machine learning models
• …
23. Example: Windowing by Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
25. Different Notions of Time
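Pipeline: Event Producer → Message Queue (partition 1, partition 2) → Flink Data Source → Flink Window Operator
• Event time: when the event is created at the producer
• Ingestion time: when the event enters Flink at the data source
• Window processing time: when the window operator processes the event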
26. Event Time vs. Processing Time
Processing time (year of release): 1977, 1980, 1983, 1999, 2002, 2005, 2015
Event time (episode, listed in release order): IV, V, VI, I, II, III, VII
27. Out of order Streams
• Events occur on devices
• Events stored in a log (queue / log)
• Events analyzed in a data streaming system (stream analysis)
31. Out of order Streams
Out of order !!!
First burst of events
Second burst of events
32. Out of order Streams
Event time windows
Arrival time windows
Instant event-at-a-time
First burst of events
Second burst of events
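The difference between event-time windows and arrival-time windows can be shown with a small plain-Java sketch. The same five events are counted per 10-second tumbling window, once by their event timestamps and once by their (made-up) arrival timestamps. All names and sample values here are assumptions for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch: the same events grouped by event time and by arrival time. */
final class WindowByTimeSketch {

    /** Counts timestamps per tumbling window, keyed by window start (ms). */
    static Map<Long, Integer> countPerWindow(long[] timestampsMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : timestampsMs) {
            long windowStart = ts - Math.floorMod(ts, windowSizeMs);
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two bursts of events; the third event of the first burst arrives late.
        long[] eventTimes   = {1_000L, 2_000L, 3_000L, 11_000L, 12_000L};
        long[] arrivalTimes = {12_000L, 13_000L, 21_000L, 13_500L, 14_000L};
        System.out.println("event time:   " + countPerWindow(eventTimes, 10_000L));
        System.out.println("arrival time: " + countPerWindow(arrivalTimes, 10_000L));
    }
}
```

Grouping by event time keeps the two bursts apart (3 and 2 events), while grouping by arrival time mixes them (4 and 1 events), even though the underlying data is identical.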
33. Processing Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Window by operator's processing time
34. Ingestion Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
35. Event Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
36. Event Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
37. Event Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignTimestampsAndWatermarks(
new MyTimestampsAndWatermarkGenerator())
tsStream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
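The slides do not show the body of `MyTimestampsAndWatermarkGenerator`. One common strategy for out-of-order streams is to let the watermark trail the highest timestamp seen so far by a fixed bound. The plain-Java sketch below illustrates that idea only. It is not Flink's interface, and the class name, method names, and the 2-second bound used in `main` are assumptions.

```java
/** Illustrative sketch of a bounded-out-of-orderness watermark; not Flink's API. */
final class BoundedLatenessWatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedLatenessWatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Records an event timestamp (ms) and returns it as the event's assigned timestamp. */
    long onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return eventTimestampMs;
    }

    /** Watermark: events with a timestamp at or below this value are assumed to have arrived. */
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    /** A window [start, end) may fire once the watermark reaches its last timestamp, end - 1. */
    boolean windowCanFire(long windowEndMs) {
        return currentWatermark() >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        BoundedLatenessWatermarkSketch wm = new BoundedLatenessWatermarkSketch(2_000L);
        wm.onEvent(10_000L);
        wm.onEvent(9_000L);   // out of order, does not move the watermark backwards
        wm.onEvent(16_500L);
        System.out.println(wm.currentWatermark());
        System.out.println(wm.windowCanFire(15_000L));
    }
}
```

A larger bound tolerates more disorder at the cost of later results; a smaller bound does the opposite.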
40. Mixing Event Time and Processing Time
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)
val stream: DataStream[Event] = env.addSource(…)
val tsStream = stream.assignAscendingTimestamps(_.timestamp)
tsStream
.keyBy("id")
.window(SlidingEventTimeWindows.of(Time.seconds(15), Time.seconds(5)))
.trigger(new MyTrigger())
.sum("measure")
41. Window Triggers
React to any combination of
• Event Time
• Processing Time
• Event data
Example of a mixed EventTime / Proc. Time Trigger:
• Trigger when event time reaches window end
OR
• When processing time reaches window end plus 30 secs.
42. Trigger example
// Fires when event time reaches the window end, or when processing time
// reaches the window end plus 30 seconds, whichever comes first.
public class EventTimeTrigger extends Trigger<Object, TimeWindow> {
  public TriggerResult onElement(Object evt, long time,
      TimeWindow window, TriggerContext ctx) {
    ctx.registerEventTimeTimer(window.maxTimestamp());
    ctx.registerProcessingTimeTimer(window.maxTimestamp() + 30000);
    return TriggerResult.CONTINUE;
  }
  public TriggerResult onEventTime(long time, TimeWindow w, TriggerContext ctx) {
    return TriggerResult.FIRE_AND_PURGE;
  }
  public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
    return TriggerResult.FIRE_AND_PURGE;
  }
}
43. Trigger example
// Variant: the processing-time timer emits a result but keeps the window contents.
public class EventTimeTrigger extends Trigger<Object, TimeWindow> {
  public TriggerResult onElement(Object evt, long time,
      TimeWindow window, TriggerContext ctx) {
    ctx.registerEventTimeTimer(window.maxTimestamp());
    ctx.registerProcessingTimeTimer(window.maxTimestamp() + 30000);
    return TriggerResult.CONTINUE;
  }
  public TriggerResult onEventTime(long time, TimeWindow w, TriggerContext ctx) {
    return TriggerResult.FIRE_AND_PURGE;
  }
  public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
    // FIRE evaluates the window without purging it ("fire and continue")
    return TriggerResult.FIRE;
  }
}
47. Back to the Aggregation Example
case class Event(id: String, measure: Double, timestamp: Long)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(
new FlinkKafkaConsumer09(topic, schema, properties))
stream
.keyBy("id")
.timeWindow(Time.seconds(15), Time.seconds(5))
.sum("measure")
Stateful
48. Fault Tolerance
Prevent data loss (reprocess lost in-flight events)
Recover state consistency (exactly-once semantics)
• Pending windows & user-defined (key/value) state
Checkpoint based fault tolerance
• Periodically create checkpoints
• Recovery: resume from last completed checkpoint
• Asynchronous Barrier Snapshots (ABS) algorithm
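The recovery idea can be illustrated with a toy model in plain Java: a checkpoint stores the source position together with a copy of the operator state, and recovery restores both before replaying. This is a single-threaded sketch under simplifying assumptions (one source, one operator, a replayable input array); it does not model barriers or the ABS algorithm, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch of checkpoint-based recovery for a running per-key sum. */
final class CheckpointSketch {

    /** A consistent snapshot: source offset plus operator state at that offset. */
    static final class Checkpoint {
        final int offset;
        final Map<String, Double> sums;

        Checkpoint(int offset, Map<String, Double> sums) {
            this.offset = offset;
            this.sums = new HashMap<>(sums); // copy, so later updates do not leak in
        }
    }

    /** Applies events [from, to) to the running sums. */
    static void process(String[] keys, double[] values, int from, int to,
                        Map<String, Double> sums) {
        for (int i = from; i < to; i++) {
            sums.merge(keys[i], values[i], Double::sum);
        }
    }

    /** Takes a checkpoint at checkpointAt, "fails" at failAt, then recovers and finishes. */
    static Map<String, Double> runWithFailure(String[] keys, double[] values,
                                              int checkpointAt, int failAt) {
        Map<String, Double> sums = new HashMap<>();
        process(keys, values, 0, checkpointAt, sums);
        Checkpoint checkpoint = new Checkpoint(checkpointAt, sums);
        process(keys, values, checkpointAt, failAt, sums); // progress lost in the failure

        // Recovery: discard live state, restore the snapshot, rewind the source.
        Map<String, Double> restored = new HashMap<>(checkpoint.sums);
        process(keys, values, checkpoint.offset, keys.length, restored);
        return restored;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "b", "a"};
        double[] values = {1, 2, 3, 4, 5};
        // Every event is reflected in the result exactly once despite the failure.
        System.out.println(runWithFailure(keys, values, 2, 4));
    }
}
```

Because state and source position are restored together, replayed events are never double-counted in the state, which is the essence of exactly-once state semantics.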
53. Savepoints
A "Checkpoint" is a globally consistent point-in-time snapshot of the streaming program (point in stream, state).
A "Savepoint" is a user-triggered retained checkpoint.
Streaming programs can start from a savepoint.
54. (Re)processing data (in batch)
• Re-processing data (what-if exploration, to correct bugs, etc.)
• Usually by running a batch job with a set of old files
• Tools that map files to times
[Figure: collection of files, by ingestion time (2016-3-1 12:00 am, 1:00 am, 2:00 am, …, 2016-3-11 10:00 pm, 11:00 pm, 2016-3-12 12:00 am, 1:00 am, …), with a range of files sent to the batch processor]
55. Unclear Batch Boundaries
[Figure: the same hourly files (2016-3-1 12:00 am through 2016-3-12 1:00 am, …); it is unclear whether the files at the edges of the selected range belong to the batch sent to the batch processor]
What about sessions across batches?
56. (Re)processing data (streaming)
Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
Reprocess by starting a new job from a savepoint
• Defines start position in stream (for example Kafka offsets)
• Initializes pending state (like partial sessions)
[Figure: a new streaming program is run from a savepoint]
57. Continuous Data Sources
• Stream of Kafka partitions: a savepoint holds Kafka offsets + operator state
• Stream view over a sequence of files (2016-3-1 12:00 am, 1:00 am, 2:00 am, …, 2016-3-12 1:00 am, …): a savepoint holds file mod timestamp + file position + operator state
WIP (target: Flink 1.1)
58. Upgrading Programs
A program starting from a savepoint can differ from the program that created the savepoint
• Unique operator names match state and operator
This mechanism can be used to fix bugs in programs, to evolve programs, parameters, libraries, …
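The matching of savepoint state to operators by unique name can be pictured with a small plain-Java sketch. It is a conceptual model only: a savepoint is reduced to a map from operator name to a single counter, operators without saved state start from zero, and saved state without a matching operator is dropped. Those last two behaviours, along with every name used, are assumptions made for the illustration rather than a description of Flink.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: restoring state into an upgraded program by operator name. */
final class SavepointRestoreSketch {

    /** Returns the initial state of each operator in the new program. */
    static Map<String, Long> restore(Map<String, Long> savepointState,
                                     List<String> newProgramOperators) {
        Map<String, Long> initialState = new LinkedHashMap<>();
        for (String operatorName : newProgramOperators) {
            // Matching name: reuse saved state. New operator: start empty (0).
            initialState.put(operatorName, savepointState.getOrDefault(operatorName, 0L));
        }
        return initialState;
    }

    public static void main(String[] args) {
        Map<String, Long> savepoint = new HashMap<>();
        savepoint.put("window-sum", 42L);
        savepoint.put("dedupe", 7L);
        // The upgraded program keeps "window-sum", drops "dedupe", and adds "alerts".
        System.out.println(restore(savepoint, Arrays.asList("window-sum", "alerts")));
    }
}
```

Stable, unique operator names are what make this lookup possible after the program's structure changes.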
59. State Backends
Large state is a collection of key/value pairs
State backend defines what data structure holds the state, plus how it is snapshotted
Most common choices
• Main memory – snapshots to master
• Main memory – snapshots to dist. filesystem
• RocksDB – snapshots to dist. filesystem
61. Example: Temperature Monitoring
Receiving temperature and power events from sensors
Looking for temperatures repeatedly exceeding thresholds within a short time period (10 secs)
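A minimal plain-Java sketch of this check follows. It interprets "repeatedly" as two above-threshold readings from the same sensor within the time span, and it assumes readings arrive in timestamp order. Both are assumptions, as are the class name, method names, and the sample threshold.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch: flags two above-threshold readings within a short time span. */
final class TemperatureMonitorSketch {
    private final double threshold;
    private final long withinMs;
    // Timestamp (ms) of the most recent above-threshold reading per sensor.
    private final Map<String, Long> lastExceedance = new HashMap<>();

    TemperatureMonitorSketch(double threshold, long withinMs) {
        this.threshold = threshold;
        this.withinMs = withinMs;
    }

    /** Returns true if this reading and the previous exceedance lie within the time span. */
    boolean onReading(String sensor, double temperature, long timestampMs) {
        if (temperature <= threshold) {
            return false; // normal reading, nothing to track
        }
        Long previous = lastExceedance.put(sensor, timestampMs);
        return previous != null && timestampMs - previous <= withinMs;
    }

    public static void main(String[] args) {
        TemperatureMonitorSketch monitor = new TemperatureMonitorSketch(100.0, 10_000L);
        System.out.println(monitor.onReading("s1", 105.0, 1_000L)); // first exceedance
        System.out.println(monitor.onReading("s1", 110.0, 8_000L)); // second within 10 secs
    }
}
```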