Stephan Ewen
@stephanewen
Streaming Analytics
with Apache Flink
Apache Flink Stack

Libraries / Apache Beam
DataStream API (Stream Processing) | DataSet API (Batch Processing)
Runtime (Distributed Streaming Data Flow)

Streaming and batch as first class citizens.
Today

Streaming and batch as first class citizens: the same stack, with the DataStream API (stream processing) alongside the DataSet API (batch processing), both on a runtime for distributed streaming data flow, with libraries and Apache Beam on top.
Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Continuous Processing with Batch

 Continuous ingestion
 Periodic (e.g., hourly) files
 Periodic batch jobs
λ Architecture

 "Batch layer": what we had before
 "Stream layer": approximate early results
A Stream Processing Pipeline

collect → log → analyze → serve & store
Programs and Dataflows

A program is a dataflow of operators: Source → Transformation → Transformation → Sink.
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

val events: DataStream[Event] = lines.map((line) => parse(line))

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())   // a window function; sum() would take a field name instead

stats.addSink(new RollingSink(path))
[Figure: the resulting streaming dataflow, executed in parallel: Source [1], [2] → map() [1], [2] → keyBy()/window()/apply() [1], [2] → Sink [1].]
Why does Flink stream flink?

 True streaming: low latency, high throughput, well-behaved flow control (back pressure)
 Event time: make more sense of data; works on real-time and historic data
 Stateful streaming: globally consistent savepoints; exactly-once semantics for fault tolerance; windows & user-defined state
 APIs & libraries: flexible windows (time, count, session, roll-your-own); Complex Event Processing
Streaming Analytics by Example
Time-Windowed Aggregations

case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))   // tumbling window: 5 seconds
  .sum("measure")
Time-Windowed Aggregations

case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(60), Time.seconds(5))   // sliding window: 60s size, 5s slide
  .sum("measure")
Session-Windowed Aggregations

case class Event(sensor: String, measure: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("sensor")
  .window(EventTimeSessionWindows.withGap(Time.seconds(60)))   // a session closes after 60s of inactivity
  .max("measure")
Pattern Detection

case class Event(producer: String, evtType: Int, msg: String)
case class Alert(msg: String)

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("producer")
  .flatMap(new RichFlatMapFunction[Event, Alert]() {

    // embedded key/value state store: one Int kept per key (producer)
    lazy val state: ValueState[Int] = getRuntimeContext.getState(…)

    def flatMap(event: Event, out: Collector[Alert]) = {
      val newState = state.value() match {
        case 0 if (event.evtType == 0) => 1
        case 1 if (event.evtType == 1) => 0
        case x => out.collect(Alert(event.msg)); 0   // unexpected transition: alert and reset
      }
      state.update(newState)
    }
  })
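The getState(…) call above is elided on the slide; it takes a state descriptor. A minimal sketch of the same function with the descriptor spelled out (the state name "pattern-state" and the default value 0 are assumptions for illustration, not from the slides):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.util.Collector

class PatternCheck extends RichFlatMapFunction[Event, Alert] {

  // register keyed state: Flink stores and checkpoints one Int per producer
  lazy val state: ValueState[Int] = getRuntimeContext.getState(
    new ValueStateDescriptor[Int]("pattern-state", classOf[Int], 0))

  def flatMap(event: Event, out: Collector[Alert]): Unit = {
    val newState = state.value() match {
      case 0 if event.evtType == 0 => 1
      case 1 if event.evtType == 1 => 0
      case _ => out.collect(Alert(event.msg)); 0   // unexpected transition: alert and reset
    }
    state.update(newState)
  }
}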
Many more

 Joining streams (e.g., combining readings from sensors; see the sketch below)
 Detecting patterns (CEP)
 Applying (changing) rules or models to events
 Training and applying online machine learning models
 …
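As an illustration of the first bullet, a minimal sketch of a windowed two-stream join, assuming hypothetical Temperature and Humidity readings (the case classes and the 10-second window are assumptions, not from the slides):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Temperature(sensor: String, celsius: Double)
case class Humidity(sensor: String, percent: Double)
case class Combined(sensor: String, celsius: Double, percent: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val temps: DataStream[Temperature] = env.addSource(…)     // sources elided, as on the slides
val humidity: DataStream[Humidity] = env.addSource(…)

// join readings of the same sensor that fall into the same 10-second window
val combined: DataStream[Combined] = temps.join(humidity)
  .where(_.sensor)
  .equalTo(_.sensor)
  .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
  .apply((t, h) => Combined(t.sensor, t.celsius, h.percent))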
(It's) About Time
The biggest change in moving from batch to streaming is handling time explicitly.
Example: Windowing by Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Different Notions of Time

[Figure: an event travels from the Event Producer through a Message Queue (two partitions) into the Flink Data Source and on to the Flink Window Operator. The time the event was produced is Event Time, the time it enters Flink is Ingestion Time, and the time the window operator processes it is Window Processing Time.]
Time and the Dataflow Model

 Event Time semantics in Flink follow the Dataflow model (Apache Beam (incub.))
 See the previous talk by Frances Perry & Tyler Akidau
 For the sake of time (no pun intended), only a brief recap of the basic concept here
Event Time vs. Processing Time

[Figure: the Star Wars saga as a timeline. Ordered by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the sequence is Episodes IV, V, VI, I, II, III, VII; ordered by event time (story order) it is Episodes I through VII. Events can arrive in a different order than the time they describe.]
Processing Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(ProcessingTime)   // windows follow the operator's wall clock

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Ingestion Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(IngestionTime)   // timestamps assigned as events enter Flink

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Event Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)   // windows follow the timestamps carried by the events

val stream: DataStream[Event] = env.addSource(…)

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Event Time

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(EventTime)

val stream: DataStream[Event] = env.addSource(…)

// extract timestamps from the events and generate watermarks
val tsStream = stream.assignTimestampsAndWatermarks(
  new MyTimestampsAndWatermarkGenerator())

tsStream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))
  .sum("measure")
Watermarks

[Figure: two streams of events annotated with their timestamps, with watermarks flowing among them. In an in-order stream, watermarks like W(11) and W(17) simply trail the events with those timestamps. In an out-of-order stream, a watermark W(11) may arrive after events with higher timestamps have already passed; it asserts that no events with timestamps below 11 should follow.]
Watermarks in Parallel

[Figure: a parallel dataflow, Source (1)/(2) → map (1)/(2) → window (1)/(2), with watermark generation at the sources. Watermarks flow with the events; each operator tracks the event time of each input stream, takes the minimum across its inputs as its own event time, and emits updated watermarks downstream.]
Per Kafka Partition Watermarks

[Figure: the same parallel dataflow, but watermarks are generated per Kafka partition inside the sources. Each source merges its partitions' watermarks (taking the minimum) before forwarding them, so skew between partitions is accounted for at the source.]
Matters of State
(Fault Tolerance, Reinstatements, etc.)
Back to the Aggregation Example

case class Event(id: String, measure: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val stream: DataStream[Event] = env.addSource(
  new FlinkKafkaConsumer09(topic, schema, properties))

stream
  .keyBy("id")
  .timeWindow(Time.seconds(15), Time.seconds(5))   // stateful: pending windows live in operator state
  .sum("measure")
Fault Tolerance

 Prevent data loss (reprocess lost in-flight events)
 Recover state consistency (exactly-once semantics)
• Pending windows & user-defined (key/value) state
 Checkpoint-based fault tolerance (see the sketch after this list)
• Periodically create checkpoints
• Recovery: resume from last completed checkpoint
• Async. Barrier Snapshots (ABS) algorithm
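A minimal sketch of turning on checkpointing for a job; the 5-second interval and the checkpoint path are assumptions for illustration, not from the slides:

import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.runtime.state.filesystem.FsStateBackend

val env = StreamExecutionEnvironment.getExecutionEnvironment

// draw a consistent snapshot of all operator state every 5 seconds (assumed interval)
env.enableCheckpointing(5000)

// keep the snapshots in a (hypothetical) file system path
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"))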
Checkpoints

[Figure: a data stream, newer records on the left, older on the right, with two checkpoint positions marking the state of the dataflow at point X and at point Y.]
Checkpoint Barriers

 Markers injected into the streams
Checkpoint Procedure

[Figures: checkpoint barriers flow through the dataflow with the events; when an operator has received the barrier on all of its inputs, it snapshots its state and forwards the barrier, so a globally consistent snapshot is assembled without stopping the stream.]
Savepoints

 A "Checkpoint" is a globally consistent point-in-time snapshot of the streaming program (point in stream, state)
 A "Savepoint" is a user-triggered, retained checkpoint
 Streaming programs can start from a savepoint (see the sketch below)

[Figure: a stream with two marked positions, Savepoint A and Savepoint B.]
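A sketch of the savepoint workflow on the command line (the job ID, savepoint path, and jar are placeholders; syntax as of the Flink 1.0-era CLI, so check the docs of your version):

bin/flink savepoint <jobId>                    # trigger a savepoint for a running job
bin/flink run -s <savepointPath> <jarFile>     # start a new job from that savepoint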
(Re)processing data (in batch)

 Re-processing data (what-if exploration, to correct bugs, etc.)
 Usually by running a batch job over a set of old files
 Tools that map files to times

[Figure: a collection of hourly files organized by ingestion time, from 2016-3-1 12:00 am through 2016-3-12 1:00 am and beyond; a subset of them is handed to the batch processor.]
Unclear Batch Boundaries

[Figure: the same collection of hourly files, but the cut points for the batch job are marked with question marks. What about sessions that span across batches?]
(Re)processing data (streaming)

 Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
 Reprocess by starting a new job from a savepoint
• Defines the start position in the stream (for example, Kafka offsets)
• Initializes pending state (like partial sessions)

[Figure: a savepoint marked in the stream; a new streaming program runs from that savepoint.]
Continuous Data Sources

[Figure: two continuous sources with periodic savepoints. For a stream of Kafka partitions, a savepoint captures Kafka offsets + operator state. For a stream view over a sequence of files, it captures file modification timestamp + file position + operator state; this variant is WIP (target: Flink 1.1).]
Complex Event Processing Primer

Demo Time
Demo Scenario

Pattern validation & violation detection:
 Events should follow a certain pattern, or an alert should be raised
 Think cybersecurity, process monitoring, etc. (a CEP sketch follows below)
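For flavor, a minimal sketch using the early (Flink 1.0-era) CEP Pattern API, reusing the Event and Alert classes from the pattern-detection slides. The pattern itself (a type-0 event followed by a type-1 event within a minute) is an assumption, and note that this selects matching sequences; true violation detection additionally needs timeout handling:

import java.util.{Map => JMap}
import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.cep.{CEP, PatternSelectFunction}
import org.apache.flink.cep.pattern.Pattern
import org.apache.flink.streaming.api.windowing.time.Time

val pattern: Pattern[Event, _] = Pattern.begin[Event]("first")
  .where(new FilterFunction[Event] { def filter(e: Event) = e.evtType == 0 })
  .followedBy("second")
  .where(new FilterFunction[Event] { def filter(e: Event) = e.evtType == 1 })
  .within(Time.seconds(60))

// emit an alert for every matched sequence, keyed per producer
// (CEP here is the Java library, so we pass the underlying Java stream)
val alerts = CEP.pattern(stream.keyBy("producer").javaStream, pattern)
  .select(new PatternSelectFunction[Event, Alert] {
    def select(matched: JMap[String, Event]): Alert = Alert(matched.get("first").msg)
  })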
An Outlook on Things to Come
Flink in the wild

 30 billion events daily
 2 billion events in 10 1Gb machines
 data integration & distribution platform
Roadmap

 Dynamic Scaling, Resource Elasticity
 Stream SQL
 CEP enhancements
 Incremental & asynchronous state snapshotting
 Mesos support
 More connectors, end-to-end exactly-once
 API enhancements (e.g., joins, slowly changing inputs)
 Security (data encryption, Kerberos with Kafka)
Apache Flink Meetup - Thursday, April 28th
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers
I stream, do you?
