Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen

Flink and Kafka are popular components for building an open source stream processing infrastructure. We present how Flink integrates with Kafka to provide a platform with a unique feature set that matches the challenging requirements of advanced stream processing applications. In particular, we dive into the following points:

- Flink's support for event-time processing, how it handles out-of-order streams, and how it can perform analytics on historical and real-time streams served from Kafka's persistent log using the same code. We present Flink's windowing mechanism, which supports time-, count-, and session-based windows, and intermixing event-time and processing-time semantics in one program.
- How Flink's checkpointing mechanism integrates with Kafka for fault tolerance, for consistent stateful applications with exactly-once semantics.
- "Savepoints", which allow users to save the state of the streaming program at any point in time. Together with a durable event log like Kafka, savepoints allow users to pause/resume streaming programs, go back to prior states, or switch to different versions of the program, while preserving exactly-once semantics.
- The techniques behind the combination of low-latency and high-throughput streaming, and how the latency/throughput trade-off can be configured.
- An outlook on current developments for streaming analytics, such as streaming SQL and complex event processing.

1. Stephan Ewen (@stephanewen): Streaming Analytics with Apache Flink
2. Apache Flink Stack: the DataStream API (stream processing) and the DataSet API (batch processing) sit on a common runtime for distributed streaming dataflows, with libraries and Apache Beam layered on top. Streaming and batch as first-class citizens.
3. Today: the streaming side of that stack, the DataStream API on the runtime. Streaming and batch as first-class citizens.
4. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
5. Continuous Processing with Batch: continuous ingestion, periodic (e.g., hourly) files, periodic batch jobs.
6. λ Architecture: the "batch layer" is what we had before; the "stream layer" adds approximate early results.
7. A Stream Processing Pipeline: collect, log, analyze, serve & store.
8. Programs and Dataflows: a program is a graph of sources, transformations, and sinks, which the runtime executes as a parallel streaming dataflow (e.g., two parallel source and map tasks feeding two parallel keyBy()/window()/apply() tasks and one sink).

    val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

    val events: DataStream[Event] = lines.map((line) => parse(line))

    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .apply(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))
9. Why does Flink stream flink? Low latency and high throughput with well-behaved flow control (back pressure); true streaming that works on real-time and historic data, making more sense of data; event time support; APIs and libraries; stateful streaming with globally consistent savepoints and exactly-once semantics for fault tolerance; windows and user-defined state, with flexible windows (time, count, session, roll-your-own); complex event processing.
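One knob behind the latency/throughput combination above is the network buffer timeout: Flink ships records in buffers that are flushed when full or when a timer fires. A minimal sketch (the 10 ms value is an illustrative assumption, not from the deck):

    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Flush partially filled network buffers after at most 10 ms:
    // lower timeouts reduce latency, higher timeouts favor throughput
    // (full buffers are always shipped immediately).
    env.setBufferTimeout(10)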
10. Streaming Analytics by Example
11. Time-Windowed Aggregations

    case class Event(sensor: String, measure: Double)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .sum("measure")
12. Time-Windowed Aggregations

    case class Event(sensor: String, measure: Double)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("sensor")
      .timeWindow(Time.seconds(60), Time.seconds(5))
      .sum("measure")
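The difference between the two examples: timeWindow with a single size argument creates tumbling (non-overlapping) windows, while the two-argument form creates sliding windows, here a 60-second window evaluated every 5 seconds.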
13. Session-Windowed Aggregations

    case class Event(sensor: String, measure: Double)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("sensor")
      .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
      .max("measure")
14. Pattern Detection

    case class Event(producer: String, evtType: Int, msg: String)
    case class Alert(msg: String)

    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("producer")
      .flatMap(new RichFlatMapFunction[Event, Alert]() {

        lazy val state: ValueState[Int] =
          getRuntimeContext.getState(…)

        def flatMap(event: Event, out: Collector[Alert]) = {
          val newState = state.value() match {
            case 0 if (event.evtType == 0) => 1
            case 1 if (event.evtType == 1) => 0
            case _ => out.collect(Alert(event.msg)); 0
          }
          state.update(newState)
        }
      })
15. Pattern Detection (same code as slide 14), with the callout that the ValueState is an embedded key/value state store, scoped automatically to the current key.
16. Many more: joining streams (e.g., combining readings from sensors; a sketch follows below), detecting patterns (CEP), applying (changing) rules or models to events, training and applying online machine learning models, …
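A minimal sketch of the first of those cases, a windowed join of two sensor streams; the case classes, the two streams, and the 10-second window are illustrative assumptions, not from the deck:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Temperature(sensor: String, celsius: Double)
    case class Humidity(sensor: String, percent: Double)
    case class Reading(sensor: String, celsius: Double, percent: Double)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val temperatures: DataStream[Temperature] = env.addSource(…)
    val humidities: DataStream[Humidity] = env.addSource(…)

    // Combine readings of the same sensor that fall into the same
    // 10-second event-time window.
    val combined: DataStream[Reading] = temperatures
      .join(humidities)
      .where(_.sensor)
      .equalTo(_.sensor)
      .window(TumblingEventTimeWindows.of(Time.seconds(10)))
      .apply((t, h) => Reading(t.sensor, t.celsius, h.percent))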
17. (It's) About Time
18. The biggest change in moving from batch to streaming is handling time explicitly.
19. Example: Windowing by Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
20. Different Notions of Time: an event travels from the event producer through the message queue (partition 1, partition 2, …) to the Flink data source and on to the Flink window operator. Event time is when the event was created at the producer, ingestion time is when it enters the Flink source, and window processing time is when the window operator actually processes it.
21. Time and the Dataflow Model: event time semantics in Flink follow the Dataflow model (Apache Beam (incub.)); see the previous talk by Frances Perry & Tyler Akidau. For the sake of time (no pun intended), I only briefly recap the basic concept.
22. Event Time vs. Processing Time: by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015), the Star Wars movies run Episode IV, V, VI, I, II, III, VII; by event time, they run in episode order.
23. Processing Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(ProcessingTime)

    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
24. Ingestion Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(IngestionTime)

    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
25. Event Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(EventTime)

    val stream: DataStream[Event] = env.addSource(…)

    stream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
26. Event Time

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(EventTime)

    val stream: DataStream[Event] = env.addSource(…)

    val tsStream = stream.assignTimestampsAndWatermarks(
      new MyTimestampsAndWatermarkGenerator())

    tsStream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
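A minimal sketch of what the MyTimestampsAndWatermarkGenerator above could look like, using the periodic-assigner interface of the Flink 1.x API; the one-second out-of-orderness bound is an assumption, not from the deck:

    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
    import org.apache.flink.streaming.api.watermark.Watermark

    class MyTimestampsAndWatermarkGenerator
        extends AssignerWithPeriodicWatermarks[Event] {

      // Assumption: events are at most 1 second out of order.
      val maxOutOfOrderness = 1000L
      var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

      // Take the event time from the record itself and track the maximum seen.
      override def extractTimestamp(event: Event, previousTimestamp: Long): Long = {
        currentMaxTimestamp = math.max(currentMaxTimestamp, event.timestamp)
        event.timestamp
      }

      // Called periodically by Flink: promises that no events with a
      // timestamp at or below the watermark are still expected.
      override def getCurrentWatermark(): Watermark =
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
    }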
27. Watermarks: a watermark W(t) is a record flowing with the stream that asserts that no events with a timestamp at or below t are still expected. (Diagram: two event streams annotated with timestamps, one in order and one out of order, with watermarks such as W(11), W(17), and W(20) interleaved.)
28. Watermarks in Parallel: each parallel source instance generates watermarks independently. Every downstream operator instance (map, window) tracks its own event-time clock, taking the minimum of the watermarks of its input streams and forwarding watermarks downstream. (Diagram: two sources feeding two map and two window tasks, with per-operator event-time clocks such as 14, 17, and 29, and watermarks such as W(17) and W(33).)
29. Per Kafka Partition Watermarks: watermark generation can also happen inside the Kafka source, per Kafka partition, before the partitions are merged into one stream; the source then emits the minimum across its partitions' watermarks. (Diagram: the same dataflow as slide 28, with watermark generation moved into the sources.)
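A hedged sketch of wiring a per-partition assigner into the Kafka source; the assignTimestampsAndWatermarks hook on the consumer is what the slide alludes to (its availability depends on the Flink version), and topic, schema, and properties are placeholders as elsewhere in the deck:

    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09

    val consumer = new FlinkKafkaConsumer09[Event](topic, schema, properties)

    // Attaching the assigner to the consumer (rather than to the merged
    // stream) gives each Kafka partition its own timestamp and watermark
    // generation; the source forwards the minimum across its partitions.
    consumer.assignTimestampsAndWatermarks(new MyTimestampsAndWatermarkGenerator())

    val stream: DataStream[Event] = env.addSource(consumer)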
30. Matters of State (Fault Tolerance, Reinstatements, etc.)
31. Back to the Aggregation Example: both the Kafka source and the windowed aggregation are stateful.

    case class Event(id: String, measure: Double, timestamp: Long)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val stream: DataStream[Event] = env.addSource(
      new FlinkKafkaConsumer09(topic, schema, properties))

    stream
      .keyBy("id")
      .timeWindow(Time.seconds(15), Time.seconds(5))
      .sum("measure")
32. Fault Tolerance: prevent data loss (reprocess lost in-flight events) and recover state consistency (exactly-once semantics) for pending windows and user-defined (key/value) state. Flink uses checkpoint-based fault tolerance: periodically create checkpoints, and on recovery resume from the last completed checkpoint, using the Asynchronous Barrier Snapshots (ABS) algorithm.
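A minimal sketch of switching the mechanism on; the 5-second interval is an illustrative assumption:

    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Draw a consistent snapshot of all operator state every 5 seconds;
    // after a failure, the job rolls back to the last completed checkpoint
    // and reprocesses the records since then.
    env.enableCheckpointing(5000)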
33. Checkpoints. (Diagram: a data stream of events, with older and newer records, and the state of the dataflow at points X and Y marked.)
34. Checkpoint Barriers: markers, injected into the streams.
35. Checkpoint Procedure (diagram).
36. Checkpoint Procedure, continued (diagram).
37. Savepoints: a "checkpoint" is a globally consistent point-in-time snapshot of the streaming program (point in stream, state); a "savepoint" is a user-triggered, retained checkpoint. Streaming programs can start from a savepoint. (Diagram: savepoints A and B along a stream.)
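Operationally, savepoints are drawn and used from the command line; in the Flink 1.x CLI of this era that looks roughly like bin/flink savepoint <jobID> to trigger one and bin/flink run -s <savepointPath> … to start a program from it (the exact command syntax is an assumption from the Flink documentation of the time, not from the deck).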
38. (Re)processing data (in batch): re-processing data (what-if exploration, to correct bugs, etc.) is usually done by running a batch job over a set of old files, with tools that map files to times. (Diagram: a collection of hourly files by ingestion time, from 2016-3-1 12:00 am through 2016-3-12 1:00 am, fed to the batch processor.)
39. Unclear Batch Boundaries: which of the hourly files belong to which batch, and what about sessions that span batches? (Same timeline as slide 38, with question marks at the batch boundaries.)
40. (Re)processing data (streaming): draw savepoints at times that you will want to start new jobs from (daily, hourly, …) and reprocess by starting a new job from a savepoint. The savepoint defines the start position in the stream (for example, Kafka offsets) and initializes the pending state (like partial sessions).
41. Continuous Data Sources: the same idea works for a stream of Kafka partitions and for a stream view over a sequence of files. A savepoint then captures Kafka offsets plus operator state, or file modification timestamp plus file position plus operator state. WIP (target: Flink 1.1). (Diagram: the hourly-file timeline from slide 38, viewed as partitions with savepoints.)
42. Complex Event Processing Primer: Demo Time
43. Demo Scenario: pattern validation & violation detection. Events should follow a certain pattern, or an alert should be raised; think cybersecurity, process monitoring, etc. (A sketch of such a pattern follows below.)
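A hedged sketch of how such a pattern could be expressed with the Flink CEP library, reusing the Event and Alert types from slide 14; the API details varied across early CEP versions (a Scala CEP API came later, hence the javaStream bridge), and the pattern itself, "a type-0 event must be followed by a type-1 event within 10 seconds", is an illustrative assumption:

    import org.apache.flink.api.common.functions.FilterFunction
    import org.apache.flink.cep.{CEP, PatternSelectFunction}
    import org.apache.flink.cep.pattern.Pattern
    import org.apache.flink.streaming.api.windowing.time.Time

    val pattern = Pattern.begin[Event]("start")
      .where(new FilterFunction[Event] {
        override def filter(e: Event): Boolean = e.evtType == 0
      })
      .followedBy("end")
      .where(new FilterFunction[Event] {
        override def filter(e: Event): Boolean = e.evtType == 1
      })
      .within(Time.seconds(10))

    // Emit an alert for every completed match, per producer; the violation
    // side of the demo would additionally react to patterns that time out.
    val alerts = CEP.pattern(stream.keyBy("producer").javaStream, pattern)
      .select(new PatternSelectFunction[Event, Alert] {
        override def select(m: java.util.Map[String, Event]): Alert =
          Alert("pattern completed for " + m.get("start").producer)
      })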
44. An Outlook on Things to Come
45. Flink in the wild: 30 billion events daily; 2 billion events on 10 1Gb machines; a data integration & distribution platform. (The company logos and the "see talks by … at …" pointers were lost in extraction.)
46. Roadmap: dynamic scaling and resource elasticity; Stream SQL; CEP enhancements; incremental & asynchronous state snapshotting; Mesos support; more connectors, end-to-end exactly-once; API enhancements (e.g., joins, slowly changing inputs); security (data encryption, Kerberos with Kafka).
47. Apache Flink Meetup: Thursday, April 28th.
48. Flink Forward 2016, Berlin. Submission deadline: June 30, 2016; early bird deadline: July 15, 2016. www.flink-forward.org
49. We are hiring! data-artisans.com/careers
50. I stream, do you?
