
Apache Flink @ NYC Flink Meetup

Streaming analytics (real-time and continuous applications) with Apache Flink, presented at the NYC Flink meetup. The talk covers the semantics of time and state.

  1. Streaming Analytics with Apache Flink 1.0: Stephan Ewen (@stephanewen)
  2. Apache Flink Stack: the DataStream API (stream processing) and the DataSet API (batch processing), libraries on top, and a common runtime underneath (distributed streaming data flow). Streaming and batch as first class citizens.
  3. Today: the streaming side of that stack, the DataStream API. Streaming and batch as first class citizens.
  4. Streaming is the next programming paradigm for data applications, and you need to start thinking in terms of streams.
  5. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
  6. Continuous Processing with Batch:
      • Continuous ingestion
      • Periodic (e.g., hourly) files
      • Periodic batch jobs
  7. λ Architecture: a "batch layer" (what we had before) plus a "stream layer" (approximate early results).
  8. A Stream Processing Pipeline: collect → log → analyze → serve & store.
  9. A brief History of Flink: Project Stratosphere (the Flink precursor) started in January '10; Flink entered Apache incubation in April '14 (releases v0.5, v0.6, v0.7); it became a top-level project in December '14 (v0.8, v0.9, v0.10); Release 1.0 followed in March '16.
  10. A brief History of Flink, the academia gap: from reading/writing papers, teaching, and worrying about the thesis, to realizing this might be interesting to people beyond academia (even more so, actually).
  11. Programs and Dataflows: from source through transformations to sink.

      val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

      val events: DataStream[Event] = lines.map((line) => parse(line))

      val stats: DataStream[Statistic] = events
        .keyBy("sensor")
        .timeWindow(Time.seconds(5))
        .apply(new MyAggregationFunction())

      stats.addSink(new RollingSink(path))

      The program translates into a streaming dataflow: parallel Source [1]/[2], map() [1]/[2], and keyBy()/window()/apply() [1]/[2] subtasks feeding a Sink [1].
  12. What makes Flink flink?
      • True streaming: low latency, high throughput, well-behaved flow control (back pressure)
      • Make more sense of data: event time, works on real-time and historic data
      • Stateful streaming: windows & user-defined state, globally consistent savepoints, exactly-once semantics for fault tolerance
      • Flexible windows (time, count, session, roll-your-own)
      • APIs and libraries, e.g., Complex Event Processing
  13. Streaming Analytics by Example
  14. Time-Windowed Aggregations (tumbling window):

      case class Event(sensor: String, measure: Double)

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      val stream: DataStream[Event] = env.addSource(…)

      stream
        .keyBy("sensor")
        .timeWindow(Time.seconds(5))
        .sum("measure")

  15. Time-Windowed Aggregations (sliding window); same program, but:

      stream
        .keyBy("sensor")
        .timeWindow(Time.seconds(60), Time.seconds(5))
        .sum("measure")

  16. Session-Windowed Aggregations (Flink 1.1 syntax):

      stream
        .keyBy("sensor")
        .window(EventTimeSessionWindows.withGap(Time.seconds(60)))
        .max("measure")
  18. Pattern Detection, using Flink's embedded key/value state store:

      case class Event(producer: String, evtType: Int, msg: String)
      case class Alert(msg: String)

      val stream: DataStream[Event] = env.addSource(…)

      stream
        .keyBy("producer")
        .flatMap(new RichFlatMapFunction[Event, Alert]() {

          // per-key state, held in the embedded key/value state store
          lazy val state: ValueState[Int] = getRuntimeContext.getState(…)

          def flatMap(event: Event, out: Collector[Alert]) = {
            val newState = state.value() match {
              case 0 if (event.evtType == 0) => 1
              case 1 if (event.evtType == 1) => 0
              case x => out.collect(Alert(event.msg)); 0
            }
            state.update(newState)
          }
        })
  20. Many more:
      • Joining streams (e.g., combining readings from several sensors; see the sketch below)
      • Detecting patterns (CEP)
      • Applying (changing) rules or models to events
      • Training and applying online machine learning models
      • …
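      A minimal sketch of the first bullet: a windowed join of two sensor feeds on a common key. The case classes, stream contents, and job name are invented for illustration; only the join/where/equalTo/window pattern is the DataStream API itself.

        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
        import org.apache.flink.streaming.api.windowing.time.Time

        // Hypothetical readings from two sensor feeds.
        case class Temperature(sensor: String, celsius: Double)
        case class Power(sensor: String, watts: Double)

        object SensorJoin {
          def main(args: Array[String]): Unit = {
            val env = StreamExecutionEnvironment.getExecutionEnvironment

            val temps = env.fromElements(Temperature("s1", 21.5), Temperature("s2", 23.0))
            val power = env.fromElements(Power("s1", 400.0), Power("s2", 350.0))

            // Pair up temperature and power readings of the same sensor
            // that fall into the same 5-second window.
            temps.join(power)
              .where(_.sensor)
              .equalTo(_.sensor)
              .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
              .apply { (t, p) => (t.sensor, t.celsius, p.watts) }
              .print()

            env.execute("sensor join")
          }
        }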
  21. (It's) About Time
  22. The biggest change in moving from batch to streaming is handling time explicitly.
  23. Example: Windowing by Time:

      case class Event(id: String, measure: Double, timestamp: Long)

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      val stream: DataStream[Event] = env.addSource(…)

      stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
  25. Different Notions of Time: an event carries its event time from the producer; it is stamped with an ingestion time when it enters Flink at the data source (possibly via a partitioned message queue); window processing time is the clock of the Flink window operator that eventually processes it.
  26. Event Time vs. Processing Time, by Star Wars: in processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the saga runs Episodes IV, V, VI, I, II, III, VII; in event time it runs Episodes I through VII in story order.
  27. Out of order Streams: events occur on devices, are stored in a queue / log, and are analyzed later in a data streaming system.
  31. Out of order Streams: a first burst of events and a second burst of events arrive interleaved. Out of order!!!
  32. Out of order Streams: event time windows group the two bursts correctly, arrival time windows mix them up, and instant event-at-a-time processing sees only arrival order.
  33. Processing Time (window by the operator's processing-time clock):

      case class Event(id: String, measure: Double, timestamp: Long)

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setStreamTimeCharacteristic(ProcessingTime)

      val stream: DataStream[Event] = env.addSource(…)

      stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")

  34. Ingestion Time: the same program, but with
      env.setStreamTimeCharacteristic(IngestionTime)

  35. Event Time: the same program, but with
      env.setStreamTimeCharacteristic(EventTime)

  36. Event Time, with ascending timestamps taken from the events:

      env.setStreamTimeCharacteristic(EventTime)

      val stream: DataStream[Event] = env.addSource(…)
      val tsStream = stream.assignAscendingTimestamps(_.timestamp)

      tsStream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")

  37. Event Time, with a custom timestamp and watermark generator:

      val tsStream = stream.assignTimestampsAndWatermarks(
        new MyTimestampsAndWatermarkGenerator())

      tsStream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
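      The slides leave MyTimestampsAndWatermarkGenerator undefined. A minimal sketch of what such a generator could look like, assuming the AssignerWithPeriodicWatermarks interface (Flink 1.1-era API), the Event class from slide 23, and an invented 3.5-second out-of-orderness bound:

        import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
        import org.apache.flink.streaming.api.watermark.Watermark

        // Extracts each event's own timestamp and emits a watermark that
        // trails the highest timestamp seen so far by a fixed bound.
        class MyTimestampsAndWatermarkGenerator
            extends AssignerWithPeriodicWatermarks[Event] {

          val maxOutOfOrderness = 3500L // assumed lateness bound, in milliseconds
          var currentMaxTimestamp = 0L

          override def extractTimestamp(event: Event, previousTimestamp: Long): Long = {
            currentMaxTimestamp = math.max(event.timestamp, currentMaxTimestamp)
            event.timestamp
          }

          override def getCurrentWatermark: Watermark =
            new Watermark(currentMaxTimestamp - maxOutOfOrderness)
        }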
  38. Watermarks: a watermark W(t) travels with the stream and asserts that no events with timestamp ≤ t are still outstanding. In an in-order stream the watermark simply trails the event timestamps; in an out-of-order stream it bounds how far behind the already-seen timestamps late events may arrive.
  39. Watermarks in Parallel: each parallel source generates its own watermarks; an operator tracks the event time of each of its input streams, operates at the minimum of them, and emits a new watermark downstream whenever that minimum advances.
  40. Mixing Event Time and Processing Time:

      case class Event(id: String, measure: Double, timestamp: Long)

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setStreamTimeCharacteristic(EventTime)

      val stream: DataStream[Event] = env.addSource(…)
      val tsStream = stream.assignAscendingTimestamps(_.timestamp)

      tsStream
        .keyBy("id")
        .window(SlidingEventTimeWindows.of(seconds(15), seconds(5)))
        .trigger(new MyTrigger())
        .sum("measure")

  41. Window Triggers react to any combination of:
      • event time
      • processing time
      • event data
      Example of a mixed event-time / processing-time trigger: fire when event time reaches the window end, OR when processing time reaches the window end plus 30 seconds.
  42. Trigger example:

      public class EventTimeTrigger extends Trigger<Object, TimeWindow> {

        public TriggerResult onElement(Object evt, long time, TimeWindow window, TriggerContext ctx) {
          ctx.registerEventTimeTimer(window.maxTimestamp());
          ctx.registerProcessingTimeTimer(window.maxTimestamp() + 30000);
          return TriggerResult.CONTINUE;
        }

        public TriggerResult onEventTime(long time, TimeWindow w, TriggerContext ctx) {
          return TriggerResult.FIRE_AND_PURGE;
        }

        public TriggerResult onProcessingTime(long time, TimeWindow w, TriggerContext c) {
          return TriggerResult.FIRE_AND_PURGE;
        }
      }
  44. Per Kafka Partition Watermarks: timestamps and watermarks can be generated inside the Kafka source, per Kafka partition, so watermarks stay correct even when one source subtask reads several partitions.
  45. Per Kafka Partition Watermarks:

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env.setStreamTimeCharacteristic(EventTime)

      val kafka = new FlinkKafkaConsumer09(topic, schema, props)
      kafka.assignTimestampsAndWatermarks(
        new MyTimestampsAndWatermarkGenerator())

      val stream: DataStream[Event] = env.addSource(kafka)

      stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
  46. Matters of State (fault tolerance, reinstatements, etc.)
  47. Back to the Aggregation Example; this program is stateful:

      case class Event(id: String, measure: Double, timestamp: Long)

      val env = StreamExecutionEnvironment.getExecutionEnvironment
      val stream: DataStream[Event] = env.addSource(
        new FlinkKafkaConsumer09(topic, schema, properties))

      stream
        .keyBy("id")
        .timeWindow(Time.seconds(15), Time.seconds(5))
        .sum("measure")
  48. Fault Tolerance:
      • Prevent data loss (reprocess lost in-flight events)
      • Recover state consistency (exactly-once semantics), covering pending windows & user-defined (key/value) state
      • Checkpoint-based fault tolerance: periodically create checkpoints; on recovery, resume from the last completed checkpoint; uses the Async. Barrier Snapshots (ABS) algorithm (a configuration sketch follows below)
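      A minimal sketch of switching checkpoint-based fault tolerance on; the 5-second interval is an illustrative value, not from the slides:

        import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Draw a consistent snapshot of all operator state every 5 seconds
        // (illustrative interval); exactly-once is the default mode.
        env.enableCheckpointing(5000)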
  49. Checkpoints: a checkpoint marks a position in the data stream (older records before it, newer records behind it) together with the state of the dataflow at that point, e.g., the state at point X and the state at point Y.
  50. Checkpoint Barriers: markers, injected into the streams.
  51. Checkpoint Procedure (shown as figures in the original deck).
  53. Savepoints:
      • A "checkpoint" is a globally consistent point-in-time snapshot of the streaming program (point in stream, state)
      • A "savepoint" is a user-triggered, retained checkpoint (e.g., Savepoint A, Savepoint B)
      • Streaming programs can start from a savepoint (the CLI workflow is sketched below)
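      In practice, a savepoint is triggered with the command-line client via bin/flink savepoint <jobId>, which prints the savepoint's path, and a (possibly modified) program is started from it via bin/flink run -s <savepointPath> …; the job id and path here are placeholders.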
  54. (Re)processing data (in batch): re-processing data (what-if exploration, correcting bugs, etc.) is usually done by running a batch job over a set of old files, with tools that map files to times: a collection of files by ingestion time (2016-3-1 12:00 am, 1:00 am, 2:00 am, …, 2016-3-11 10:00 pm, 11:00 pm, 2016-3-12 12:00 am, 1:00 am, …) fed to the batch processor.
  55. Unclear Batch Boundaries: which files belong to which batch? And what about sessions that span batch boundaries?
  56. (Re)processing data (streaming): draw savepoints at times that you will want to start new jobs from (daily, hourly, …); reprocess by starting a new job from a savepoint, which defines the start position in the stream (for example Kafka offsets) and initializes pending state (like partial sessions).
  57. Continuous Data Sources: for a stream of Kafka partitions, a savepoint stores Kafka offsets + operator state; for a stream view over a sequence of files, it stores the file modification timestamp + position in the file + operator state. WIP (target: Flink 1.1).
  58. Upgrading Programs: a program starting from a savepoint can differ from the program that created the savepoint; unique operator names match state to operators (see the sketch below). The mechanism can be used to fix bugs in programs, and to evolve programs, parameters, libraries, …
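      A sketch of stable operator naming in the DataStream API, reusing the aggregation example from slide 23; the name itself is invented, and later Flink versions add uid(…) specifically for matching savepoint state:

        stream
          .keyBy("id")
          .timeWindow(Time.seconds(15), Time.seconds(5))
          .sum("measure")
          .name("measure-window-aggregation") // keep this name stable across program versions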
  59. State Backends: large state is a collection of key/value pairs, and the state backend defines what data structure holds the state, plus how it is snapshotted. Most common choices:
      • Main memory, snapshots to the master
      • Main memory, snapshots to a distributed filesystem
      • RocksDB, snapshots to a distributed filesystem (a configuration sketch follows below)
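      A sketch of selecting the third option, assuming the flink-statebackend-rocksdb (contrib) module of that era; the checkpoint URI is a placeholder:

        import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
        import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Working state lives in RocksDB on local disk; snapshots go to a
        // distributed filesystem (placeholder URI).
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))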
  60. Complex Event Processing Primer
  61. Example: Temperature Monitoring: receiving temperature and power events from sensors, and looking for temperatures repeatedly exceeding thresholds within a short time period (10 secs). A code sketch follows after the next three slides.
  62. Event Types (figure)
  63. Defining Patterns (figure)
  64. Generating Alerts (figure)
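      Slides 62 through 64 are figures in the original deck. A rough reconstruction of the temperature-monitoring pipeline they describe, written against the Flink CEP Scala API (added around Flink 1.1, where a match is a Map from pattern name to event); the event and alert types, field names, and the 100-degree threshold are all invented for illustration:

        import org.apache.flink.cep.scala.CEP
        import org.apache.flink.cep.scala.pattern.Pattern
        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.time.Time

        // Invented event and alert types, standing in for the "Event Types" slide.
        case class TemperatureEvent(rackId: String, temperature: Double)
        case class TemperatureAlert(rackId: String)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val temperatures: DataStream[TemperatureEvent] = env.fromElements(
          TemperatureEvent("rack-1", 101.3),
          TemperatureEvent("rack-1", 102.9))

        // "Defining Patterns": two over-threshold readings within 10 seconds.
        val overheating = Pattern
          .begin[TemperatureEvent]("first")
          .where(_.temperature > 100.0)
          .next("second")
          .where(_.temperature > 100.0)
          .within(Time.seconds(10))

        // "Generating Alerts": one alert per matched pattern, per rack.
        val alerts: DataStream[TemperatureAlert] =
          CEP.pattern(temperatures.keyBy("rackId"), overheating)
            .select(matched => TemperatureAlert(matched("second").rackId))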
  65. An Outlook on Things to Come
  66. Flink in the wild: 30 billion events daily at one user; 2 billion events on 10 1-Gb machines at another; a data integration & distribution platform at a third. (The original slide names these users by logo and points to their conference talks.)
  67. Roadmap:
      • Dynamic scaling, resource elasticity
      • Stream SQL
      • CEP enhancements
      • Incremental & asynchronous state snapshotting
      • Mesos support
      • More connectors, end-to-end exactly-once
      • API enhancements (e.g., joins, slowly changing inputs)
      • Security (data encryption, Kerberos with Kafka)
  68. I stream, do you?
