
Continuous Processing with Apache Flink - Strata London 2016


Talk from the Strata & Hadoop World conference in London, 2016: Apache Flink and continuous processing.

The talk discusses some of the shortcomings of building continuous applications via batch processing, and how a stream processing architecture naturally solves many of these issues.



  1. Stephan Ewen (@stephanewen): Continuous Processing with Apache Flink
  2. Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
  3. Continuous Apps before Streaming: [Diagram: over time, a scheduler runs periodic batch jobs (Job 1, Job 2, Job 3) over newly arrived files (file 1, file 2, file 3), feeding a serving layer.]
  4. Continuous Apps with Lambda: [Diagram: the Lambda architecture runs scheduler-driven batch jobs over files plus a parallel streaming job, both feeding the serving layer.]
  5. Continuous Apps with Streaming: [Diagram: a single continuous pipeline: collect, log, analyze, serve & store.]
  6. Continuous Data Sources: [Diagram: a partitioned log supports processing a period of historic data, processing the latest data with low latency (the tail of the log), and reprocessing the stream (historic data first, catching up with real-time data).]
  7. Continuous Data Sources: [Diagram: the same stream viewed two ways: as a stream of events in Apache Kafka partitions, and as a stream view over a sequence of hourly files (2016-3-1 12:00 am through 2016-3-12 3:00 am).]
  8. Continuous Processing: Time and State
  9. Enter Apache Flink
  10. Apache Flink Stack: [Diagram: libraries sit on top of the DataStream API (stream processing) and the DataSet API (batch processing), which share a common runtime, a distributed streaming dataflow.] Streaming and batch as first-class citizens.
  11. Programs and Dataflows:

        val lines: DataStream[String] =
          env.addSource(new FlinkKafkaConsumer09(…))

        val events: DataStream[Event] = lines.map((line) => parse(line))

        val stats: DataStream[Statistic] = events
          .keyBy("sensor")
          .timeWindow(Time.seconds(5))
          .apply(new MyAggregationFunction())

        stats.addSink(new RollingSink(path))

      [Diagram: the resulting streaming dataflow: Source [1],[2] → map() [1],[2] → keyBy()/window()/apply() [1],[2] → Sink [1].]
  12. What makes Flink flink? • True streaming: low latency, high throughput, well-behaved flow control (back pressure) • Make more sense of data: works on real-time and historic data, event time, APIs, libraries • Stateful streaming: globally consistent savepoints, exactly-once semantics for fault tolerance, windows & user-defined state, flexible windows (time, count, session, roll-your-own), complex event processing
  13. (It's) About Time
  14. Different Notions of Time: [Diagram: an event flows from Event Producer → Message Queue → Flink Data Source → Flink Window Operator. Event time is when the event was produced, storage ingestion time is when it enters the message queue, stream processor ingestion time is when it enters the Flink data source, and window processing time is when the window operator processes it.]
  15. Event Time vs. Processing Time: [Diagram: the Star Wars films by release year (processing time: 1977, 1980, 1983, 1999, 2002, 2005, 2015) arrive as Episodes IV, V, VI, I, II, III, VII, out of order with respect to event time (episode number).]
  16. Batch: Implicit Treatment of Time: time is treated outside of your application, and data is grouped by storage ingestion time. [Diagram: a chain of hourly (1h) batch jobs feeding a serving layer.]
  17. Streaming: Windows: aggregates on streams are scoped by windows, which can be time-driven (e.g., the last X minutes) or data-driven (e.g., the last X records).
  18. Streaming: Windows: [Diagram: a sliding window over the stream, e.g. an "average over the last 5 minutes".] A sketch of both window types follows below.
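      A minimal sketch of the two window types in Flink's Scala API (it reuses the Event stream defined on the slides that follow; the 5-minute and 100-record sizes are illustrative assumptions):

        import org.apache.flink.streaming.api.scala._
        import org.apache.flink.streaming.api.windowing.time.Time

        // Time-driven: aggregate the last 5 minutes of events per key.
        stream
          .keyBy("id")
          .timeWindow(Time.minutes(5))
          .sum("measure")

        // Data-driven: aggregate the last 100 records per key.
        stream
          .keyBy("id")
          .countWindow(100)
          .sum("measure")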
  19. Event Time Windows: [Diagram: event time windows reorder the events into their event-time order.]
  20. Processing Time:

        case class Event(id: String, measure: Double, timestamp: Long)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(ProcessingTime)

        val stream: DataStream[Event] = env.addSource(…)

        stream
          .keyBy("id")
          .timeWindow(Time.seconds(15), Time.seconds(5))
          .sum("measure")

  21. Ingestion Time: the same program, with the time characteristic switched:

        case class Event(id: String, measure: Double, timestamp: Long)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(IngestionTime)

        val stream: DataStream[Event] = env.addSource(…)

        stream
          .keyBy("id")
          .timeWindow(Time.seconds(15), Time.seconds(5))
          .sum("measure")

  22. Event Time: additionally requires timestamps and watermarks:

        case class Event(id: String, measure: Double, timestamp: Long)

        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(EventTime)

        val stream: DataStream[Event] = env.addSource(…)

        val tsStream = stream.assignTimestampsAndWatermarks(
          new MyTimestampsAndWatermarkGenerator())

        tsStream
          .keyBy("id")
          .timeWindow(Time.seconds(15), Time.seconds(5))
          .sum("measure")
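      The slides leave MyTimestampsAndWatermarkGenerator undefined. A hedged sketch of what it might look like, using Flink's AssignerWithPeriodicWatermarks interface (the 10-second out-of-orderness bound is an assumption):

        import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
        import org.apache.flink.streaming.api.watermark.Watermark

        // Periodic watermarks that trail the highest timestamp seen so far
        // by a fixed bound on out-of-orderness.
        class MyTimestampsAndWatermarkGenerator
            extends AssignerWithPeriodicWatermarks[Event] {

          val maxOutOfOrderness = 10000L // 10 seconds (assumed bound)
          var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness

          override def extractTimestamp(event: Event, previousTimestamp: Long): Long = {
            currentMaxTimestamp = math.max(event.timestamp, currentMaxTimestamp)
            event.timestamp
          }

          // The watermark asserts that no events with smaller timestamps
          // are still expected.
          override def getCurrentWatermark(): Watermark =
            new Watermark(currentMaxTimestamp - maxOutOfOrderness)
        }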
  23. The Power of Event Time:
      • Batch processors (event time in ingestion-time batches): stable across re-executions, but wrong grouping at batch boundaries.
      • Traditional stream processors (processing time): results depend on when the program runs (different on re-execution) and are affected by network speed and delays.
      • Event-time stream processors (event time): stable across re-executions, and no incorrect results at batch boundaries.
  24. The Power of Event Time: the same comparison, annotated. Ingestion-time batches mix data-driven and wall-clock time, processing time is purely wall-clock time, and event time is purely data-driven time.
  25. Event Time Progress: Watermarks: [Diagram: two streams of timestamped events with interleaved watermarks W(t), for an in-order stream and an out-of-order stream. A watermark W(t) asserts that no events with timestamp ≤ t are still to come.]
  26. Bounding the Latency for Results: trigger on combinations of event time and processing time. See previous talks by Tyler Akidau & Kenneth Knowles on Apache Beam (incubating); the concepts apply almost 1:1 to Apache Flink, only the syntax varies. A sketch follows below.
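      One hedged way to express such a combination in Flink's Scala API (building on tsStream from slide 22; the window size and the 10-second firing interval are assumptions):

        import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
        import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
        import org.apache.flink.streaming.api.windowing.time.Time

        // An event-time window whose trigger fires periodically on
        // processing time, so partial results are emitted at bounded
        // wall-clock intervals even while watermarks lag behind.
        tsStream
          .keyBy("id")
          .window(TumblingEventTimeWindows.of(Time.minutes(1)))
          .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(10)))
          .sum("measure")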
  27. Matters of State
  28. Batch vs. Continuous: batch jobs have no state across batches, fault tolerance only within a job, and reprocessing starts empty; continuous programs have continuous state across time, fault tolerance that guards the state, and reprocessing starts stateful.
  29. Continuous State: [Diagram: sessions stretch over time, so there is no stateless point in time.] A session-window sketch follows below.
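      Sessions are the canonical example of such state. A hedged sketch with Flink's session window assigner (EventTimeSessionWindows ships with later Flink 1.x releases; the 10-minute gap and the reuse of tsStream are assumptions):

        import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
        import org.apache.flink.streaming.api.windowing.time.Time

        // A session closes after 10 minutes of event-time inactivity per key.
        // A session that is still open when the job stops is exactly the
        // "partial state" a snapshot must carry over.
        tsStream
          .keyBy("id")
          .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
          .sum("measure")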
  30. Re-processing data (in batch): [Diagram: a sequence of hourly files from 2016-3-1 12:00 am to 2016-3-1 7:00 am, processed batch by batch.]
  31. Re-processing data (in batch): [Diagram: the same hourly files, annotated: state that spans batch boundaries leads to wrong / corrupt results.]
  32. Streaming: Savepoints: a savepoint is a globally consistent point-in-time snapshot of the streaming application. [Diagram: savepoints A and B taken along the stream.]
  33. Re-processing data (continuous): [Diagram: reprocessing restarts from Savepoint A.]
  34. Re-processing data (continuous): draw savepoints at times that you will want to start new jobs from (daily, hourly, …), then reprocess by starting a new job from a savepoint. The savepoint defines the start position in the stream (for example, Kafka offsets) and initializes the pending state (like partial sessions). [Diagram: run a new streaming program from a savepoint.]
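      Operationally this maps to Flink's command-line client: bin/flink savepoint <jobID> triggers a savepoint for a running job, and bin/flink run -s <savepointPath> <jar> starts a new job from it (flag names as of the Flink 1.x CLI; an assumption worth checking against the docs for your version).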
  35. Forking and Versioning Applications: [Diagram: savepoints let applications A, B, and C be forked and versioned from one another at different points in the stream.]
  36. Conclusion
  37. Wrap up: streaming is the architecture for continuous processing. Continuous processing makes data applications simpler (fewer moving parts), more correct (no broken state at any boundaries), and more flexible (reprocess data and fork applications via savepoints). It requires a powerful stream processor, like Apache Flink.
  38. Upcoming Features: dynamic scaling and resource elasticity • Stream SQL • CEP enhancements • incremental & asynchronous state snapshotting • Mesos support • more connectors, end-to-end exactly-once • API enhancements (e.g., joins, slowly changing inputs) • security (data encryption, Kerberos with Kafka)
  39. What makes Flink flink? A recap of slide 12: true streaming (low latency, high throughput, well-behaved back pressure), making more sense of data (real-time and historic data, event time, APIs, libraries), and stateful streaming (globally consistent savepoints, exactly-once fault tolerance, flexible windows: time, count, session, roll-your-own, user-defined state, complex event processing).
  40. Flink Forward 2016, Berlin. Submission deadline: June 30, 2016. Early-bird deadline: July 15, 2016. www.flink-forward.org
  41. We are hiring! data-artisans.com/careers
