
Unified Stream and Batch Processing with Apache Flink


  1. Ufuk Celebi — Hadoop Summit Dublin, April 13, 2016. Unified Stream & Batch Processing with Apache Flink.
  2. What is Apache Flink? Apache Flink is an open source stream processing framework developed at the Apache Software Foundation. • Event time handling • State & fault tolerance • Low latency • High throughput
  3. Recent History: April '14 project incubation; December '14 top-level project; releases v0.5 through v0.10; March '16 release 1.0.
  4–5. Flink Stack: libraries on top of the DataStream API (stream processing) and the DataSet API (batch processing), both running on a common runtime (distributed streaming dataflow). Streaming and batch as first-class citizens.
  6. Counting: a seemingly simple application — count visitors, ad impressions, etc. — but one that generalizes well to other problems.
  7. Batch Processing: all input → batch job → all output (Hadoop, Spark, Flink).
  8. Batch Processing:
     DataSet<ColorEvent> counts = env
         .readFile("MM-dd.csv")
         .groupBy("color")
         .count();
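The count-by-color logic on this slide can be sketched in plain Java without any Flink dependency. This is an illustrative stand-in, not Flink's implementation; the `ColorEvent` type from the slide is reduced to a plain color string.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchCount {
    // Batch equivalent of groupBy("color").count(): one pass over all input.
    static Map<String, Long> countByColor(List<String> colors) {
        Map<String, Long> counts = new HashMap<>();
        for (String color : colors) {
            counts.merge(color, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> events = List.of("red", "blue", "red", "green", "red");
        System.out.println(countByColor(events).get("red")); // 3
    }
}
```

The key property of the batch version is that it sees *all* input before emitting output — exactly what the following slides contrast with continuous counting.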
  9–11. Continuous Counting: continuous ingestion is cut into periodic files, and a periodic batch job (Job 1, Job 2, Job 3, …) runs over each 1h slice.
  12–15. Many Moving Parts: the hourly batch job plus a serving layer also require data loading into HDFS (e.g. Flume), a periodic job scheduler (e.g. Oozie), and a batch processor (e.g. Hadoop, Spark, Flink).
  16. High Latency: with a batch job scheduled every X hours, latency from event to serving layer is usually in the range of hours.
  17–19. Implicit Treatment of Time: time is treated outside of your application — each 1h batch job covers a fixed slice of time before results reach the serving layer.
  20. Implicit Treatment of Time: time is implicit in the input file.
     DataSet<ColorEvent> counts = env
         .readFile("MM-dd.csv")
         .groupBy("color")
         .count();
  21. Streaming over Batch: events are continuously produced; files are finite streams; batch jobs are periodically executed over them.
  22. Streaming: until now, stream processors were less mature than their batch counterparts. This led to in-house solutions, abuse of batch processors, and Lambda architectures. None of that is needed with new-generation stream processors like Flink.
  23–25. Streaming All the Way: a message queue (e.g. Apache Kafka) provides durability and replay; a stream processor (e.g. Apache Flink) provides consistent processing; results feed the serving layer.
  26. Building Blocks of Flink: explicit handling of time, state & fault tolerance, and performance.
  27. First building block: explicit handling of time.
  28. Windowing: aggregates on streams are scoped by windows — time-driven (e.g. last X minutes) or data-driven (e.g. last X records).
  29–33. Tumbling Windows (no overlap): e.g. “Count over the last 5 minutes”, “Average over the last 100 records”.
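Tumbling window assignment can be sketched in a few lines: time is partitioned into fixed, non-overlapping buckets, so each timestamp falls into exactly one window. A minimal sketch, not Flink's actual assigner:

```java
public class TumblingWindows {
    // A tumbling window of length `size` covers [start, start + size);
    // every timestamp belongs to exactly one such window.
    static long windowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L; // window size in milliseconds
        // An event at minute 7 lands in the window starting at minute 5.
        System.out.println(windowStart(7 * 60 * 1000L, fiveMinutes)); // 300000
    }
}
```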
  34–38. Sliding Windows (with overlap): e.g. “Count over the last 5 minutes, updated each minute”, “Average over the last 100 elements, updated every 10 elements”.
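Because sliding windows overlap, one element belongs to several windows at once. The sketch below (an illustration, not Flink's assigner) computes the start timestamps of all windows containing a given element:

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindows {
    // Start timestamps of all windows of length `size`, advancing by
    // `slide`, that contain `timestamp` (assumes size >= slide).
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 5-minute windows sliding every minute: each element is in 5 windows.
        System.out.println(windowStarts(7 * 60_000L, 5 * 60_000L, 60_000L).size()); // 5
    }
}
```

With size equal to slide this degenerates to exactly one window per element, i.e. the tumbling case.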
  39. Explicit Handling of Time: time is explicit in your program.
     DataStream<ColorEvent> counts = env
         .addSource(new KafkaConsumer(…))
         .keyBy("color")
         .timeWindow(Time.minutes(60))
         .apply(new CountPerWindow());
  40. Session Windows: sessions close after a period of inactivity — e.g. “Count activity from login until time-out or logout.”
  41. Session Windows:
     DataStream<ColorEvent> counts = env
         .addSource(new KafkaConsumer(…))
         .keyBy("color")
         .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
         .apply(new CountPerWindow());
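The gap-based session logic can be illustrated without Flink: walk a sorted list of timestamps and start a new session whenever the gap to the previous event exceeds the threshold. A simplified sketch (real session windows also merge out-of-order elements):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindows {
    // Split an ascending list of event timestamps into sessions:
    // a new session starts when the gap to the previous event exceeds `gap`.
    static List<List<Long>> sessions(List<Long> timestamps, long gap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : timestamps) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gap) {
                result.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        // 10-minute gap (in ms): events at minutes 1, 2, and 25 -> two sessions.
        List<Long> events = List.of(60_000L, 120_000L, 1_500_000L);
        System.out.println(sessions(events, 600_000L).size()); // 2
    }
}
```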
  42–43. Notions of Time: event time (e.g. 12:23 am) is the time when the event happened; processing time (e.g. 1:37 pm) is the time measured by the system clock.
  44. Out of Order Events: the Star Wars saga — ordered by processing time (release years 1977, 1980, 1983, 1999, 2002, 2005, 2015) the order is Episodes IV, V, VI, I, II, III, VII; ordered by event time (story order) it is Episodes I through VII.
  45. Out of Order Events: given a 1st and a 2nd burst of events, event-time windows group them by when they happened, while processing-time windows group them by when they arrived.
  46. Notions of Time:
     env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
     DataStream<ColorEvent> counts = env
         ...
         .timeWindow(Time.minutes(60))
         .apply(new CountPerWindow());
  47. Explicit Handling of Time gives you: 1. expressive windowing, 2. accurate results for out-of-order data, 3. deterministic results.
  48. Second building block: state & fault tolerance.
  49. Stateful Streaming: stateless stream processing (operator only) vs. stateful stream processing (operator plus state).
  50. Processing Semantics: at-least-once may over-count after failure; exactly-once gives correct counts after failures; end-to-end exactly-once gives correct counts in an external system (e.g. DB, file system) after failure.
  51. Processing Semantics: Flink guarantees exactly-once (configurable to at-least-once if desired); end-to-end exactly-once with specific sources and sinks (e.g. Kafka → Flink → HDFS); internally, Flink periodically takes consistent snapshots of the state without ever stopping computation.
  52. Third building block: performance.
  53. Yahoo! Benchmark: Storm 0.10, Spark Streaming 1.5, and Flink 0.10, benchmarked by the Storm team at Yahoo!; focus on measuring end-to-end latency at low throughputs (~200k events/sec); the first benchmark modeled after a real application. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
  54. Yahoo! Benchmark: count ad impressions grouped by campaign; compute aggregates over the last 10 seconds; make aggregates available for queries in Redis.
  55–56. Latency (lower is better): 99th-percentile latency in seconds vs. throughput in thousands of events/sec (60–180). Storm 0.10 and Flink 0.10 stay at low latencies across the range, while Spark Streaming 1.5's latency increases with throughput.
  57. Extending the Benchmark: a great starting point, but the benchmark stops at low write throughput and the programs are not fault-tolerant; extend it to high volumes and to Flink's built-in fault-tolerant state. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  58. Extending the Benchmark: use Flink's internal state.
  59–61. Throughput (higher is better): maximum throughput in events/sec (axis 0–15,000,000) for Storm w/ Kafka, Flink w/ Kafka, and Flink w/o Kafka; the Flink w/ Kafka result is limited by the network bandwidth between the Kafka and Flink clusters.
  62. Summary: stream processing is gaining momentum and is the right paradigm for continuous data applications; the choice of framework is crucial, since even seemingly simple applications become complex at scale and in production; Flink offers a unique combination of efficiency, consistency, and event time handling.
  63. Libraries: Complex Event Processing (CEP), ML, and graphs, on top of the DataStream and DataSet APIs and the common runtime.
  64. Complex Event Processing (CEP):
     Pattern<MonitoringEvent, ?> warningPattern = Pattern
         .<MonitoringEvent>begin("First Event")
         .subtype(TemperatureEvent.class)
         .where(evt -> evt.getTemperature() >= THRESHOLD)
         .next("Second Event")
         .subtype(TemperatureEvent.class)
         .where(evt -> evt.getTemperature() >= THRESHOLD)
         .within(Time.seconds(10));
  65. Upcoming Features: SQL (ongoing work in collaboration with Apache Calcite); dynamic scaling (adapt resources to stream volume, scale up for historical stream processing); queryable state (query the state inside the stream processor).

  66. SQL:
     SELECT STREAM * FROM Orders WHERE units > 3;

     rowtime  | productId | orderId | units
     ---------+-----------+---------+-------
     10:17:00 |        30 |       5 |     4
     10:18:07 |        30 |       8 |    20
     11:02:00 |        10 |       9 |     6
     11:09:30 |        40 |      11 |    12
     11:24:11 |        10 |      12 |     4
     …        |         … |       … |     …
  67. Dynamic Scaling: key-value states have to be redistributed when rescaling a Flink job; distributing the key-value states coherently with the job's new partitioning leads to a consistent state.
  68. Queryable State: query Flink directly.
  69. Join the Community: read http://flink.apache.org/blog and http://data-artisans.com/blog; follow @ApacheFlink and @dataArtisans; subscribe to (news | user | dev)@flink.apache.org.
