Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

1,204 views

Published on

Slides for my talk at Hadoop Summit Dublin, April 2016.

The talk motivates how streaming can subsume batch use cases at the example of continuous counting.

Published in: Technology
  • Be the first to comment

Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

  1. 1. Ufuk Celebi
 Hadoop Summit Dublin April 13, 2016 Unified 
 Stream & Batch Processing with Apache Flink
  2. 2. What is Apache Flink? 2 Apache Flink is an open source stream processing framework. • Event Time Handling • State & Fault Tolerance • Low Latency • High Throughput Developed at the Apache Software Foundation.
  3. 3. Recent History 3 April ‘14 December ‘14 v0.5 v0.6 v0.7 March ‘16 Project Incubation Top Level Project v0.8 v0.10 Release 1.0
  4. 4. Flink Stack 4 DataStream API Stream Processing DataSet API Batch Processing Runtime Distributed Streaming Data Flow Libraries Streaming and batch as first class citizens.
  5. 5. Counting 5 Seemingly simple application: Count visitors, ad impressions, etc. But generalizes well to other problems.
  6. 6. Batch Processing 6 All Input Batch 
 Job All Output Hadoop, Spark, Flink
  7. 7. Batch Processing 7 DataSet<ColorEvent> counts = env .readFile("MM-dd.csv") .groupBy("color") .count();
  8. 8. Continuous Counting 8 Time 1h Job 1 Continuous ingestion Periodic files Periodic batch jobs 1h Job 2 1h Job 3
  9. 9. Many Moving Parts 9 Batch Job 1h Serving Layer Periodic job scheduler (e.g. Oozie) Data loading into HDFS
 (e.g. Flume) Batch 
 processor (e.g. Hadoop,
 Spark, Flink)
  10. 10. High Latency 10 Latency from event to serving layer
 usually in the range of hours. Batch Job 1h Serving Layer Schedule every X hours
  11. 11. Implicit Treatment of Time 11 Time is treated outside of your application. Batch Job 1h Serving LayerBatch Job 1h Batch Job 1h
  12. 12. Implicit Treatment of Time 12 DataSet<ColorEvent> counts = env .readFile("MM-dd.csv") .groupBy("color") .count(); Time is implicit in input file
  13. 13. Batch Job Serving Layer Continuously produced Files are 
 finite streams Periodically executed Streaming over Batch 13
  14. 14. Streaming 14 Until now, stream processors were less mature
 than their batch counterparts. This led to: • in-house solutions, • abuse of batch processors, • Lambda architectures This is no longer needed with new generation 
 stream processors like Flink.
  15. 15. Streaming All the Way 15 Streaming Job Serving Layer Message Queue
 (e.g. Apache Kafka) Durability and Replay Stream Processor
 (e.g. Apache Flink) Consistent Processing
  16. 16. Building Blocks of Flink 16 Explicit Handling
 of Time State & Fault Tolerance Performance
  17. 17. Windowing 17 Time Aggregates on streams are scoped by windows Time-driven Data-driven e.g. last X minutes e.g. last X records
  18. 18. Tumbling Windows (No Overlap) 18 Time e.g.“Count over the last 5 minutes”, 
 “Average over the last 100 records”
  19. 19. Sliding Windows (with Overlap) 19 Time e.g. “Count over the last 5 minutes, updated each minute.”,
 
 “Average over the last 100 elements, updated every 10 elements”
  20. 20. Explicit Handling of Time 20 DataStream<ColorEvent> counts = env .addSource(new KafkaConsumer(…)) .keyBy("color") .timeWindow(Time.minutes(60)) .apply(new CountPerWindow()); Time is explicit in your program
  21. 21. Session Windows 21 Time Sessions close after period of inactivity. Inactivity Inactivity e.g. “Count activity from login until time-out or logout.”
  22. 22. Session Windows 22 DataStream<ColorEvent> counts = env .addSource(new KafkaConsumer(…)) .keyBy("color") .window(EventTimeSessionWindows .withGap(Time.minutes(10)) .apply(new CountPerWindow());
  23. 23. Notions of Time 23 12:23 am Event Time 1:37 pm Processing Time Time measured by system clock Time when event happened.
  24. 24. 1977 1980 1983 1999 2002 2005 2015 Processing Time Episode
 IV Episode
 V Episode
 VI Episode
 I Episode
 II Episode
 III Episode
 VII Event Time Out of Order Events 24
  25. 25. Out of Order Events 25 1st burst of events 2nd burst of events Event Time Windows Processing Time Windows
  26. 26. Notions of Time 26 env.setStreamTimeCharacteristic( TimeCharacteristic.EventTime);
 DataStream<ColorEvent> counts = env ... .timeWindow(Time.minutes(60)) .apply(new CountPerWindow());
  27. 27. Explicit Handling of Time 27 1. Expressive windowing 2. Accurate results for out of order data 3. Deterministic results
  28. 28. Stateful Streaming 28 Stateless Stream
 Processing Stateful Stream
 Processing Op Op State
  29. 29. Processing Semantics 29 At-least once May over-count after failure Exactly Once Correct counts after failures End-to-end exactly once Correct counts in external system (e.g. DB, file system) after failure
  30. 30. Processing Semantics 30 • Flink guarantees exactly once (can be configured
 for at-least once if desired) • End-to-end exactly once with specific sources
 and sinks (e.g. Kafka -> Flink -> HDFS) • Internally, Flink periodically takes consistent
 snapshots of the state without ever stopping
 computation
  31. 31. Yahoo! Benchmark 31 • Storm 0.10, Spark Streaming 1.5, and Flink 0.10
 benchmark by Storm team at Yahoo! • Focus on measuring end-to-end latency 
 at low throughputs (~ 200k events/sec) • First benchmark modelled after a real application https://yahooeng.tumblr.com/post/135321837876/
 benchmarking-streaming-computation-engines-at
  32. 32. Yahoo! Benchmark 32 • Count ad impressions grouped by campaign • Compute aggregates over last 10 seconds • Make aggregates available for queries in Redis
  33. 33. 99th Percentile
 Latency (sec) 9 8 2 1 Storm 0.10 Flink 0.10 60 80 100 120 140 160 180 Throughput
 (1000 events/sec) Spark Streaming 1.5 Spark latency increases
 with throughput Storm and Flink at
 low latencies Latency (Lower is Better) 33
  34. 34. Extending the Benchmark 34 • Great starting point, but benchmark stops at 
 low write throughput and programs are not
 fault-tolerant • Extend benchmark to high volumes and 
 Flink’s built-in fault-tolerant state http://data-artisans.com/extending-the-yahoo-streaming- benchmark/
  35. 35. Extending the Benchmark 35 Use Flink’s internal state
  36. 36. Throughput (Higher is Better) 36 5.000.000 10.000.000 15.000.000 Maximum Throughput (events/sec) 0 Flink w/o Kafka Flink w/ Kafka Storm w/ Kafka Limited by bandwidth between
 Kafka and Flink cluster
  37. 37. Summary 37 • Stream processing is gaining momentum, the right paradigm for continuous data applications • Choice of framework is crucial – even seemingly simple applications become complex at scale and in production • Flink offers unique combination of efficiency, consistency and event time handling
  38. 38. Libraries 38 DataStream API Stream Processing DataSet API Batch Processing Runtime Distributed Streaming Data Flow Libraries
 Complex Event Processing (CEP), ML, Graphs
  39. 39. 39 Pattern<MonitoringEvent, ?> warningPattern = 
 Pattern.<MonitoringEvent>begin("First Event") .subtype(TemperatureEvent.class) .where(evt -> evt.getTemperature() >= THRESHOLD) .next("Second Event") .subtype(TemperatureEvent.class) .where(evt -> evt.getTemperature() >= THRESHOLD) .within(Time.seconds(10)); Complex Event Processing (CEP)
  40. 40. Upcoming Features 40 • SQL: ongoing work in collaboration with Apache Calcite • Dynamic Scaling: adapt resources to stream volume, scale up for historical stream processing • Queryable State: query the state inside the stream processor

  41. 41. SQL 41 SELECT STREAM * FROM Orders WHERE units > 3; rowtime | productId | orderId | units ----------+-----------+---------+------- 10:17:00 | 30 | 5 | 4 10:18:07 | 30 | 8 | 20 11:02:00 | 10 | 9 | 6 11:09:30 | 40 | 11 | 12 11:24:11 | 10 | 12 | 4
 … | … | … | …
  42. 42. key­value states have to be redistributed when rescaling a Flink job. Distributing the key­value  states coherently with the job’s new partitioning will lead to a consistent state.      Dynamic Scaling 42
  43. 43. Queryable State 43 Query Flink directly
  44. 44. Join the Community 44 Read http://flink.apache.org/blog http://data-artisans.com/blog Follow
 @ApacheFlink
 @dataArtisans Subscribe (news | user | dev)@flink.apache.org

×