LWA 2015: The Apache Flink Platform for Parallel Batch and Stream Analysis

This is our presentation for the German paper "Die Apache Flink Plattform zur parallelen Analyse von Datenströmen und Stapeldaten", which was published in the Proceedings of the LWA 2015 Workshops: KDML, FGWM, IR, and FGDB (Trier, Germany, 7-9 October 2015). Link: http://ceur-ws.org/Vol-1458/H02_CRC79_Traub.pdf

  1. Technische Universität Berlin, DIMA – Databases and Information Management Group
     The Apache Flink Platform for Parallel Batch and Stream Analysis
     Jonas Traub | Tilmann Rabl | Fabian Hueske | Till Rohrmann | Volker Markl
  2. In this talk
     • Apache Flink Primer
       - Architecture
       - Execution Engine
       - API Examples
     • Stream Processing with Apache Flink
       - Micro Batching vs. Native Streaming
       - Flexible Windows/Stream Discretization
       - Fault Tolerance with distributed snapshotting
     • Conclusion
     Technische Universität Berlin - The Apache Flink Platform for Parallel Batch and Stream Analysis - FGDB 2015
  3. Apache Flink Primer
  4. What is Flink?
     A platform for distributed batch and streaming analytics
     Streaming dataflow runtime
  5. Flink in the Analytics Ecosystem
     • Applications: Hive, Mahout, Cascading, Pig, Crunch, Giraph, …
     • Data processing engines: MapReduce, Flink, Spark, Storm, Tez
     • App and resource management: Yarn, Mesos
     • Storage, streams: HDFS, HBase, Kafka
  6. What can I do with it?
     An engine that can natively support all these workloads:
     • Stream processing
     • Batch processing
     • Machine Learning at scale
     • Graph Analysis
  7. Sneak peek: Two of Flink’s APIs
         case class Word (word: String, frequency: Int)

     DataStream API (streaming):
         val lines: DataStream[String] = env.fromSocketStream(...)
         lines.flatMap {line => line.split(" ")
             .map(word => Word(word,1))}
           .keyBy("word")
           .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
           .sum("frequency")
           .print()

     DataSet API (batch):
         val lines: DataSet[String] = env.readTextFile(...)
         lines.flatMap {line => line.split(" ")
             .map(word => Word(word,1))}
           .groupBy("word").sum("frequency")
           .print()
  8. Execution Model
     • Flink program = DAG* of operators and intermediate results
     • Operator = computation + state
     • Intermediate result = logical stream of records
     Diagram: operators map, join, sum
  9. Architecture
     • Pipelined/Streaming engine
       - Complete DAG deployed
     Diagram: Job Manager with Worker 1, Worker 2, Worker 3, Worker 4
  10. Flink Stream Processing
  11. Ingredients of a Streaming System
      • Pipelined Execution Engine
      • Streaming Windows/Discretization
      • Fault Tolerance
      • High Level Programming API (or language)
  12. Micro Batching vs Native Streaming
      Discretized Streams (D-Streams): stream discretizer → Job, Job, Job, Job
          while (true) {
            // get next few records
            // issue batch computation
          }
  13. Micro Batching vs Native Streaming
      Discretized Streams (D-Streams): stream discretizer → Job, Job, Job, Job
          while (true) {
            // get next few records
            // issue batch computation
          }
      Native streaming: long-standing operators
          while (true) {
            // process next record
          }
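
The two loops on the slide can be made concrete with a small sketch. This is plain Java over an in-memory iterator, not Flink or Spark code; the class and method names are made up for illustration, and summing integers stands in for the actual computation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: hypothetical names, no Flink or Spark API involved.
final class StreamingStyles {

    // Micro-batching: repeatedly take up to batchSize records,
    // then run one self-contained batch computation (here: a sum) on them.
    static List<Integer> microBatchSums(Iterator<Integer> source, int batchSize) {
        List<Integer> results = new ArrayList<>();
        while (source.hasNext()) {
            int sum = 0;
            for (int taken = 0; taken < batchSize && source.hasNext(); taken++) {
                sum += source.next();   // "get next few records"
            }
            results.add(sum);           // "issue batch computation"
        }
        return results;
    }

    // Native streaming: one long-standing operator keeps its state
    // and updates it for every single record as it arrives.
    static List<Integer> runningSums(Iterator<Integer> source) {
        List<Integer> results = new ArrayList<>();
        int state = 0;                  // operator state survives across records
        while (source.hasNext()) {
            state += source.next();     // "process next record"
            results.add(state);
        }
        return results;
    }
}
```

In the micro-batch variant a result only appears once a batch is complete, while the native variant can emit an updated result after every record.
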
  14. Stream Discretization
      • Data is unbounded
        - Interested in a (recent) part of it, e.g. the last 10 days
      • Most common windows around: time and count
        - Mostly in sliding, fixed, and tumbling form
      • Need for data-driven window definitions
        - e.g., user sessions (periods of user activity followed by inactivity), price changes, etc.
      Great read: "The world beyond batch: Streaming 101", Tyler Akidau, https://beta.oreilly.com/ideas/the-world-beyond-batch-streaming-101
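
As an example of a data-driven window, user sessions can be cut wherever the gap between two consecutive events exceeds an inactivity threshold. The following is a minimal sketch in plain Java, assuming events arrive as ascending timestamps; `SessionWindows`, `sessions`, and `maxGap` are illustrative names, not part of any Flink API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hypothetical helper, not a Flink class.
final class SessionWindows {

    // Splits ascending timestamps into sessions. A new session starts
    // whenever an event arrives more than maxGap after the previous one.
    static List<List<Long>> sessions(List<Long> timestamps, long maxGap) {
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : timestamps) {
            boolean gapExceeded = !current.isEmpty()
                    && ts - current.get(current.size() - 1) > maxGap;
            if (gapExceeded) {
                result.add(current);            // close the finished session
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);                // close the last open session
        }
        return result;
    }
}
```

Unlike a time or count window, the window boundaries here depend on the data itself: no fixed size or slide would produce the same grouping.
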
  15. Flink’s Discretization
      • Allows very flexible windowing
      • Borrows and extends ideas from IBM’s SPL
        - SLIDE = Trigger = When to emit a window
        - RANGE = Eviction = What the window contains
      • Allows for lots of optimization
        - Not part of this talk
  16. The Discretizer Operator
      • Streams are represented as FIFO queues of data items
      • The window operator keeps a FIFO buffer
      • After some time, data items expire (they are deleted)
  17. The Discretizer Operator
      The window operator is event driven by data-item arrivals
  18. The Discretizer Operator
      The window operator is event driven by data-item arrivals
      1.) Trigger Policies (TPs): Specify when to emit the current buffer content as a window.
      2.) Eviction Policies (EPs): Specify when data-items are removed from the buffer.
  19. The Discretizer Operator
      1.) Trigger Policies (TPs): Specify when to emit the current buffer content as a window.
      2.) Eviction Policies (EPs): Specify when data-items are removed from the buffer.
      Query Example (window of size 3):
          dataStream.window(Count.of(3))
  24. Flexible Windowing
      • Windows can be any combination of (multiple) triggers & evictions
        - Arbitrary tumbling, sliding, session, etc. windows can be constructed.
      • Common triggers/evictions part of the API
        - Time, Count & Delta.
      • Even more flexibility: define your own UDF trigger/eviction
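
To illustrate how a trigger and an eviction policy combine into different window types, here is a minimal sketch of a buffer-based window operator in plain Java. All type and method names are hypothetical and do not mirror Flink's interfaces. It also assumes one particular order of steps on each arrival (add the item, evict, then check the trigger), which is a simplification chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; all names are hypothetical, not Flink's API.
final class WindowSketch {

    /** Decides, per arriving item, whether the buffer is emitted as a window. */
    interface TriggerPolicy<T> {
        boolean fires(T item);
    }

    /** Decides, per arriving item, how many of the oldest items are dropped. */
    interface EvictionPolicy<T> {
        int itemsToEvict(T item, int bufferSize);
    }

    /** Fires on every n-th item (the SLIDE of a count-based window). */
    static final class CountTrigger<T> implements TriggerPolicy<T> {
        private final int every;
        private int seen = 0;
        CountTrigger(int every) { this.every = every; }
        @Override public boolean fires(T item) {
            seen++;
            if (seen < every) return false;
            seen = 0;
            return true;
        }
    }

    /** Keeps at most the latest n items (the RANGE of a count-based window). */
    static final class CountEviction<T> implements EvictionPolicy<T> {
        private final int range;
        CountEviction(int range) { this.range = range; }
        @Override public int itemsToEvict(T item, int bufferSize) {
            return Math.max(0, bufferSize - range);
        }
    }

    /** Event-driven window operator: all work happens on item arrival. */
    static final class Discretizer<T> {
        private final Deque<T> buffer = new ArrayDeque<>();
        private final List<List<T>> emitted = new ArrayList<>();
        private final TriggerPolicy<T> trigger;
        private final EvictionPolicy<T> eviction;

        Discretizer(TriggerPolicy<T> trigger, EvictionPolicy<T> eviction) {
            this.trigger = trigger;
            this.eviction = eviction;
        }

        void onItem(T item) {
            buffer.addLast(item);                              // FIFO buffer
            int evict = eviction.itemsToEvict(item, buffer.size());
            for (int i = 0; i < evict; i++) {
                buffer.removeFirst();                          // oldest items expire
            }
            if (trigger.fires(item)) {
                emitted.add(new ArrayList<>(buffer));          // emit buffer as a window
            }
        }

        List<List<T>> emitted() { return emitted; }
    }
}
```

With a trigger of 2 and an eviction range of 3, feeding the items 1 to 6 yields the sliding windows [1, 2], [2, 3, 4], [4, 5, 6]. Setting both to 3 yields the tumbling windows [1, 2, 3] and [4, 5, 6].
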
  25. Fault Tolerance and Operator State
  26. Comparing Fault Tolerance Solutions
      • Based on consistent global snapshots
      • Algorithm inspired by Chandy-Lamport
      • Low runtime overhead
      • Stateful exactly-once semantics
      Other approaches: message tracking/acks (at-least-once guarantee), RDD re-computation
  27. Example: A Stateful Map (counter)
          public class Counter implements MapFunction<Long, Long>, Checkpointed<Long> {

            // persistent counter
            private long counter = 0;

            @Override
            public Long map(Long value) {
              return ++counter;
            }

            // snapshotState and restoreState follow on the next slide
          }
  28. Example: A Stateful Map (counter)
          import org.apache.flink.api.common.functions.MapFunction;
          import org.apache.flink.streaming.api.checkpoint.Checkpointed;

          public class Counter implements MapFunction<Long, Long>, Checkpointed<Long> {

            // persistent counter
            private long counter = 0;

            @Override
            public Long map(Long value) {
              return ++counter;
            }

            // regularly persists state during normal operation
            @Override
            public Long snapshotState(long checkpointId, long checkpointTimestamp) {
              return counter;
            }

            // restores state on recovery from failure
            @Override
            public void restoreState(Long state) {
              counter = state;
            }
          }
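
How the two checkpointing methods interact on failure can be traced with a small simulation. This sketch does not use Flink: the nested `Counter` is a stripped-down stand-in for the class on the slide, and the "failure" is simply a fresh instance. It assumes the source can replay the record that was processed after the last snapshot.

```java
// Illustrative simulation only; no Flink classes are involved.
final class CounterRecovery {

    // Stand-in for the slide's Counter, reduced to plain long values.
    static final class Counter {
        private long counter = 0;
        long map(long value) { return ++counter; }
        long snapshotState() { return counter; }
        void restoreState(long state) { counter = state; }
    }

    // Processes two records, snapshots, processes a third, then "fails".
    // Returns the count after recovery and replay of the third record.
    static long runWithFailure() {
        Counter operator = new Counter();
        operator.map(10);
        operator.map(20);
        long snapshot = operator.snapshotState();   // state at checkpoint: 2

        operator.map(30);                           // progress after the checkpoint is lost

        Counter recovered = new Counter();          // failure: operator restarts empty
        recovered.restoreState(snapshot);           // back to the checkpointed state
        return recovered.map(30);                   // assumed replay of the lost record
    }
}
```

The third record is counted exactly once in the recovered state, because the restored counter does not include it and the replay adds it again.
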
  29. Distributed Snapshots
      Diagram: timeline t1, t2, t3 with snapshots "snap - t1" and "snap - t2"; reset from snap t2
      Assumptions
      • repeatable sources
      • reliable FIFO channels
  30. Taking Snapshots
      Diagram: timeline t1, t2, t3 with snapshots "snap - t1" and "snap - t2"; reset from snap t2
      Initial approach (e.g., Naiad)
      • Pause execution on t1, t2, ...
      • Collect state
      • Restore execution
  31. Asynchronous Snapshots in Flink
      Push checkpoint barriers through the data flow
      • Before barrier → part of the snapshot
      • After barrier → not in snapshot (backup till next snapshot)
      [Carbone et al. 2015] “Lightweight Asynchronous Snapshots for Distributed Dataflows”, Tech. Report. http://arxiv.org/abs/1506.08603
  32. Asynchronous Snapshots in Flink
      Push checkpoint barriers through the data flow
      • Before barrier → part of the snapshot
      • After barrier → not in snapshot (backup till next snapshot)
      Operator states in the diagram: checkpoint starting, checkpoint in progress, checkpoint done
      [Carbone et al. 2015] “Lightweight Asynchronous Snapshots for Distributed Dataflows”, Tech. Report. http://arxiv.org/abs/1506.08603
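
The barrier rule can be sketched for the simplest case of one operator with a single input channel. This is plain Java, not Flink's implementation: `BarrierSketch`, `BARRIER`, and the summing operator are assumptions made for illustration, and operators with several inputs (which need additional coordination between channels) are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: single-input operator, hypothetical names.
final class BarrierSketch {

    // Marker object that travels in the stream between ordinary records.
    static final Object BARRIER = new Object();

    // Consumes Integer records and BARRIER markers in stream order.
    // The operator state is a running sum; each barrier captures the
    // state as it is at that point, without stopping the stream.
    static List<Integer> snapshots(List<Object> stream) {
        List<Integer> captured = new ArrayList<>();
        int sum = 0;
        for (Object element : stream) {
            if (element == BARRIER) {
                captured.add(sum);        // records before the barrier are included
            } else {
                sum += (Integer) element; // records after it count toward the next snapshot
            }
        }
        return captured;
    }
}
```

For the stream 1, 2, barrier, 3, barrier, 4 the captured states are 3 and 6; the record 4 arrives after the last barrier and is therefore in no snapshot yet.
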
  33. Closing
  34. Community
      Flink started as the Stratosphere project in 2009, led by TU Berlin. It entered incubation in April 2014 and graduated in December 2014. After over a year in the Apache Software Foundation, it is now one of the most active big data projects.
  35. tl;dr: what was this about?
      • The Berlin Big Data Center
      • Native Streaming with Apache Flink
      • Flexible Windowing
      • Fault Tolerance with exactly-once guarantees
      • Large (and growing!) community
  36. Outlook: Introducing the BBDC
      http://bbdc.berlin
  37. BBDC Technology (10,000-foot view)
  38. http://flink-forward.org
  39. Thank you
      If you find this exciting, get involved on Flink’s mailing list or stay tuned by subscribing to news@flink.apache.org, following flink.apache.org/blog, and following @ApacheFlink on Twitter.