
Apache Beam: A unified model for batch and stream processing data


  1. Abstract: Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in the programming model. During the talk, we'll argue why Beam is unified, efficient, and portable.
  2. Davor Bonaci, Apache Beam PPMC; Software Engineer, Google Inc. Apache Beam: A Unified Model for Batch and Streaming Data Processing. Hadoop Summit, June 28-30, 2016, San Jose, CA
  3. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  4. What is Apache Beam? 1. The Beam Model: What / Where / When / How. 2. SDKs for writing Beam pipelines: Java and Python. 3. Runners for existing distributed processing backends: Apache Flink, Apache Spark, Google Cloud Dataflow, and a local runner for testing. [Diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines; Fn runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  5. The Evolution of Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and MillWheel → Google Cloud Dataflow → Apache Beam
  6. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  7. Data...
  8. ...can be big...
  9. ...really, really big... [Figure: data accumulating across Tuesday, Wednesday, Thursday]
  10. ... maybe infinitely big... [Figure: an unbounded stream along a processing-time axis, 8:00-14:00]
  11. ... with unknown delays. [Figure: elements with 8:00 event times arriving well after later data]
  12. Formalizing Event-Time Skew. [Figure: processing time vs. event time; the ideal line vs. the actual, skewed reality.]
  13. Formalizing Event-Time Skew. Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. [Figure: processing time vs. event time, with the ideal line and an approximate watermark.]
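The watermark idea on this slide can be sketched in plain Java (this is an illustrative toy, not the Beam API; the class and its fixed-skew heuristic are made up for the example): the watermark trails the maximum observed event time by an assumed bound on skew, and data behind it is late.

```java
// Illustrative sketch of a heuristic watermark (not Beam code).
class HeuristicWatermark {
    private long maxEventTime = Long.MIN_VALUE;
    private final long allowedSkewMillis; // assumed bound on event-time skew

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    // Advance as elements arrive in processing-time order.
    void observe(long eventTimeMillis) {
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
    }

    // The heuristic claim: "no timestamp earlier than this will be seen."
    long current() {
        return maxEventTime == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTime - allowedSkewMillis;
    }

    // An element is late if its event time is already behind the watermark.
    boolean isLate(long eventTimeMillis) {
        return eventTimeMillis < current();
    }
}
```

Because the skew bound is only a guess, the two failure modes on the slide fall out directly: a large `allowedSkewMillis` delays results, a small one makes more data late.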
  14. What are you computing? Where in event time? When in processing time? How do refinements relate?
  15. What are you computing? Element-wise, aggregating, and composite transformations.
  16. What: Computing Integer Sums
     // Collection of raw log lines
     PCollection<String> raw = IO.read(...);
     // Element-wise transformation into team/score pairs
     PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()));
     // Composite transformation containing an aggregation
     PCollection<KV<String, Integer>> scores = input.apply(Sum.integersPerKey());
  17. What: Computing Integer Sums
  18. What: Computing Integer Sums
  19. Where in event time? Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. [Diagram: fixed, sliding, and session windows across keys over time.]
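Conceptually, fixed windowing assigns each element to a window by flooring its event timestamp to a multiple of the window size; a minimal plain-Java sketch of that idea (illustrative names, not the Beam API):

```java
// Illustrative sketch: which fixed window does an event timestamp fall into?
class FixedWindowAssign {
    // Returns the start of the fixed window containing the timestamp.
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }
}
```

Sliding windows assign each element to several such windows, and sessions merge per-key windows that fall within a gap of each other; the fixed case is the simplest to state.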
  20. Where: Fixed 2-minute Windows
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2))))
       .apply(Sum.integersPerKey());
  21. Where: Fixed 2-minute Windows
  22. When in processing time? Triggers control when results are emitted. Triggers are often relative to the watermark. [Figure: processing time vs. event time, with the ideal line and an approximate watermark.]
  23. When: Triggering at the Watermark
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()))
       .apply(Sum.integersPerKey());
  24. When: Triggering at the Watermark
  25. When: Early and Late Firings
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1))))
       .apply(Sum.integersPerKey());
  26. When: Early and Late Firings
  27. How do refinements relate? How should multiple outputs per window accumulate? The appropriate choice depends on the consumer.

     Firing mode         | Speculative [3] | Watermark [5, 1] | Late [2] | Last Observed | Total Observed
     Discarding          | 3               | 6                | 2        | 2             | 11
     Accumulating        | 3               | 9                | 11       | 11            | 23
     Acc. & Retracting   | 3               | 9, -3            | 11, -9   | 11            | 11

     (Accumulating & Retracting not yet implemented.)
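The table's numbers for the two implemented modes can be reproduced with a small plain-Java sketch (illustrative only, not Beam code): each firing is the batch of elements that arrived since the previous one.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of per-window pane output under the two
// implemented accumulation modes.
class PaneAccumulation {
    // Discarding: each firing emits only the sum of elements since the last firing.
    static int[] discarding(List<int[]> firings) {
        return firings.stream()
                .mapToInt(f -> Arrays.stream(f).sum())
                .toArray();
    }

    // Accumulating: each firing emits the running total of all elements so far.
    static int[] accumulating(List<int[]> firings) {
        int running = 0;
        int[] out = new int[firings.size()];
        for (int i = 0; i < firings.size(); i++) {
            running += Arrays.stream(firings.get(i)).sum();
            out[i] = running;
        }
        return out;
    }
}
```

With the firings from the table, [3], [5, 1], [2], discarding yields 3, 6, 2 and accumulating yields 3, 9, 11, matching the first two table rows; retracting additionally emits the negation of the previous pane.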
  28. How: Add Newest, Remove Previous
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  29. How: Add Newest, Remove Previous
  30. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  31. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  32. Distributed Systems are Distributed
  33. Event Time Results are Stable
  34. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  35. Sessions
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  36. Identifying Bursts of User Activity
  37. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  38. 1. Classic Batch. 2. Batch with Fixed Windows. 3. Streaming. 4. Streaming with Speculative + Late Data. 5. Streaming with Retractions. 6. Sessions
  39. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  40. 1. Classic Batch:
     PCollection<KV<String, Integer>> scores = input
       .apply(Sum.integersPerKey());

     2. Batch with Fixed Windows:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2))))
       .apply(Sum.integersPerKey());

     3. Streaming:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()))
       .apply(Sum.integersPerKey());

     4. Streaming with Speculative + Late Data:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1))))
       .apply(Sum.integersPerKey());

     5. Streaming with Retractions:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());

     6. Sessions:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  41. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  42. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  43. Workloads in pipelines vary over time: batch pipelines go through stages, and a streaming pipeline's input varies. [Charts: workload over time.]
  44. Perils of fixed decisions: over-provisioned for the worst case, or under-provisioned for the average case. [Charts: workload over time.]
  45. Ideal case. [Chart: provisioning tracks workload over time.]
  46. Solution: bundles. User code operates on bundles of elements: easy parallelization, dynamic sizing, and parallelism decisions in the runner's hands.
     class MyDoFn extends DoFn<String, String> {
       void startBundle(...) { }
       void processElement(...) { }
       void finishBundle(...) { }
     }
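The bundle lifecycle this slide describes can be sketched as a tiny plain-Java mini-runner (illustrative only; `SimpleDoFn` and `BundleRunner` are made-up names, not Beam's execution machinery): the runner decides the bundle, then drives startBundle once, processElement per element, and finishBundle once.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative DoFn-like interface: one bundle-start hook, one per-element
// hook, one bundle-finish hook.
interface SimpleDoFn<I, O> {
    default void startBundle(List<O> out) {}
    void processElement(I element, List<O> out);
    default void finishBundle(List<O> out) {}
}

// Illustrative runner: because bundle boundaries belong to the runner, it can
// size bundles dynamically and process different bundles in parallel.
class BundleRunner {
    static <I, O> List<O> runBundle(SimpleDoFn<I, O> fn, List<I> bundle) {
        List<O> out = new ArrayList<>();
        fn.startBundle(out);
        for (I e : bundle) {
            fn.processElement(e, out);
        }
        fn.finishBundle(out);
        return out;
    }
}
```

For example, a `SimpleDoFn` that uppercases strings can be run over any bundle the runner chooses, and the user code never sees where the bundle boundaries fell.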
  47. The Straggler Problem. Work is unevenly distributed across tasks, due to both the underlying data and the processing itself; the effects are multiplied per stage. [Chart: per-worker completion times.]
  48. "Standard" workarounds for stragglers: split files into equal sizes? Pre-emptively over-split? Detect slow workers and re-execute? Sample extensively and then split? [Chart: per-worker completion times.]
  49. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
  50. Solution: Dynamic Work Rebalancing. [Diagram: done, active, and predicted work per worker; splitting a straggler's remaining work now pulls the average completion time earlier.]
  51. Solution: Dynamic work rebalancing
     class MyReader extends BoundedReader<T> {
       [...]
       getFractionConsumed() { }
       splitAtFraction(...) { }
     }
     class MySource extends BoundedSource<T> {
       [...]
       splitIntoBundles() { }
       getEstimatedSizeBytes() { }
     }
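The two reader methods on this slide can be illustrated with a plain-Java sketch over a dense integer range (illustrative only, with simplified semantics; not the real BoundedReader contract): the reader reports how far through its range it is, and a runner can ask it to give away the unread tail at some fraction.

```java
// Illustrative sketch of split-at-fraction over a dense [start, end) range.
class RangeReader {
    private final long start;
    private long end;      // exclusive; shrinks when the range is split
    private long current;  // position of the last consumed element

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.current = start;
    }

    // Consume the next element, if any remain in the (possibly shrunken) range.
    boolean advance() {
        if (current >= end) return false;
        current++;
        return true;
    }

    // How much of the reader's current range has been consumed.
    double getFractionConsumed() {
        return (double) (current - start) / (end - start);
    }

    // Give away the tail of the range starting at the given fraction.
    // Returns the split position, or -1 if the reader is already past it.
    long splitAtFraction(double fraction) {
        long splitPos = start + (long) Math.ceil(fraction * (end - start));
        if (splitPos <= current || splitPos >= end) return -1;
        end = splitPos; // this reader keeps [start, splitPos)
        return splitPos;
    }
}
```

This is what lets a runner react at run-time: when one worker lags, its reader's remaining range is split and the tail is handed to an idle worker, rather than hoping upfront splits were sized correctly.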
  52. Real world example
  53. Dynamic bundles + work re-balancing + autoscaling
  54. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  55. Apache Beam Architecture. 1. Write: choose an SDK to write your pipeline in. 2. Execute: choose any runner at execution time. [Diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines; Fn runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  56. Categorizing Runner Capabilities: http://beam.incubator.apache.org/capability-matrix/
  57. Multiple categories of users. 1. End users, who want to write pipelines or transform libraries in a language that's familiar. 2. SDK writers, who want to make Beam concepts available in new languages. 3. Runner writers, who have a distributed processing environment and want to support Beam pipelines. [Diagram: SDKs construct pipelines; runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  58. Growing the Open Source Community. If you have Big Data APIs, write a Beam SDK, DSL, or library of transformations. If you have a distributed processing backend, write a Beam runner! If you have a data storage or messaging system, write a Beam IO connector!
  59. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  60. Visions are a Journey. 02/01/2016: enter Apache Incubator. 02/25/2016: 1st commit to ASF repository. Early 2016: design for use cases, begin refactoring. Mid 2016: additional refactoring, non-production uses. 06/14/2016: 1st incubating release. June 2016: Python SDK moves to Beam. Late 2016: multiple runners execute Beam pipelines. End 2016: Beam pipelines run on many runners in production uses.
  61. Learn More! Apache Beam (incubating): http://beam.incubator.apache.org. The World Beyond Batch 101 & 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 and https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. Join the Beam mailing lists: user-subscribe@beam.incubator.apache.org and dev-subscribe@beam.incubator.apache.org. Follow @ApacheBeam on Twitter.
  62. Thank you!
  63. Extra material
  64. Element-wise transformations. [Figure: elements transformed one by one along a processing-time axis, 8:00-14:00.]
  65. Aggregating via Processing-Time Windows. [Figure: elements grouped by arrival time along a processing-time axis, 8:00-14:00.]
  66. Aggregating via Event-Time Windows. [Figure: input in processing time mapped to output windows in event time, 10:00-15:00.]
  67. Processing Time Results Differ
  68. Identifying Bursts of User Activity
  69. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  70. Calculating Session Lengths
     input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark())
         .discardingFiredPanes())
       .apply(CalculateWindowLength());
  71. Calculating the Average Session Length
     input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark())
         .discardingFiredPanes())
       .apply(CalculateWindowLength())
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1))))
         .accumulatingFiredPanes())
       .apply(Mean.globally());
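The session-length computation in these snippets can be approximated in plain Java (illustrative only; real Beam merges session windows per key and per pipeline, here we just scan a sorted list of one user's event timestamps): a session closes once the gap since the last event reaches the gap duration, and its length is last event minus first.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of sessionization by gap duration over one key's
// sorted event timestamps.
class SessionLengths {
    static List<Long> lengths(List<Long> sortedTimes, long gapMillis) {
        List<Long> out = new ArrayList<>();
        if (sortedTimes.isEmpty()) return out;
        long first = sortedTimes.get(0);
        long last = first;
        for (long t : sortedTimes.subList(1, sortedTimes.size())) {
            if (t - last >= gapMillis) {
                // Gap reached: close the current session.
                out.add(last - first);
                first = t;
            }
            last = t;
        }
        out.add(last - first); // close the final session
        return out;
    }
}
```

The second pipeline on the slide then just averages these per-session lengths inside fixed windows, which is why it can re-window the session-length outputs with `FixedWindows` before applying `Mean.globally()`.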
