Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry

Apache Beam (unified Batch and strEAM processing!) is a new Apache incubator project. Originally based on years of experience developing Big Data infrastructure within Google (such as MapReduce, FlumeJava, and MillWheel), it has now been donated to the OSS community at large.
Come learn about the fundamentals of out-of-order stream processing, and how Beam’s powerful tools for reasoning about time greatly simplify this complex task. Beam provides a model that allows developers to focus on the four important questions that must be answered by any stream processing pipeline:
What results are being calculated?
Where in event time are they calculated?
When in processing time are they materialized?
How do refinements of results relate?
Furthermore, because Beam cleanly separates these questions from runtime characteristics, Beam programs are portable across multiple runtime environments, both proprietary (e.g., Google Cloud Dataflow) and open-source (e.g., Flink and Spark).

Published in: Engineering

  1. 1. Fundamentals of Stream Processing with Apache Beam (incubating). Frances Perry & Tyler Akidau (@francesjperry, @takidau), Apache Beam Committers & Google Engineers. Kafka Summit, April 2016
  2. 2. NOTE: These slides are not being actively maintained. For up to date presentations on Apache Beam, please see: beam.incubator.apache.org/presentation-materials/
  3. 3. Agenda: (1) Infinite, Out-of-Order Data Sets; (2) What, Where, When, How; (3) Reasons This is Awesome; (4) Apache Beam (incubating)
  4. 4. Part 1: Infinite, Out-of-Order Data Sets
  5. 5. Data...
  6. 6. ...can be big...
  7. 7. ...really, really big... [figure: data spanning Tuesday, Wednesday, Thursday]
  8. 8. ...maybe infinitely big... [figure: hourly timeline, 1:00 to 14:00]
  9. 9. ...with unknown delays. [figure: hourly timeline, 8:00 to 14:00]
  10. 10. Element-wise transformations [figure: Processing Time axis, 8:00 to 14:00]
  11. 11. Aggregating via Processing-Time Windows [figure: Processing Time axis, 8:00 to 14:00]
  12. 12. Aggregating via Event-Time Windows [figure: Input and Output panels on Event Time and Processing Time axes, 10:00 to 15:00]
  13. 13. Formalizing Event-Time Skew [figure: processing time plotted against event time, showing the Ideal line, the Reality curve, and the Skew between them]
  14. 14. Formalizing Event-Time Skew. Watermarks describe event-time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. [figure: processing time plotted against event time, showing the Ideal line, the ~Watermark curve, and the Skew between them]
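The watermark trade-off on slide 14 can be illustrated with a minimal plain-Java sketch. It does not use the Beam SDK; the class `HeuristicWatermark` and its fixed `allowedSkewMillis` bound are hypothetical, chosen only to show one simple heuristic ("largest event time seen so far, minus an assumed skew").

```java
/** Toy heuristic watermark (hypothetical, not a Beam API). */
final class HeuristicWatermark {
    // Assumption: out-of-orderness is bounded by a fixed amount.
    private final long allowedSkewMillis;
    private long maxEventTimeMillis = Long.MIN_VALUE;

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    /** Current watermark estimate: max event time seen, minus the assumed skew. */
    long current() {
        return maxEventTimeMillis == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeMillis - allowedSkewMillis;
    }

    /** Observes one element; returns true if it arrived behind the watermark (late). */
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < current();
        maxEventTimeMillis = Math.max(maxEventTimeMillis, eventTimeMillis);
        return late;
    }
}
```

In this sketch, a small `allowedSkewMillis` makes the watermark advance quickly, so more elements are classified as late; a large value keeps it behind, so results that wait for the watermark are delayed.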
  15. 15. Part 2: What, Where, When, How
  16. 16. What are you computing? Where in event time? When in processing time? How do refinements relate?
  17. 17. What are you computing? Element-wise, aggregating, and composite transformations.
  18. 18. What: Computing Integer Sums // Collection of raw log lines PCollection<String> raw = IO.read(...); // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn())); // Composite transformation containing an aggregation PCollection<KV<String, Integer>> scores = input.apply(Sum.integersPerKey());
  19. 19. What: Computing Integer Sums
  20. 20. What: Computing Integer Sums
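As a rough illustration of the "What" step, the following plain-Java sketch mimics the parse-then-sum pipeline without the Beam SDK. The `team,score` line format and the names `SumPerKey`, `parse`, and `sumPerKey` are assumptions made for this example; the slides do not specify what `ParseFn` does.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java analogue of ParseFn followed by a per-key integer sum (hypothetical names). */
final class SumPerKey {
    /** Element-wise step: parses an assumed "team,score" line into a key/value pair. */
    static Map.Entry<String, Integer> parse(String line) {
        String[] parts = line.split(",");
        return Map.entry(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }

    /** Aggregating step: sums the scores for each team. */
    static Map<String, Integer> sumPerKey(List<String> rawLines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : rawLines) {
            Map.Entry<String, Integer> kv = parse(line);
            totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

For example, `SumPerKey.sumPerKey(List.of("red,3", "blue,5", "red,4"))` yields a total of 7 for `red` and 5 for `blue`.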
  21. 21. Where in event time? Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. [figure: Fixed, Sliding, and Sessions windows shown for Key 1, Key 2, and Key 3 along a Time axis]
  22. 22. Where: Fixed 2-minute Windows PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)))) .apply(Sum.integersPerKey());
  23. 23. Where: Fixed 2-minute Windows
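Fixed windowing can be sketched in plain Java as follows. This is not the Beam implementation; `FixedWindowSums` is a hypothetical helper that assumes windows are aligned to the epoch and that each event is a `{eventTimeMillis, value}` pair.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy fixed-window aggregation keyed by window start (hypothetical, not a Beam API). */
final class FixedWindowSums {
    /** Start of the fixed window containing the timestamp (windows aligned to the epoch). */
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return Math.floorDiv(eventTimeMillis, sizeMillis) * sizeMillis;
    }

    /** Sums values per window; each event is {eventTimeMillis, value}. */
    static Map<Long, Long> sumPerWindow(List<long[]> events, long sizeMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] event : events) {
            sums.merge(windowStart(event[0], sizeMillis), event[1], Long::sum);
        }
        return sums;
    }
}
```

Because the window is derived from the event timestamp alone, the result does not depend on the order in which events arrive.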
  24. 24. When in processing time? • Triggers control when results are emitted. • Triggers are often relative to the watermark. [figure: processing time plotted against event time, showing the Ideal line, the ~Watermark curve, and the Skew between them]
  25. 25. When: Triggering at the Watermark PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark())) .apply(Sum.integersPerKey());
  26. 26. When: Triggering at the Watermark
  27. 27. When: Early and Late Firings PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());
  28. 28. When: Early and Late Firings
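One way to picture early, on-time, and late firings is to classify each firing by where the watermark stands relative to the end of the window. The sketch below is plain Java, not the Beam SDK; `PaneTiming` and its `fire` method are hypothetical names, and the rule that the first firing at or past the window end counts as on-time is a simplifying assumption.

```java
/** Toy classifier for firings of a single window (hypothetical, not a Beam API). */
final class PaneTiming {
    enum Timing { EARLY, ON_TIME, LATE }

    private final long windowEndMillis;
    private boolean onTimeFired = false;

    PaneTiming(long windowEndMillis) {
        this.windowEndMillis = windowEndMillis;
    }

    /** Classifies a firing given the watermark at the moment the trigger fires. */
    Timing fire(long watermarkMillis) {
        if (watermarkMillis < windowEndMillis) {
            return Timing.EARLY;      // speculative result, window may still receive data
        }
        if (!onTimeFired) {
            onTimeFired = true;
            return Timing.ON_TIME;    // watermark has passed the end of the window
        }
        return Timing.LATE;           // data arrived after the on-time firing
    }
}
```

For a window ending at 120000 ms, firings at watermarks 60000 and 90000 are early, the firing at 120000 is on-time, and any subsequent firing is late.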
  29. 29. How do refinements relate? • How should multiple outputs per window accumulate? • Appropriate choice depends on consumer. (Accumulating & Retracting not yet implemented.)
          Firing              Speculative   Watermark   Late      Last Observed   Total Observed
          Elements            [3]           [5, 1]      [2]
          Discarding          3             6           2         2               11
          Accumulating        3             9           11        11              23
          Acc. & Retracting   3             9, -3       11, -9    11              11
  30. 30. How: Add Newest, Remove Previous PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  31. 31. How: Add Newest, Remove Previous
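The three accumulation modes from slide 29 can be reproduced with a short plain-Java sketch. It is independent of the Beam SDK; `AccumulationModes` and its method names are hypothetical. Each input list holds the elements that arrived before one firing, and each output list holds the values emitted by that firing.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of discarding, accumulating, and accumulating-and-retracting panes. */
final class AccumulationModes {
    private static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    /** Discarding: each pane reports only what arrived since the previous firing. */
    static List<List<Integer>> discarding(List<List<Integer>> firings) {
        List<List<Integer>> panes = new ArrayList<>();
        for (List<Integer> firing : firings) panes.add(List.of(sum(firing)));
        return panes;
    }

    /** Accumulating: each pane reports the running total for the window. */
    static List<List<Integer>> accumulating(List<List<Integer>> firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int total = 0;
        for (List<Integer> firing : firings) {
            total += sum(firing);
            panes.add(List.of(total));
        }
        return panes;
    }

    /** Accumulating and retracting: running total plus a retraction of the previous pane. */
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> firings) {
        List<List<Integer>> panes = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (List<Integer> firing : firings) {
            int previous = total;
            total += sum(firing);
            panes.add(first ? List.of(total) : List.of(total, -previous));
            first = false;
        }
        return panes;
    }
}
```

With the firings `[3]`, `[5, 1]`, `[2]` from the slide, the sketch emits 3, 6, 2 when discarding; 3, 9, 11 when accumulating; and 3, then 9 with -3, then 11 with -9 when accumulating and retracting. Summing everything a consumer observes gives 11, 23, and 11 respectively, which is why the appropriate mode depends on the consumer.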
  32. 32. Part 3: Reasons This is Awesome
  33. 33. Correctness Power Composability Flexibility Modularity What / Where / When / How
  34. 34. Correctness Power Composability Flexibility Modularity What / Where / When / How
  35. 35. Distributed Systems are Distributed
  36. 36. Processing Time Results Differ
  37. 37. Event Time Results are Stable
  38. 38. Correctness Power Composability Flexibility Modularity What / Where / When / How
  39. 39. Sessions PCollection<KV<String, Integer>> scores = input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  40. 40. Identifying Bursts of User Activity
  41. 41. Identifying Bursts of User Activity
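Session windowing can be approximated in plain Java by giving each event a window of `[t, t + gap)` and merging windows that overlap. The sketch below is not the Beam implementation; `SessionWindows` is a hypothetical helper that handles a single key and sorts the events first for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy session windowing for one key (hypothetical, not a Beam API). */
final class SessionWindows {
    /**
     * Returns sessions as {startMillis, endMillis} pairs, where each event at time t
     * contributes the half-open interval [t, t + gapMillis) and overlapping intervals merge.
     */
    static List<long[]> sessions(long[] eventTimesMillis, long gapMillis) {
        long[] sorted = eventTimesMillis.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                last[1] = t + gapMillis;                      // event falls inside the session: extend it
            } else {
                result.add(new long[] {t, t + gapMillis});    // gap exceeded: start a new session
            }
        }
        return result;
    }
}
```

With a one-minute gap, events at 0 s, 30 s, 70 s, and 200 s form two sessions: one covering 0 s to 130 s and one covering 200 s to 260 s. Unlike fixed windows, the boundaries are determined by the data itself.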
  42. 42. Correctness Power Composability Flexibility Modularity What / Where / When / How
  43. 43. Calculating Session Lengths input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .triggering(AtWatermark()) .discardingFiredPanes()) .apply(CalculateWindowLength());
  44. 44. Calculating the Average Session Length input .apply(Window.into(Sessions.withGapDuration(Minutes(1))) .triggering(AtWatermark()) .discardingFiredPanes()) .apply(CalculateWindowLength()) .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1)))) .accumulatingFiredPanes()) .apply(Mean.globally());
  45. 45. Correctness Power Composability Flexibility Modularity What / Where / When / How
  46. 46. 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with Speculative + Late Data 5. Streaming with Retractions 6. Sessions
  47. 47. Correctness Power Composability Flexibility Modularity What / Where / When / How
  48. 48. 1. Classic Batch: PCollection<KV<String, Integer>> scores = input .apply(Sum.integersPerKey());
          2. Batch with Fixed Windows: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2)))) .apply(Sum.integersPerKey());
          3. Streaming: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark())) .apply(Sum.integersPerKey());
          4. Streaming with Speculative + Late Data: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());
          5. Streaming with Retractions: PCollection<KV<String, Integer>> scores = input .apply(Window.into(FixedWindows.of(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
          6. Sessions: PCollection<KV<String, Integer>> scores = input .apply(Window.into(Sessions.withGapDuration(Minutes(2))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Minutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  49. 49. Correctness Power Composability Flexibility Modularity What / Where / When / How
  50. 50. Part 4: Apache Beam (incubating)
  51. 51. The Evolution of Beam [figure: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel; Google Cloud Dataflow; Apache Beam]
  52. 52. What is Part of Apache Beam? 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines, starting with Java 3. Runners for Existing Distributed Processing Backends • Apache Flink (thanks to data Artisans) • Apache Spark (thanks to Cloudera) • Google Cloud Dataflow (fully managed service) • Local (in-process) runner for testing
  53. 53. Apache Beam Technical Vision 1. End users, who want to write pipelines or transform libraries in a language that’s familiar. 2. SDK writers, who want to make Beam concepts available in new languages. 3. Runner writers, who have a distributed processing environment and want to support Beam pipelines. [figure: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Runner A, Runner B, Cloud Dataflow; Beam Model: Fn Runners; Execution]
  54. 54. Visions are a Journey • 02/01/2016: Enter Apache Incubator • 02/25/2016: 1st commit to ASF repository • Early 2016: Internal API redesign, slight chaos • Mid 2016: API stabilization • Late 2016: Multiple runners execute Beam pipelines
  55. 55. Categorizing Runner Capabilities http://beam.incubator.apache.org/capability-matrix/
  56. 56. Growing the Beam Community. Collaborate: Beam is becoming a community-driven effort with participation from many organizations and contributors. Grow: We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
  57. 57. Learn More! Apache Beam (incubating): http://beam.incubator.apache.org. The World Beyond Batch 101 & 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 and https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. Join the Beam mailing lists: user-subscribe@beam.incubator.apache.org, dev-subscribe@beam.incubator.apache.org. Follow @ApacheBeam on Twitter (and @francesjperry and @takidau too!)
  58. 58. Thank you!
