
Streaming Data Pipelines With Apache Beam

Presented at All Things Open 2022
Presented by Danny McCormick

Title: Streaming Data Pipelines With Apache Beam
Abstract: Handling big data presents big problems. Along with traditional concerns like scalability and performance, the increasingly common need for live streaming data processing introduces problems like late or incomplete data from flaky data sources. Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines that addresses these challenges. Using one of the open source Beam SDKs, you can build a program that defines a pipeline to be executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

This talk will explore some problems associated with processing large datasets at scale and how you can write Apache Beam pipelines that address those issues. It will include a demo of a basic Beam streaming pipeline.

Takeaways: an understanding of some challenges associated with large datasets, the Apache Beam model, and how to write a basic Beam streaming pipeline

Audience: anyone dealing with big datasets or interested in data processing at scale.

Streaming Data Pipelines With Apache Beam

  1. Streaming Data Pipelines with Apache Beam (Danny McCormick)
  2. Agenda ● Who am I ● What is Apache Beam ● Beam Basics ● Processing streaming data ● Demo
  3. Who am I
  4. Me!
  5. What is Apache Beam
  6. In the beginning, there was MapReduce [diagram: Datastore -> Map phase -> Shuffle -> Reduce phase -> Datastore]
  7. In the beginning, there was MapReduce
  8. Then came Flume (and Spark, Flink, and many more) [diagram: a multi-stage graph of Map, Group by Key (Reduce), and Combine steps connecting several Datastores]
  9. From Flume came Beam [diagram: the same multi-stage Map / Group by Key (Reduce) / Combine graph]
  10. Unified Model for Batch and Streaming ● Batch processing is a special case of stream processing ● Batch + Stream = Beam
  11. Build your pipeline in whatever language(s) you want… [pipeline diagram]
  12. … with whatever execution engine you want: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams [pipeline diagram]
  13. Beam Basics
  14. Terms ● PCollection - distributed multi-element dataset ● Transform - operation that takes N PCollections and produces M PCollections ● Pipeline - directed acyclic graph of Transforms and PCollections
  15. Basic Beam Graph [diagram: Source Transforms feeding Map and Combine Transforms, which fan out into Sink Transforms]
  16. Basic Beam Pipeline
      import apache_beam as beam

      def add_one(element):
          return element + 1

      with beam.Pipeline() as pipeline:
          (pipeline
           | beam.io.ReadFromText('gs://some/inputData.txt')
           | beam.Map(add_one)
           | beam.io.WriteToText('gs://some/outputData'))
      [diagram: Read text file -> Map Transform -> Write to text file]
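
Slide 16's pipeline needs a GCS bucket to run. As a minimal runnable sketch of the same shape (the in-memory input values and print output are assumptions for illustration), it works locally on the DirectRunner:

      import apache_beam as beam

      def add_one(element):
          return element + 1

      with beam.Pipeline() as pipeline:   # DirectRunner by default
          (pipeline
           | beam.Create([1, 2, 3])       # in-memory stand-in for the gs:// input
           | beam.Map(add_one)            # same element-wise transform as the slide
           | beam.Map(print))             # prints 2, 3, 4 instead of writing to gs://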
  17. How to use Beam to process huge amounts of streaming data
  18. We want to go from this:
  19. To this: Monday Tuesday Wednesday Thursday Friday
  20. To this: [timeline divided into hourly slices, 8:00 through 14:00]
  21. Streaming data might be: ● Late ● Incomplete ● Rate limited ● Infinite
  22. You will need to make tradeoffs between: ● Cost ● Completeness ● Low Latency
  23. Example 1: Billing Pipeline [tradeoff chart: Completeness, Low Latency, Low Cost, each rated from Important to Not Important]
  24. Example 2: Billing Estimator [tradeoff chart: Completeness, Low Latency, Low Cost, each rated from Important to Not Important]
  25. Example 3: Fraud Detection [tradeoff chart: Completeness, Low Latency, Low Cost, each rated from Important to Not Important]
  26. Windows [diagram: events on an 8:00-14:00 timeline grouped into windows, each aggregated or output]
  27. Fixed Windows [diagram: the timeline divided into equal, non-overlapping windows, each aggregated or output]
  28. Sliding Windows [diagram: overlapping windows, each aggregated or output]
  29. Sliding Windows [diagram: overlapping windows on the 8:00-14:00 timeline, each aggregated or output]
  30. Session Windows [diagram: windows bounded by gaps in activity, each aggregated or output]
  31. Global Window [diagram: a single window spanning the entire timeline]
  32. Code
      ● items | beam.WindowInto(window.FixedWindows(60))       # 60s fixed windows
      ● items | beam.WindowInto(window.SlidingWindows(30, 5))  # 30s sliding window every 5s
      ● items | beam.WindowInto(window.Sessions(10 * 60))      # new window after a 10 min gap in data
      ● items | beam.WindowInto(window.GlobalWindows())        # single global window
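
Windowing only does something once elements carry event-time timestamps. A runnable sketch (key name, values, and timestamps are invented for illustration): attach timestamps with TimestampedValue, window, then aggregate per key so each window produces its own result:

      import apache_beam as beam
      from apache_beam.transforms import window

      with beam.Pipeline() as pipeline:
          (pipeline
           | beam.Create([('clicks', 10), ('clicks', 75), ('clicks', 130)])
           # Use the second field as the event-time timestamp, in seconds.
           | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
           | beam.WindowInto(window.FixedWindows(60))  # elements land in three 60s windows
           | beam.CombinePerKey(sum)                   # per-key count within each window
           | beam.Map(print))                          # ('clicks', 1) printed once per window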
  33. Real Time vs Event Time - Expectation [chart: event time vs. processing time]
  34. Real Time vs Event Time - Reality [chart: event time vs. processing time]
  35. How do we know it's safe to finish a window's work? [chart: event time vs. processing time]
  36. Processing Time? [chart: event time vs. processing time]
  37. Processing Time? Lots of late data won't be counted [chart: event time vs. processing time]
  38. Beam's Solution - Watermarks! [chart: event time vs. processing time]
  39. Watermarks ● Beam's notion of when data is complete ● When a watermark passes the end of a window, additional data is late ● Beam has several built-in watermark estimators
  40-52. Example: Timestamp observing estimation [animated sequence of event-time vs. processing-time charts: the watermark is estimated from the timestamps observed so far and advances as elements arrive; the final frame marks data arriving behind the watermark as Late Data*]
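
One way to see this behavior concretely is Beam's TestStream testing utility, which lets you script watermark movement. A hedged sketch (timestamps and values are invented): the watermark is advanced past the end of a 60-second window, then an element with an older timestamp arrives and, under the default trigger with zero allowed lateness, is dropped as late data:

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
      from apache_beam.testing.test_stream import TestStream
      from apache_beam.transforms import window

      options = PipelineOptions()
      options.view_as(StandardOptions).streaming = True  # TestStream requires streaming mode

      stream = (TestStream()
                .add_elements([window.TimestampedValue(('clicks', 1), 10)])  # on time
                .advance_watermark_to(70)  # watermark passes the end of window [0, 60)
                .add_elements([window.TimestampedValue(('clicks', 1), 20)])  # behind the watermark: late
                .advance_watermark_to_infinity())

      with beam.Pipeline(options=options) as pipeline:
          (pipeline
           | stream
           | beam.WindowInto(window.FixedWindows(60))  # default trigger, allowed_lateness=0
           | beam.CombinePerKey(sum)
           | beam.Map(print))  # prints ('clicks', 1): the late element is dropped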
  53. Watermarks ● Handled at the source I/O level ● Most pipelines don't need to implement estimation, but do need to be aware of it
  54. Recall Tradeoffs [tradeoff chart: Completeness, Low Latency, Low Cost, each rated from Important to Not Important]
  55. Triggers ● Beam's mechanism for controlling tradeoffs ● Describe when to emit the aggregated results of a single window ● Allow emitting early results or results including late data
  56. Types of Triggers ● Event Time Triggers ● Processing Time Triggers ● Data-Driven Triggers ● Composite Triggers
  57. Set on windows
      pcollection | WindowInto(
          FixedWindows(1 * 60),
          trigger=AfterProcessingTime(1 * 60),
          accumulation_mode=AccumulationMode.DISCARDING)
  58. Example Triggers
      ● AfterProcessingTime(delay=1 * 60)
      ● AfterCount(1)
      ● AfterWatermark(
            early=AfterProcessingTime(delay=1 * 60),
            late=AfterCount(1))
      ● AfterAny(AfterCount(1), AfterProcessingTime(delay=1 * 60))
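
Slide 58's composite trigger wired into a complete, if toy, pipeline; a hedged sketch in which the single input element, the allowed_lateness value, and the accumulation mode are assumptions. It emits an early pane every 60s of processing time, a pane when the watermark passes the window, and one pane per late element for up to 10 minutes of lateness:

      import apache_beam as beam
      from apache_beam.transforms import trigger, window

      with beam.Pipeline() as pipeline:
          (pipeline
           | beam.Create([('clicks', 10)])  # toy input: (key, event-time seconds)
           | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
           | beam.WindowInto(
               window.FixedWindows(60),
               trigger=trigger.AfterWatermark(
                   early=trigger.AfterProcessingTime(delay=1 * 60),  # early firings
                   late=trigger.AfterCount(1)),                      # one firing per late element
               allowed_lateness=10 * 60,  # keep window state 10 min past the watermark
               accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
           | beam.CombinePerKey(sum)
           | beam.Map(print))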
  59. Accumulation Mode ● Describes how to handle data that has already been emitted ● 2 types: Accumulating and Discarding
  60-63. Discarding Accumulation Mode
      pcollection | WindowInto(
          FixedWindows(1 * 60),
          trigger=Repeatedly(AfterCount(3)),
          accumulation_mode=AccumulationMode.DISCARDING)
      [5, 8, 3, 1, 2, 6, 9, 7] -> [5, 8, 3], then [1, 2, 6], then [9, 7]
  64-67. Accumulating Accumulation Mode
      pcollection | WindowInto(
          FixedWindows(1 * 60),
          trigger=Repeatedly(AfterCount(3)),
          accumulation_mode=AccumulationMode.ACCUMULATING)
      [5, 8, 3, 1, 2, 6, 9, 7] -> [5, 8, 3], then [5, 8, 3, 1, 2, 6], then [5, 8, 3, 1, 2, 6, 9, 7]
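
A side-by-side sketch of the two modes using the Beam Python spellings (Repeatedly and the upper-case AccumulationMode constants); the windowing and trigger mirror slides 60-67:

      import apache_beam as beam
      from apache_beam.transforms import trigger, window

      def counted_window(mode):
          # Same windowing and trigger as the slides; only the accumulation mode varies.
          return beam.WindowInto(
              window.FixedWindows(1 * 60),
              trigger=trigger.Repeatedly(trigger.AfterCount(3)),
              accumulation_mode=mode)

      # DISCARDING: each firing emits only the elements that arrived since the last firing.
      discarding = counted_window(trigger.AccumulationMode.DISCARDING)
      # ACCUMULATING: each firing re-emits everything seen so far in the window.
      accumulating = counted_window(trigger.AccumulationMode.ACCUMULATING)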
  68. More! ● Pipeline State ● Timers ● Runner-initiated splits ● Self-checkpointing ● Bundle finalization
  69. Demo https://github.com/damccorm/ato-demo-2022
  70. Come join our community!
  71. Questions? Slides - shorturl.at/GNU07

Editor's Notes

  • My path:
    Studied at Vanderbilt, got Bachelors + Masters
    Joined Microsoft + worked on Azure DevOps - first started to fall in love with OSS here. Particularly shaped by experiences w/ big OSS repos (GulpJs, Prettier) - maintainers matter!
    Got to work on GitHub Actions, helped v2 GA, authored most first party actions (setup-node, toolkit)
    Joined Google to work on Apache Beam and Google’s execution engine, Dataflow. Currently Apache committer and the technical lead of Google’s Beam and Dataflow Machine Learning team. Neat to be part of a bigger community driven project, where decisions are made on the distribution list, not in company meetings. Full circle; I hope to be like those initial OSS maintainers who welcomed me into open source.
  • Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
  • Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
  • Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
  • Usually not used in streaming scenarios, unless you’re using specific triggering setups
  • Call out that it's easy to change your aggregation strategy
  • Lots of data will be considered late
  • Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
  • Set when you window
  • Examples:
    Event time - AfterWatermark
    Processing time - AfterProcessingTime (early firing)
    AfterCount
  • Highlight areas of growth (ML, x-lang, performance, new SDKs)
