Presented at All Things Open 2022
Presented by Danny McCormick
Title: Streaming Data Pipelines With Apache Beam
Abstract: Handling big data presents big problems. Along with traditional concerns like scalability and performance, the increasingly common need for live streaming data processing introduces problems like late or incomplete data from flaky data sources. Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines that addresses these challenges. Using one of the open source Beam SDKs, you can build a program that defines a pipeline to be executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
This talk will explore some problems associated with processing large datasets at scale and how you can write Apache Beam pipelines that address those issues. It will include a demo of a basic Beam streaming pipeline.
Takeaways: an understanding of some challenges associated with large datasets, the Apache Beam model, and how to write a basic Beam streaming pipeline
Audience: anyone dealing with big datasets or interested in data processing at scale.
8. Then came Flume (and Spark, Flink, and many more)
[Diagram: a Flume-style pipeline DAG of Datastore, Map, Group by Key (Reduce), and Combine steps reading from and writing back to Datastores]
9. From Flume came Beam
[Diagram: the same pipeline DAG expressed in the Beam model: Datastores, Maps, Group by Key (Reduce), and Combines]
10. Unified Model for Batch and Streaming
● Batch processing is a special case of stream processing
● Batch + Stream = Beam
14. Terms
● PCollection - distributed multi-element dataset
● Transform - operation that takes N PCollections and produces M PCollections
● Pipeline - directed acyclic graph of Transforms and PCollections
16. Basic Beam Pipeline
import apache_beam as beam

# Each element read from the text file is a line of text, so parse it
# before adding one.
def add_one(element):
    return int(element) + 1

with beam.Pipeline() as pipeline:
    (pipeline
     | beam.io.ReadFromText('gs://some/inputData.txt')
     | beam.Map(add_one)
     | beam.io.WriteToText('gs://some/outputData'))
[Diagram: Read from text file → Map transform → Write to text file]
17. How to use Beam to process huge amounts of streaming data
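A hedged sketch of what that can look like in code: the same Map-based pipeline as before, but reading from an unbounded source (Pub/Sub here; the topic names are hypothetical) and running in streaming mode.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def add_one(element):
    return int(element) + 1

# streaming=True tells the runner this pipeline is unbounded and should
# run until cancelled.
options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-input')
     | beam.Map(lambda msg: add_one(msg.decode('utf-8')))
     | beam.Map(lambda n: str(n).encode('utf-8'))
     | beam.io.WriteToPubSub('projects/my-project/topics/my-output'))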
28. Sliding Windows
[Diagram: overlapping sliding windows along a timeline, each aggregated or output]
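Sliding windows are set the same way as other window types; a minimal sketch, assuming an illustrative 60-second window size with a new window starting every 30 seconds, so windows overlap and an element can fall into several.
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

# 60-second windows, with a new window starting every 30 seconds.
windowed = pcollection | beam.WindowInto(SlidingWindows(size=60, period=30))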
39. Watermarks
● Beam’s notion of when data is complete
● When a watermark passes the end of a window, additional data is late
● Beam has several built-in watermark estimators
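A minimal sketch of one way to handle that late data, assuming WindowInto's allowed_lateness parameter and illustrative values: elements arriving up to ten minutes after the watermark passes the end of their one-minute window are still processed instead of dropped.
from apache_beam import WindowInto
from apache_beam.transforms.window import FixedWindows

# Keep late elements for up to 10 minutes (600 seconds) past the watermark.
pcollection | WindowInto(
    FixedWindows(1 * 60),
    allowed_lateness=10 * 60)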
55. Triggers
● Beam’s mechanism for controlling the tradeoff between latency and completeness
● Describe when to emit the aggregated results of a single window
● Allow emitting early results or results that include late data
56. Types of Triggers
● Event Time Triggers
● Processing Time Triggers
● Data-Driven Triggers
● Composite Triggers
57. Set on windows
from apache_beam import WindowInto
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import AfterProcessingTime, AccumulationMode

pcollection | WindowInto(
    FixedWindows(1 * 60),
    trigger=AfterProcessingTime(1 * 60),
    accumulation_mode=AccumulationMode.DISCARDING)
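For comparison, a hedged sketch combining the other trigger types from slide 56: an event-time AfterWatermark trigger with an early processing-time firing and a late data-driven firing. The window size, delay, and counts are illustrative assumptions, not values from the talk.
from apache_beam import WindowInto
from apache_beam.transforms.window import FixedWindows
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

pcollection | WindowInto(
    FixedWindows(1 * 60),
    trigger=AfterWatermark(
        early=AfterProcessingTime(30),  # speculative firing after 30s of processing time
        late=AfterCount(1)),            # fire again for each late element
    accumulation_mode=AccumulationMode.ACCUMULATING)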
My path:
Studied at Vanderbilt, got Bachelor's + Master's
Joined Microsoft + worked on Azure DevOps - first started to fall in love with OSS here. Particularly shaped by experiences w/ big OSS repos (GulpJs, Prettier) - maintainers matter!
Got to work on GitHub Actions, helped v2 GA, authored most first-party actions (setup-node, toolkit)
Joined Google to work on Apache Beam and Google's execution engine, Dataflow. Currently an Apache committer and the technical lead of Google's Beam and Dataflow Machine Learning team. Neat to be part of a bigger community-driven project, where decisions are made on the mailing list, not in company meetings. Full circle; I hope to be like those initial OSS maintainers who welcomed me into open source.
Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
Usually not used in streaming scenarios, unless you’re using specific triggering setups
Call out it's easy to change your aggregation strategy
Lots of data will be considered late
Slide adapted from https://docs.google.com/presentation/d/1SHie3nwe-pqmjGum_QDznPr-B_zXCjJ2VBDGdafZme8/edit#slide=id.g12846a6162_0_2098
Set when you window
Examples:
Event time - AfterWatermark
Processing time - AfterProcessingTime (early firing)
Data-driven - AfterCount
Highlight areas of growth (ML, x-lang, performance, new SDKs)