
Apache Beam: A unified model for batch and stream processing data


  1. Abstract: Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtime environments, both open source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in the programming model. During the talk, we'll argue why Beam is unified, efficient, and portable.
  2. Davor Bonaci, Apache Beam PPMC; Software Engineer, Google Inc. Apache Beam: A Unified Model for Batch and Streaming Data Processing. Hadoop Summit, June 28-30, 2016, San Jose, CA
  3. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  4. What is Apache Beam? 1. The Beam Model: What / Where / When / How. 2. SDKs for writing Beam pipelines: Java and Python. 3. Runners for existing distributed processing backends: Apache Flink, Apache Spark, Google Cloud Dataflow, and a local runner for testing. [Diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines; Fn runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  5. The Evolution of Beam: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, and MillWheel → Google Cloud Dataflow → Apache Beam
  6. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  7. Data...
  8. ...can be big...
  9. ...really, really big... [Figure: data accumulating across Tuesday, Wednesday, Thursday]
  10. ... maybe infinitely big... [Figure: an unbounded stream along a processing-time axis, 8:00-14:00]
  11. ... with unknown delays. [Figure: elements with 8:00 event times arriving well after later data]
  12. Formalizing Event-Time Skew. [Figure: processing time vs. event time; the ideal line vs. the actual, skewed reality.]
  13. Formalizing Event-Time Skew. Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen." Often heuristic-based. Too slow? Results are delayed. Too fast? Some data is late. [Figure: processing time vs. event time, with the ideal line and an approximate watermark.]
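The watermark idea on this slide can be sketched in plain Java (this is an illustrative toy, not the Beam API; the class and its fixed-skew heuristic are made up for the example): the watermark trails the maximum observed event time by an assumed bound on skew, and data behind it is late.

```java
// Illustrative sketch of a heuristic watermark (not Beam code).
class HeuristicWatermark {
    private long maxEventTime = Long.MIN_VALUE;
    private final long allowedSkewMillis; // assumed bound on event-time skew

    HeuristicWatermark(long allowedSkewMillis) {
        this.allowedSkewMillis = allowedSkewMillis;
    }

    // Advance as elements arrive in processing-time order.
    void observe(long eventTimeMillis) {
        maxEventTime = Math.max(maxEventTime, eventTimeMillis);
    }

    // The heuristic claim: "no timestamp earlier than this will be seen."
    long current() {
        return maxEventTime == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTime - allowedSkewMillis;
    }

    // An element is late if its event time is already behind the watermark.
    boolean isLate(long eventTimeMillis) {
        return eventTimeMillis < current();
    }
}
```

Because the skew bound is only a guess, the two failure modes on the slide fall out directly: a large `allowedSkewMillis` delays results, a small one makes more data late.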
  14. What are you computing? Where in event time? When in processing time? How do refinements relate?
  15. What are you computing? Element-wise, aggregating, and composite transformations.
  16. What: Computing Integer Sums
     // Collection of raw log lines
     PCollection<String> raw = IO.read(...);
     // Element-wise transformation into team/score pairs
     PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()));
     // Composite transformation containing an aggregation
     PCollection<KV<String, Integer>> scores = input.apply(Sum.integersPerKey());
  17. What: Computing Integer Sums
  18. What: Computing Integer Sums
  19. Where in event time? Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. [Diagram: fixed, sliding, and session windows across keys over time.]
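Conceptually, fixed windowing assigns each element to a window by flooring its event timestamp to a multiple of the window size; a minimal plain-Java sketch of that idea (illustrative names, not the Beam API):

```java
// Illustrative sketch: which fixed window does an event timestamp fall into?
class FixedWindowAssign {
    // Returns the start of the fixed window containing the timestamp.
    static long windowStart(long eventTimeMillis, long sizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, sizeMillis);
    }
}
```

Sliding windows assign each element to several such windows, and sessions merge per-key windows that fall within a gap of each other; the fixed case is the simplest to state.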
  20. Where: Fixed 2-minute Windows
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2))))
       .apply(Sum.integersPerKey());
  21. Where: Fixed 2-minute Windows
  22. When in processing time? Triggers control when results are emitted. Triggers are often relative to the watermark. [Figure: processing time vs. event time, with the ideal line and an approximate watermark.]
  23. When: Triggering at the Watermark
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()))
       .apply(Sum.integersPerKey());
  24. When: Triggering at the Watermark
  25. When: Early and Late Firings
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1))))
       .apply(Sum.integersPerKey());
  26. When: Early and Late Firings
  27. How do refinements relate? How should multiple outputs per window accumulate? The appropriate choice depends on the consumer.

     Firing mode         | Speculative [3] | Watermark [5, 1] | Late [2] | Last Observed | Total Observed
     Discarding          | 3               | 6                | 2        | 2             | 11
     Accumulating        | 3               | 9                | 11       | 11            | 23
     Acc. & Retracting   | 3               | 9, -3            | 11, -9   | 11            | 11

     (Accumulating & Retracting not yet implemented.)
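The table's numbers for the two implemented modes can be reproduced with a small plain-Java sketch (illustrative only, not Beam code): each firing is the batch of elements that arrived since the previous one.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of per-window pane output under the two
// implemented accumulation modes.
class PaneAccumulation {
    // Discarding: each firing emits only the sum of elements since the last firing.
    static int[] discarding(List<int[]> firings) {
        return firings.stream()
                .mapToInt(f -> Arrays.stream(f).sum())
                .toArray();
    }

    // Accumulating: each firing emits the running total of all elements so far.
    static int[] accumulating(List<int[]> firings) {
        int running = 0;
        int[] out = new int[firings.size()];
        for (int i = 0; i < firings.size(); i++) {
            running += Arrays.stream(firings.get(i)).sum();
            out[i] = running;
        }
        return out;
    }
}
```

With the firings from the table, [3], [5, 1], [2], discarding yields 3, 6, 2 and accumulating yields 3, 9, 11, matching the first two table rows; retracting additionally emits the negation of the previous pane.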
  28. How: Add Newest, Remove Previous
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  29. How: Add Newest, Remove Previous
  30. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  31. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  32. Distributed Systems are Distributed
  33. Event Time Results are Stable
  34. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  35. Sessions
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  36. Identifying Bursts of User Activity
  37. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  38. 1. Classic Batch. 2. Batch with Fixed Windows. 3. Streaming. 4. Streaming with Speculative + Late Data. 5. Streaming with Retractions. 6. Sessions
  39. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  40. 1. Classic Batch:
     PCollection<KV<String, Integer>> scores = input
       .apply(Sum.integersPerKey());

     2. Batch with Fixed Windows:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2))))
       .apply(Sum.integersPerKey());

     3. Streaming:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()))
       .apply(Sum.integersPerKey());

     4. Streaming with Speculative + Late Data:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1))))
       .apply(Sum.integersPerKey());

     5. Streaming with Retractions:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());

     6. Sessions:
     PCollection<KV<String, Integer>> scores = input
       .apply(Window.into(Sessions.withGapDuration(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1)))
           .withLateFirings(AtCount(1)))
         .accumulatingAndRetractingFiredPanes())
       .apply(Sum.integersPerKey());
  41. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  42. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  43. Workloads in pipelines vary over time: batch pipelines go through stages, and a streaming pipeline's input varies. [Charts: workload over time.]
  44. Perils of fixed decisions: over-provisioned for the worst case, or under-provisioned for the average case. [Charts: workload over time.]
  45. Ideal case. [Chart: provisioning tracks workload over time.]
  46. Solution: bundles. User code operates on bundles of elements: easy parallelization, dynamic sizing, and parallelism decisions in the runner's hands.
     class MyDoFn extends DoFn<String, String> {
       void startBundle(...) { }
       void processElement(...) { }
       void finishBundle(...) { }
     }
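The bundle lifecycle this slide describes can be sketched as a tiny plain-Java mini-runner (illustrative only; `SimpleDoFn` and `BundleRunner` are made-up names, not Beam's execution machinery): the runner decides the bundle, then drives startBundle once, processElement per element, and finishBundle once.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative DoFn-like interface: one bundle-start hook, one per-element
// hook, one bundle-finish hook.
interface SimpleDoFn<I, O> {
    default void startBundle(List<O> out) {}
    void processElement(I element, List<O> out);
    default void finishBundle(List<O> out) {}
}

// Illustrative runner: because bundle boundaries belong to the runner, it can
// size bundles dynamically and process different bundles in parallel.
class BundleRunner {
    static <I, O> List<O> runBundle(SimpleDoFn<I, O> fn, List<I> bundle) {
        List<O> out = new ArrayList<>();
        fn.startBundle(out);
        for (I e : bundle) {
            fn.processElement(e, out);
        }
        fn.finishBundle(out);
        return out;
    }
}
```

For example, a `SimpleDoFn` that uppercases strings can be run over any bundle the runner chooses, and the user code never sees where the bundle boundaries fell.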
  47. The Straggler Problem. Work is unevenly distributed across tasks, due to both the underlying data and the processing itself; the effects are multiplied per stage. [Chart: per-worker completion times.]
  48. "Standard" workarounds for stragglers: split files into equal sizes? Pre-emptively over-split? Detect slow workers and re-execute? Sample extensively and then split? [Chart: per-worker completion times.]
  49. No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.
  50. Solution: Dynamic Work Rebalancing. [Diagram: done, active, and predicted work per worker; splitting a straggler's remaining work now pulls the average completion time earlier.]
  51. Solution: Dynamic work rebalancing
     class MyReader extends BoundedReader<T> {
       [...]
       getFractionConsumed() { }
       splitAtFraction(...) { }
     }
     class MySource extends BoundedSource<T> {
       [...]
       splitIntoBundles() { }
       getEstimatedSizeBytes() { }
     }
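The two reader methods on this slide can be illustrated with a plain-Java sketch over a dense integer range (illustrative only, with simplified semantics; not the real BoundedReader contract): the reader reports how far through its range it is, and a runner can ask it to give away the unread tail at some fraction.

```java
// Illustrative sketch of split-at-fraction over a dense [start, end) range.
class RangeReader {
    private final long start;
    private long end;      // exclusive; shrinks when the range is split
    private long current;  // position of the last consumed element

    RangeReader(long start, long end) {
        this.start = start;
        this.end = end;
        this.current = start;
    }

    // Consume the next element, if any remain in the (possibly shrunken) range.
    boolean advance() {
        if (current >= end) return false;
        current++;
        return true;
    }

    // How much of the reader's current range has been consumed.
    double getFractionConsumed() {
        return (double) (current - start) / (end - start);
    }

    // Give away the tail of the range starting at the given fraction.
    // Returns the split position, or -1 if the reader is already past it.
    long splitAtFraction(double fraction) {
        long splitPos = start + (long) Math.ceil(fraction * (end - start));
        if (splitPos <= current || splitPos >= end) return -1;
        end = splitPos; // this reader keeps [start, splitPos)
        return splitPos;
    }
}
```

This is what lets a runner react at run-time: when one worker lags, its reader's remaining range is split and the tail is handed to an idle worker, rather than hoping upfront splits were sized correctly.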
  52. Real world example
  53. Dynamic bundles + work re-balancing + autoscaling
  54. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  55. Apache Beam Architecture. 1. Write: choose an SDK to write your pipeline in. 2. Execute: choose any runner at execution time. [Diagram: Beam Java, Beam Python, and other-language SDKs construct pipelines; Fn runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  56. Categorizing Runner Capabilities: http://beam.incubator.apache.org/capability-matrix/
  57. Multiple categories of users. 1. End users, who want to write pipelines or transform libraries in a language that's familiar. 2. SDK writers, who want to make Beam concepts available in new languages. 3. Runner writers, who have a distributed processing environment and want to support Beam pipelines. [Diagram: SDKs construct pipelines; runners execute them on Apache Flink, Apache Spark, and Google Cloud Dataflow.]
  58. Growing the Open Source Community. If you have Big Data APIs, write a Beam SDK, DSL, or library of transformations. If you have a distributed processing backend, write a Beam runner! If you have a data storage or messaging system, write a Beam IO connector!
  59. Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines
  60. Visions are a Journey. 02/01/2016: enter Apache Incubator. 02/25/2016: 1st commit to ASF repository. Early 2016: design for use cases, begin refactoring. Mid 2016: additional refactoring, non-production uses. 06/14/2016: 1st incubating release. June 2016: Python SDK moves to Beam. Late 2016: multiple runners execute Beam pipelines. End 2016: Beam pipelines run on many runners in production uses.
  61. Learn More! Apache Beam (incubating): http://beam.incubator.apache.org. The World Beyond Batch 101 & 102: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 and https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102. Join the Beam mailing lists: user-subscribe@beam.incubator.apache.org and dev-subscribe@beam.incubator.apache.org. Follow @ApacheBeam on Twitter.
  62. Thank you!
  63. Extra material
  64. Element-wise transformations. [Figure: elements transformed one by one along a processing-time axis, 8:00-14:00.]
  65. Aggregating via Processing-Time Windows. [Figure: elements grouped by arrival time along a processing-time axis, 8:00-14:00.]
  66. Aggregating via Event-Time Windows. [Figure: input in processing time mapped to output windows in event time, 10:00-15:00.]
  67. Processing Time Results Differ
  68. Identifying Bursts of User Activity
  69. Correctness, Power, Composability, Flexibility, Modularity: What / Where / When / How
  70. Calculating Session Lengths
     input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark())
         .discardingFiredPanes())
       .apply(CalculateWindowLength());
  71. Calculating the Average Session Length
     input
       .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
         .triggering(AtWatermark())
         .discardingFiredPanes())
       .apply(CalculateWindowLength())
       .apply(Window.into(FixedWindows.of(Minutes(2)))
         .triggering(AtWatermark()
           .withEarlyFirings(AtPeriod(Minutes(1))))
         .accumulatingFiredPanes())
       .apply(Mean.globally());
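The session-length computation in these snippets can be approximated in plain Java (illustrative only; real Beam merges session windows per key and per pipeline, here we just scan a sorted list of one user's event timestamps): a session closes once the gap since the last event reaches the gap duration, and its length is last event minus first.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of sessionization by gap duration over one key's
// sorted event timestamps.
class SessionLengths {
    static List<Long> lengths(List<Long> sortedTimes, long gapMillis) {
        List<Long> out = new ArrayList<>();
        if (sortedTimes.isEmpty()) return out;
        long first = sortedTimes.get(0);
        long last = first;
        for (long t : sortedTimes.subList(1, sortedTimes.size())) {
            if (t - last >= gapMillis) {
                // Gap reached: close the current session.
                out.add(last - first);
                first = t;
            }
            last = t;
        }
        out.add(last - first); // close the final session
        return out;
    }
}
```

The second pipeline on the slide then just averages these per-session lengths inside fixed windows, which is why it can re-window the session-length outputs with `FixedWindows` before applying `Mean.globally()`.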
