Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing with Apache Beam and Google Cloud Dataflow - Eric Anderson, Product Manager - Google
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters, and to the Google Cloud Dataflow service, to give the audience a feel for the portability of Beam, a new portable big data processing framework recently submitted by Google to the Apache Software Foundation. The talk also looks at how the programming model handles late-arriving data in a stream with event time, windows, and triggers.
1. Motivation for and Introduction to Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau
2. History of distributed data processing
MapReduce
BigTable, Dremel, Colossus
Flume, Megastore, Spanner
PubSub
MillWheel
3. What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
4. History of distributed data processing
MapReduce
BigTable, Dremel, Colossus
Flume, Megastore, Spanner
PubSub
MillWheel
Apache Beam
Google Cloud Dataflow
5. What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
6. Integrated: Part of Google Cloud Platform
[Diagram: Capture → Store → Process → Analyze]
Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, Cloud Monitoring
Store: Cloud Storage (files), Cloud Datastore, Cloud Bigtable (NoSQL), BigQuery Storage (tables)
Process (batch and stream): Cloud Dataflow, Cloud Dataproc
Analyze: BigQuery Analytics (SQL); Cloud Bigtable for real-time analytics and alerts
9. Motivation for Beam and Dataflow
Dataflow contains Google’s latest
approaches to streaming ops,
straggler mitigation and
autoscaling.
Beam SDK provides an IO API that
plumbs signals required for next
generation data processing.
Can be extended to new sources
or runners (execution framework).
Apache Beam
10. How to update/upgrade a
production streaming pipeline?
Without...
● losing data in pipeline
● missing new data during
downtime
compromising de-dupe
guarantees
● impacting write pattern
Streaming Ops
11. Streaming Ops
Common approaches
Parallel swap
● Deploy a 2nd pipeline
● Swap traffic over
● Take down the 1st
Stop & restart
● Stop pulling from sources
● Drain pipeline; then stop it
● Start new pipeline
● Resume sources
Both approaches are judged on: preserving data in the pipeline, avoiding missing new data, de-dupe guarantees, and preserving the write pattern.
12. Streaming Ops
Ideal approach
Update in-place
● Submit a new job with the same name and --update flag
Judged on the same criteria as parallel swap and stop & restart: preserving data in the pipeline, avoiding missing new data, de-dupe guarantees, and preserving the write pattern.
Status
● It works!
● ...for minor pipeline changes
● Included in Dataflow
13. Straggling workers
A system is only as fast as its slowest worker
Workers lag because:
● Variable machines
● Unequal work
○ Poor distribution of elements
○ Variable work per element
Fixes?
● Filter bad machines?
● Backup / speculative execution?
14. Straggling workers
Fixes for variable machines
● Filter bad machines
○ Test each machine at spin up
and keep the good ones
● Backup / speculative execution
○ Assign long running work
twice to a backup worker and
keep whatever finishes first
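The backup/speculative-execution fix above can be sketched in a few lines of illustrative Python (not Dataflow internals): the same piece of work is submitted to two workers and whichever finishes first wins. The `slow_on_worker_0` task is a made-up example that simulates one straggling machine.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
import time

def run_speculatively(task, n_copies=2):
    """Run the same task on several workers; keep whichever finishes first."""
    with ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task, i) for i in range(n_copies)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return done.pop().result()

def slow_on_worker_0(worker_id):
    # Simulate a straggling machine: worker 0 is 10x slower.
    time.sleep(0.5 if worker_id == 0 else 0.05)
    return f"result from worker {worker_id}"

print(run_speculatively(slow_on_worker_0))  # the fast copy's result wins
```

The trade-off, of course, is that speculative execution burns extra resources running duplicate work.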
15. Dynamic Work Rebalancing
Identify where work can be split
Dynamically rebalance it onto another worker
Make sure everything is processed exactly once
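The splitting step can be sketched as follows (an illustrative Python toy, assuming work is a simple index range, which is not how Dataflow represents work internally): a busy worker gives away half of its unprocessed remainder to an idle worker.

```python
def split_range(position, end):
    """Split the unprocessed remainder of a [position, end) range in half.

    Returns (new_end_for_current_worker, residual_range_for_new_worker).
    Each element stays in exactly one range, so it is processed exactly once.
    """
    remaining = end - position
    if remaining < 2:
        return end, None  # too little left to be worth splitting
    split_point = position + remaining // 2
    return split_point, (split_point, end)

# A worker at position 30 of [0, 100) hands off [65, 100) to an idle worker.
new_end, residual = split_range(30, 100)
print(new_end, residual)  # 65 (65, 100)
```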
17. Scaling clusters & jobs
Traditional model: static set of machines
● Allocate a data cluster
● Deploy jobs to the cluster
● Compute coupled to storage
New model: cloud is infinite and ephemeral
● Separate compute from storage?
● Dynamic allocation to cluster?
● Dynamic allocation to job?
What signals to scale on?
What work do you give new workers?
18. Scaling clusters & jobs
Common approach
● Framework scales the job; cloud provider scales the cluster
● What signals to use?
● How to allocate work to new machines?
Ideal approach
● Job and cluster are 1:1 (job = cluster)
● Machines spin up/down fast
● Data-processing-specific signals
● Can split work on the fly
19. Scaling clusters & jobs
Status
● Signals provided by Beam SDK:
○ How much work there is to do (backlog)
○ How parallelizable is it
● Works great in batch (see chart on slide)
● Streaming progressing well
● Only think in terms of jobs
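The backlog signal above lends itself to a simple scaling rule. The sketch below is illustrative Python, not Dataflow's actual autoscaling algorithm; the formula and parameter names (`target_catchup_secs`, etc.) are assumptions for the example: size the worker pool so the backlog drains within a target time, capped by how parallelizable the work is.

```python
import math

def desired_workers(backlog_bytes, per_worker_throughput, target_catchup_secs,
                    max_parallelism):
    """Estimate workers needed to drain the backlog within a target time,
    capped by how far the remaining work can actually be split."""
    needed = math.ceil(
        backlog_bytes / (per_worker_throughput * target_catchup_secs))
    return max(1, min(needed, max_parallelism))

# 10 GB backlog, 50 MB/s per worker, drain within 60 s, up to 100 splits:
print(desired_workers(10_000_000_000, 50_000_000, 60, 100))  # 4
```

Note the cap matters: with only 2 splittable pieces of work, adding a third worker would not help.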
32. FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data
processing abstractions.
○ Focus on what you want to do to
your data, not what the
underlying system supports.
● A graph of transformations is
automatically transformed into an
optimized series of MapReduces.
37. MillWheel: Streaming Computations
● Framework for building low-latency
data-processing applications
● User provides a DAG of
computations to be performed
● System manages state and
persistent flow of elements
40. Streaming Patterns: Event-Time Based Windows
[Diagram: input elements plotted by processing time vs. event time (10:00–15:00); output aggregated into event-time windows]
41. Streaming Patterns: Session Windows
[Diagram: the same processing-time vs. event-time axes (10:00–15:00), with output grouped into per-key session windows in event time]
42. Formalizing Event-Time Skew
Watermarks describe event time
progress.
"No timestamp earlier than the
watermark will be seen"
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
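A common heuristic of this kind can be sketched in illustrative Python (an assumption for the example, not Beam's watermark implementation): trail the maximum event time seen so far by the skew you expect between event time and arrival time. Anything that arrives with a timestamp behind the watermark is late.

```python
def heuristic_watermark(seen_event_times, max_expected_skew):
    """Trail the max event time seen so far by the expected skew."""
    return max(seen_event_times) - max_expected_skew

arrivals = [100, 105, 103, 110]          # event timestamps, in arrival order
wm = heuristic_watermark(arrivals, 5)    # watermark = 110 - 5 = 105
late = [t for t in [104, 106] if t < wm] # 104 arrives behind the watermark
print(wm, late)  # 105 [104]
```

The slide's trade-off shows up directly in `max_expected_skew`: a larger value delays results (too slow), a smaller one lets more data land late (too fast).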
45. What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
46. What are you computing?
What Where When How
Element-Wise Aggregating Composite
47. What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
input.apply(Sum.integersPerKey());
What Where When How
*All code snippets are pseudo-java -- details shortened or elided for clarity.
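The same two-stage shape (element-wise transform, then per-key aggregation) can be sketched in plain Python to show what the pseudo-Java computes; `parse` and the comma-separated line format are assumptions for the example, not the talk's actual `ParseFn`.

```python
from collections import defaultdict

def parse(line):
    """Element-wise transform: raw log line -> (team, score) pair."""
    team, score = line.split(",")
    return team, int(score)

def sum_per_key(pairs):
    """Aggregation: sum scores per team (what Sum.integersPerKey computes)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw = ["blue,3", "red,5", "blue,1", "red,2"]
print(sum_per_key(map(parse, raw)))  # {'blue': 4, 'red': 7}
```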
50. Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
Where in event time?
What Where When How
[Diagram: fixed, sliding, and session windows chunking elements for keys 1–3 along the time axis]
51. Where: Fixed 2-minute Windows
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
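Conceptually, fixed windowing just extends the aggregation key with a window: each element's timestamp is mapped to its 2-minute window, and sums are kept per (key, window). The Python sketch below is illustrative of that idea, not of Beam's windowing machinery.

```python
def fixed_window(event_time_secs, window_size_secs=120):
    """Assign a timestamp to its fixed window [start, end), in seconds."""
    start = (event_time_secs // window_size_secs) * window_size_secs
    return (start, start + window_size_secs)

def sum_per_key_and_window(events):
    """events: (team, score, event_time) triples; key becomes (team, window)."""
    totals = {}
    for team, score, ts in events:
        key = (team, fixed_window(ts))
        totals[key] = totals.get(key, 0) + score
    return totals

events = [("blue", 3, 10), ("blue", 1, 130), ("red", 5, 30)]
print(sum_per_key_and_window(events))
# {('blue', (0, 120)): 3, ('blue', (120, 240)): 1, ('red', (0, 120)): 5}
```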
53. When in processing time?
What Where When How
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
54. When: Triggering at the Watermark
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
56. When: Early and Late Firings
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
58. How do refinements relate?
What Where When How
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing         | Elements | Discarding | Accumulating | Acc. & Retracting*
Speculative    | 3        | 3          | 3            | 3
Watermark      | 5, 1     | 6          | 9            | 9, -3
Late           | 2        | 2          | 11           | 11, -9
Total Observed | 11       | 11         | 23           | 11

*Accumulating & Retracting not yet implemented in Apache Beam.
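The table's three modes can be reproduced with a short illustrative Python sketch (a toy simulation, not Beam code): each firing sums its pane's elements, and the mode decides whether earlier results are dropped, folded in, or folded in with a retraction of the previous output.

```python
def pane_outputs(firings, mode):
    """Per-firing outputs for one window whose trigger fires several times,
    under each accumulation mode from the table above."""
    outputs, running = [], 0
    for elements in firings:
        pane_sum = sum(elements)
        if mode == "discarding":
            outputs.append([pane_sum])                 # each pane stands alone
        elif mode == "accumulating":
            outputs.append([running + pane_sum])       # refine the running total
        elif mode == "retracting":
            pane = [running + pane_sum] + ([-running] if running else [])
            outputs.append(pane)                       # refine + retract previous
        running += pane_sum
    return outputs

firings = [[3], [5, 1], [2]]  # speculative, at-watermark, late (as in the table)
print(pane_outputs(firings, "discarding"))    # [[3], [6], [2]]
print(pane_outputs(firings, "accumulating"))  # [[3], [9], [11]]
print(pane_outputs(firings, "retracting"))    # [[3], [9, -3], [11, -9]]
```

Summing each column of outputs reproduces the "Total Observed" row: 11, 23, and 11.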
59. How: Add Newest, Remove Previous
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(Sessions.withGapDuration(Duration.standardMinutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
61. Customizing What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
What Where When How
63. The Dataflow Model becomes Beam & Cloud Dataflow
Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes.
Google Cloud Dataflow: a great place to run Beam.
64. What is Part of Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for existing distributed processing backends
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing
65. Apache Beam Technical Vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: SDKs (Beam Java, Beam Python, other languages) feed the Beam Model's pipeline-construction layer; the Beam Model's Fn Runners layer hands pipelines to runners (Runner A, Runner B, Cloud Dataflow) for execution]
67. Apache Beam Roadmap
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Internal API redesign (slight chaos)
Mid 2016: API stabilization
Late 2016: Multiple runners execute Beam pipelines
68. Learn More!
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
69. Motivation for Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau