Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing with Apache Beam and Google Cloud Dataflow - Eric Anderson, Product Manager - Google
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters, and to the Google Cloud Dataflow service, to give the audience a feel for the portability of Beam, a new portable big data processing framework recently submitted by Google to the Apache Software Foundation. The talk also looks at how the programming model handles late-arriving data in a stream with event time, windows, and triggers.
1. Motivation for and Introduction to Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau
2. History of distributed data processing
MapReduce
BigTable, Dremel, Colossus
Flume, Megastore, Spanner
PubSub
MillWheel
3. What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
4. History of distributed data processing
MapReduce
BigTable, Dremel, Colossus
Flume, Megastore, Spanner
PubSub
MillWheel
Apache Beam
Google Cloud Dataflow
5. What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
6. Integrated: Part of Google Cloud Platform
[Diagram: Capture → Store → Process → Analyze]
Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, Cloud Monitoring
Store: Cloud Storage (files), Cloud Datastore, Cloud Bigtable (NoSQL), BigQuery Storage (tables)
Process (batch and stream): Cloud Dataflow, Cloud Dataproc
Analyze: BigQuery Analytics (SQL); Cloud Bigtable for real-time analytics and alerts
9. Motivation for Beam and Dataflow
Dataflow contains Google’s latest
approaches to streaming ops,
straggler mitigation and
autoscaling.
Beam SDK provides an IO API that
plumbs signals required for next
generation data processing.
Can be extended to new sources
or runners (execution framework).
Apache Beam
10. How to update/upgrade a
production streaming pipeline?
Without...
● losing data in pipeline
● missing new data during
downtime
compromising de-dupe
guarantees
● impacting write pattern
Streaming Ops
11. Streaming Ops
Common approaches
Parallel swap
● Deploy a 2nd pipeline
● Swap traffic over
● Take down the 1st
Stop & restart
● Stop pulling from sources
● Drain pipeline; then stop it
● Start new pipeline
● Resume sources
Both approaches are judged on: preserving data in the pipeline, avoiding missing new data, de-dupe guarantees, and preserving the write pattern.
12. Streaming Ops
Ideal approach
Update in-place
● Submit a new job with the same name and --update flag
Judged on the same criteria as parallel swap and stop & restart: preserving data in the pipeline, avoiding missing new data, de-dupe guarantees, and preserving the write pattern.
Status
● It works!
● ...for minor pipeline changes
● Included in Dataflow
13. Straggling workers
A system is only as fast as its slowest worker
Workers lag because:
● Variable machines
● Unequal work
○ Poor distribution of elements
○ Variable work per element
Fixes?
● Filter bad machines?
● Backup / speculative execution?
14. Straggling workers
Fixes for variable machines
● Filter bad machines
○ Test each machine at spin up
and keep the good ones
● Backup / speculative execution
○ Assign long running work
twice to a backup worker and
keep whatever finishes first
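The backup/speculative-execution fix above can be sketched in a few lines of illustrative Python (not Dataflow internals): the same piece of work is submitted to two workers and whichever finishes first wins. The `slow_on_worker_0` task is a made-up example that simulates one straggling machine.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
import time

def run_speculatively(task, n_copies=2):
    """Run the same task on several workers; keep whichever finishes first."""
    with ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task, i) for i in range(n_copies)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return done.pop().result()

def slow_on_worker_0(worker_id):
    # Simulate a straggling machine: worker 0 is 10x slower.
    time.sleep(0.5 if worker_id == 0 else 0.05)
    return f"result from worker {worker_id}"

print(run_speculatively(slow_on_worker_0))  # the fast copy's result wins
```

The trade-off, of course, is that speculative execution burns extra resources running duplicate work.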
15. Dynamic Work Rebalancing
Identify where work can be split
Dynamically rebalance it onto another worker
Make sure everything is processed exactly once
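The splitting step can be sketched as follows (an illustrative Python toy, assuming work is a simple index range, which is not how Dataflow represents work internally): a busy worker gives away half of its unprocessed remainder to an idle worker.

```python
def split_range(position, end):
    """Split the unprocessed remainder of a [position, end) range in half.

    Returns (new_end_for_current_worker, residual_range_for_new_worker).
    Each element stays in exactly one range, so it is processed exactly once.
    """
    remaining = end - position
    if remaining < 2:
        return end, None  # too little left to be worth splitting
    split_point = position + remaining // 2
    return split_point, (split_point, end)

# A worker at position 30 of [0, 100) hands off [65, 100) to an idle worker.
new_end, residual = split_range(30, 100)
print(new_end, residual)  # 65 (65, 100)
```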
17. Scaling clusters & jobs
Traditional model: static set of machines
● Allocate a data cluster
● Deploy jobs to the cluster
● Compute coupled to storage
New model: cloud is infinite and ephemeral
● Separate compute from storage?
● Dynamic allocation to cluster?
● Dynamic allocation to job?
What signals to scale on?
What work do you give new workers?
18. Scaling clusters & jobs
Common approach
● Framework scales the job; cloud provider scales the cluster
● What signals to use?
● How to allocate work to new machines?
Ideal approach
● Job and cluster are 1:1 (job = cluster)
● Machines spin up/down fast
● Data-processing-specific signals
● Can split work on the fly
19. Scaling clusters & jobs
Status
● Signals provided by Beam SDK:
○ How much work there is to do (backlog)
○ How parallelizable is it
● Works great in batch (see chart on slide)
● Streaming progressing well
● Only think in terms of jobs
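The backlog signal above lends itself to a simple scaling rule. The sketch below is illustrative Python, not Dataflow's actual autoscaling algorithm; the formula and parameter names (`target_catchup_secs`, etc.) are assumptions for the example: size the worker pool so the backlog drains within a target time, capped by how parallelizable the work is.

```python
import math

def desired_workers(backlog_bytes, per_worker_throughput, target_catchup_secs,
                    max_parallelism):
    """Estimate workers needed to drain the backlog within a target time,
    capped by how far the remaining work can actually be split."""
    needed = math.ceil(
        backlog_bytes / (per_worker_throughput * target_catchup_secs))
    return max(1, min(needed, max_parallelism))

# 10 GB backlog, 50 MB/s per worker, drain within 60 s, up to 100 splits:
print(desired_workers(10_000_000_000, 50_000_000, 60, 100))  # 4
```

Note the cap matters: with only 2 splittable pieces of work, adding a third worker would not help.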
32. FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data
processing abstractions.
○ Focus on what you want to do to
your data, not what the
underlying system supports.
● A graph of transformations is
automatically transformed into an
optimized series of MapReduces.
37. MillWheel: Streaming Computations
● Framework for building low-latency
data-processing applications
● User provides a DAG of
computations to be performed
● System manages state and
persistent flow of elements
40. Streaming Patterns: Event-Time Based Windows
[Diagram: input elements plotted by processing time vs. event time (10:00–15:00); output aggregated into event-time windows]
41. Streaming Patterns: Session Windows
[Diagram: the same processing-time vs. event-time axes (10:00–15:00), with output grouped into per-key session windows in event time]
42. Formalizing Event-Time Skew
Watermarks describe event time
progress.
"No timestamp earlier than the
watermark will be seen"
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
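A common heuristic of this kind can be sketched in illustrative Python (an assumption for the example, not Beam's watermark implementation): trail the maximum event time seen so far by the skew you expect between event time and arrival time. Anything that arrives with a timestamp behind the watermark is late.

```python
def heuristic_watermark(seen_event_times, max_expected_skew):
    """Trail the max event time seen so far by the expected skew."""
    return max(seen_event_times) - max_expected_skew

arrivals = [100, 105, 103, 110]          # event timestamps, in arrival order
wm = heuristic_watermark(arrivals, 5)    # watermark = 110 - 5 = 105
late = [t for t in [104, 106] if t < wm] # 104 arrives behind the watermark
print(wm, late)  # 105 [104]
```

The slide's trade-off shows up directly in `max_expected_skew`: a larger value delays results (too slow), a smaller one lets more data land late (too fast).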
45. What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
46. What are you computing?
What Where When How
Element-Wise Aggregating Composite
47. What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
    raw.apply(ParDo.of(new ParseFn()));
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
input.apply(Sum.integersPerKey());
What Where When How
*All code snippets are pseudo-java -- details shortened or elided for clarity.
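The same two-stage shape (element-wise transform, then per-key aggregation) can be sketched in plain Python to show what the pseudo-Java computes; `parse` and the comma-separated line format are assumptions for the example, not the talk's actual `ParseFn`.

```python
from collections import defaultdict

def parse(line):
    """Element-wise transform: raw log line -> (team, score) pair."""
    team, score = line.split(",")
    return team, int(score)

def sum_per_key(pairs):
    """Aggregation: sum scores per team (what Sum.integersPerKey computes)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

raw = ["blue,3", "red,5", "blue,1", "red,2"]
print(sum_per_key(map(parse, raw)))  # {'blue': 4, 'red': 7}
```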
50. Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
Where in event time?
What Where When How
[Diagram: fixed, sliding, and session windows chunking elements for keys 1–3 along the time axis]
51. Where: Fixed 2-minute Windows
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
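Conceptually, fixed windowing just extends the aggregation key with a window: each element's timestamp is mapped to its 2-minute window, and sums are kept per (key, window). The Python sketch below is illustrative of that idea, not of Beam's windowing machinery.

```python
def fixed_window(event_time_secs, window_size_secs=120):
    """Assign a timestamp to its fixed window [start, end), in seconds."""
    start = (event_time_secs // window_size_secs) * window_size_secs
    return (start, start + window_size_secs)

def sum_per_key_and_window(events):
    """events: (team, score, event_time) triples; key becomes (team, window)."""
    totals = {}
    for team, score, ts in events:
        key = (team, fixed_window(ts))
        totals[key] = totals.get(key, 0) + score
    return totals

events = [("blue", 3, 10), ("blue", 1, 130), ("red", 5, 30)]
print(sum_per_key_and_window(events))
# {('blue', (0, 120)): 3, ('blue', (120, 240)): 1, ('red', (0, 120)): 5}
```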
53. When in processing time?
What Where When How
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
54. When: Triggering at the Watermark
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
56. When: Early and Late Firings
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
58. How do refinements relate?
What Where When How
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.

Firing         | Elements | Discarding | Accumulating | Acc. & Retracting*
Speculative    | 3        | 3          | 3            | 3
Watermark      | 5, 1     | 6          | 9            | 9, -3
Late           | 2        | 2          | 11           | 11, -9
Total Observed | 11       | 11         | 23           | 11

*Accumulating & Retracting not yet implemented in Apache Beam.
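The table's three modes can be reproduced with a short illustrative Python sketch (a toy simulation, not Beam code): each firing sums its pane's elements, and the mode decides whether earlier results are dropped, folded in, or folded in with a retraction of the previous output.

```python
def pane_outputs(firings, mode):
    """Per-firing outputs for one window whose trigger fires several times,
    under each accumulation mode from the table above."""
    outputs, running = [], 0
    for elements in firings:
        pane_sum = sum(elements)
        if mode == "discarding":
            outputs.append([pane_sum])                 # each pane stands alone
        elif mode == "accumulating":
            outputs.append([running + pane_sum])       # refine the running total
        elif mode == "retracting":
            pane = [running + pane_sum] + ([-running] if running else [])
            outputs.append(pane)                       # refine + retract previous
        running += pane_sum
    return outputs

firings = [[3], [5, 1], [2]]  # speculative, at-watermark, late (as in the table)
print(pane_outputs(firings, "discarding"))    # [[3], [6], [2]]
print(pane_outputs(firings, "accumulating"))  # [[3], [9], [11]]
print(pane_outputs(firings, "retracting"))    # [[3], [9, -3], [11, -9]]
```

Summing each column of outputs reproduces the "Total Observed" row: 11, 23, and 11.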
59. How: Add Newest, Remove Previous
What Where When How
PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(Sessions.withGapDuration(Duration.standardMinutes(1)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1)))
        .accumulatingAndRetractingFiredPanes())
    .apply(Sum.integersPerKey());
61. Customizing What / Where / When / How
1. Classic Batch
2. Batch with Fixed Windows
3. Streaming
4. Streaming with Speculative + Late Data
5. Streaming with Retractions
What Where When How
63. The Dataflow Model becomes Beam & Cloud Dataflow
Apache Beam: a unified model for batch and stream processing, supporting multiple runtimes.
Google Cloud Dataflow: a great place to run Beam.
64. What is Part of Apache Beam?
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for existing distributed processing backends
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing
65. Apache Beam Technical Vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: SDKs (Beam Java, Beam Python, other languages) feed the Beam Model's pipeline-construction layer; the Beam Model's Fn Runners layer hands pipelines to runners (Runner A, Runner B, Cloud Dataflow) for execution]
67. Apache Beam Roadmap
02/01/2016: Enter Apache Incubator
02/25/2016: 1st commit to ASF repository
Early 2016: Internal API redesign (slight chaos)
Mid 2016: API stabilization
Late 2016: Multiple runners execute Beam pipelines
68. Learn More!
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
69. Motivation for Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau