SlideShare a Scribd company logo
1 of 70
Motivation for and Introduction to
Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau
History of distributed data processing
MapReduce
BigTable DremelColossus
FlumeMegastoreSpanner
PubSub
Millwheel
What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
History of distributed data processing
MapReduce
BigTable DremelColossus
FlumeMegastoreSpanner
PubSub
Millwheel
Apache
Beam
Google Cloud
Dataflow
What problems are left to solve?
Well, lots, but to name a few...
- Streaming Ops
- Straggling workers
- Flexible/autoscaled resources
- Unifying streaming/batch
- Event time processing
- Job portability
Cloud Logs
Google App Engine
Google Analytics
Premium
Cloud Pub/Sub
BigQuery Storage
(tables)
Cloud Bigtable
(NoSQL)
Cloud Storage
(files)
Cloud Dataflow
BigQuery Analytics
(SQL)
Capture Store Analyze
Batch
Cloud DataStore
Process
Stream
Cloud Monitoring
Cloud
Bigtable
Real time analytics
and Alerts
Cloud Dataflow
Cloud Dataproc
Integrated: Part of Google Cloud Platform
Cloud Dataproc
6
Integrated: Monitoring UI
Integrated: Distributed Logging
Motivation for Beam and Dataflow
Dataflow contains Google’s latest
approaches to streaming ops,
straggler mitigation and
autoscaling.
Beam SDK provides an IO API that
plumbs signals required for next
generation data processing.
Can be extended to new sources
or runners (execution framework).
Apache Beam
How to update/upgrade a
production streaming pipeline?
Without...
● losing data in pipeline
● missing new data during
downtime
● comprising de-dupe
guarantees
● impacting write pattern
Streaming Ops
Streaming Ops
Common approaches
Parallel swap
● Deploy a 2nd pipeline
● swap traffic over
● Take down the 1st
Parallel swap Stop & restart
Preserve data in pipeline
Avoid missing new data
De-dupe guarantees
Preserve write pattern
Stop & Restart
● Stop pulling from sources
● Drain pipeline; then stop it
● Start new pipeline
● Resume sources
Streaming Ops
Ideal approach
Update in-place
● Submit a new job with the
same name and --update flag
Parallel swap Stop & restart Update in-place
Preserve data in pipeline
Avoid missing new data
De-dupe guarantees
Preserve write pattern
Status
● It works!
● ...for minor pipeline changes
● Included in Dataflow
Straggling workers
A system is only as fast as its slowest worker
Workers lag because:
● Variable machines
● Unequal work
○ Poor distribution of elements
○ Variable work per element
Fixes?
● Filter bad machines?
● Backup / speculative execution?
Straggling workers
Fixes for variable machines
● Filter bad machines
○ Test each machine at spin up
and keep the good ones
● Backup / speculative execution
○ Assign long running work
twice to a backup worker and
keep whatever finishes first
Dynamic Work Rebalancing
Identify where work can be split
Dynamically rebalance it onto another worker
Make sure everything is processed exactly once
Dynamic Work Rebalancing
Status:
● Works great
● Requires signals from source (Beam IO API)
● Included in Dataflow
All machines
Scaling clusters & jobs
Traditional model: Static set of machines
● Allocate a data cluster
● Deploy jobs to the cluster
● Compute coupled to storage
New model: Cloud is infinite and ephemeral
● Separate compute from storage?
● Dynamic allocation to cluster?
● Dynamic allocation to job?
What signals to scale on?
What work do you give new workers?
Cluster
Job
Cloud
Scaling clusters & jobs
Common approach
● Framework scales job
● Cloud provider scales cluster
● What signals to use?
● How to allocate work to new machines?
Ideal approach
● Job, cluster 1:1 (Job = cluster)
● Machines spin up/down fast
● Data processing specific signals
● Can split work on the fly
Cluster
Job
Framework scaling
Infra scaling
Scaling clusters & jobs
Status
● Signals provided by Beam SDK:
○ How much work there is to do (backlog)
○ How parallelizable is it
● Works great in batch (see right)
● Streaming progressing well
● Only think in terms of jobs
Data...
...can be big...
...really, really big...
Tuesday
Wednesday
Thursday
… maybe infinitely big...
9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00
… with unknown delays.
9:008:00 14:0013:0012:0011:0010:00
8:00
8:008:00
1 + 1 = 2
Completeness Latency Cost
$$$
Data Processing Tradeoffs
Requirements: Billing Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Live Cost Estimate Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Pipeline
Completeness Low Latency Low Cost
Important
Not Important
Requirements: Abuse Detection Backfill Pipeline
Completeness Low Latency Low Cost
Important
Not Important
The Evolution of the Beam Model2
(Produce)
MapReduce: Batch Processing
(Prepare)
Map
(Shuffle)
Reduce
FlumeJava: Easy and Efficient MapReduce Pipelines
● Higher-level API with simple data
processing abstractions.
○ Focus on what you want to do to
your data, not what the
underlying system supports.
● A graph of transformations is
automatically transformed into an
optimized series of MapReduces.
MapReduce
Batch Patterns: Creating Structured Data
MapReduce
Batch Patterns: Repetitive Runs
Tuesday
Wednesday
Thursday
MapReduce
Tuesday [11:00 - 12:00)
[12:00 - 13:00)
[13:00 - 14:00)
[14:00 - 15:00)
[15:00 - 16:00)
[16:00 - 17:00)
[18:00 - 19:00)
[19:00 - 20:00)
[21:00 - 22:00)
[22:00 - 23:00)
[23:00 - 0:00)
Batch Patterns: Time Based Windows
MapReduce
TuesdayWednesday
Batch Patterns: Sessions
Jose
Lisa
Ingo
Asha
Cheryl
Ari
WednesdayTuesday
MillWheel: Streaming Computations
● Framework for building low-latency
data-processing applications
● User provides a DAG of
computations to be performed
● System manages state and
persistent flow of elements
Streaming Patterns: Element-wise transformations
13:00 14:008:00 9:00 10:00 11:00 12:00
Processing
Time
Streaming Patterns: Aggregating Time Based Windows
13:00 14:008:00 9:00 10:00 11:00 12:00
Processing
Time
Streaming Patterns: Event-Time Based Windows
Event Time
Processing
Time
11:0010:00 15:0014:0013:0012:00
11:0010:00 15:0014:0013:0012:00
Input
Output
Streaming Patterns: Session Windows
Event Time
Processing
Time
11:0010:00 15:0014:0013:0012:00
11:0010:00 15:0014:0013:0012:00
Input
Output
Formalizing Event-Time Skew
Watermarks describe event time
progress.
"No timestamp earlier than the
watermark will be seen"
Often heuristic-based.
Too Slow? Results are delayed.
Too Fast? Some data is late.
Streaming or Batch?
1 + 1 = 2 $$$
Completeness Latency Cost
Why not both?
What, Where, When, and How3
What are you computing?
Where in event time?
When in processing time?
How do refinements relate?
What are you computing?
What Where When How
Element-Wise Aggregating Composite
What: Computing Integer Sums
// Collection of raw log lines
PCollection<String> raw = IO.read(...);
// Element-wise transformation into team/score pairs
PCollection<KV<String, Integer>> input =
raw.apply(ParDo.of(new ParseFn());
// Composite transformation containing an aggregation
PCollection<KV<String, Integer>> scores =
input.apply(Sum.integersPerKey());
What Where When How
*All code snippets are pseudo-java -- details shortened or elided for clarity.
What: Computing Integer Sums
What Where When How
What: Computing Integer Sums
What Where When How
Windowing divides data into event-time-based finite chunks.
Often required when doing aggregations over unbounded data.
Where in event time?
What Where When How
Fixed Sliding
1 2 3
54
Sessions
2
431
Key
2
Key
1
Key
3
Time
1 2 3 4
Where: Fixed 2-minute Windows
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window
.into(FixedWindows.of(Duration.standardMinutes(2)))
.apply(Sum.integersPerKey());
Where: Fixed 2-minute Windows
What Where When How
When in processing time?
What Where When How
• Triggers control
when results are
emitted.
• Triggers are often
relative to the
watermark.
When: Triggering at the Watermark
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window
.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()))
.apply(Sum.integersPerKey());
When: Triggering at the Watermark
What Where When How
When: Early and Late Firings
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window
.into(FixedWindows.of(Duration.standardMinutes(2))
.triggering(AtWatermark()
.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1))))
.apply(Sum.integersPerKey());
When: Early and Late Firings
What Where When How
How do refinements relate?
What Where When How
• How should multiple outputs per window
accumulate?
• Appropriate choice depends on consumer.
Firing Elements
Speculative 3
Watermark 5, 1
Late 2
Total Observ 11
Discarding
3
6
2
11
Accumulating
3
9
11
23
Acc. & Retracting*
3
9, -3
11, -9
11
*Accumulating & Retracting not yet implemented in Apache Beam.
How: Add Newest, Remove Previous
What Where When How
PCollection<KV<String, Integer>> scores = input
.apply(Window
.into(Sessions.withGapDuration(Duration.standardMinutes(1)))
.triggering(AtWatermark()
.withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
.withLateFirings(AtCount(1)))
.accumulatingAndRetractingFiredPanes())
.apply(Sum.integersPerKey());
How: Add Newest, Remove Previous
What Where When How
1.Classic Batch 2. Batch with Fixed
Windows
3. Streaming 5. Streaming With
Retractions
4. Streaming with
Speculative + Late Data
Customizing What When Where How
What Where When How
Apache Beam (incubating)4
a unified model for
batch and stream processing
supporting multiple runtimes
a great place to run Beam
Apache Beam Google Cloud Dataflow
The Dataflow Model & Cloud DataflowBeam
1. The Beam Model: What / Where / When / How
2. SDKs for writing Beam pipelines -- starting with Java
3. Runners for Existing Distributed Processing Backends
• Apache Flink (thanks to data Artisans)
• Apache Spark (thanks to Cloudera)
• Google Cloud Dataflow (fully managed service)
• Local (in-process) runner for testing
What is Part of Apache Beam?
1. End users: who want to write
pipelines in a language that’s
familiar.
2. SDK writers: who want to make
Beam concepts available in new
languages.
3. Runner writers: who have a
distributed processing
environment and want to support
Beam pipelines
Apache Beam Technical Vision
Beam Model: Fn Runners
Runner A Runner B
Beam Model: Pipeline Construction
Other
LanguagesBeam Java
Beam
Python
Execution Execution
Cloud
Dataflow
Execution
Categorizing Runner Capabilities
http://beam.incubator.apache.org/capability-matrix/
Apache Beam Roadmap
02/01/2016
Enter Apache
Incubator
Early 2016
Internal API redesign
Slight Chaos
Mid 2016
API Stabilization
Late 2016
Multiple runners
execute Beam
pipelines
02/25/2016
1st commit to
ASF repository
Learn More!
Apache Beam (incubating)
http://beam.incubator.apache.org
The World Beyond Batch 101 & 102
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Join the Beam mailing lists!
user-subscribe@beam.incubator.apache.org
dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Motivation for Beam and Dataflow
Eric Anderson, Google PM
@ericmander
With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau
Demo
https://github.
com/silviulica/WorkflowExamples/blob/master/noteboo
ks/DataflowIntroduction.ipynb

More Related Content

What's hot

Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache Beam
PyData
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Trieu Nguyen
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Flink Forward
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 

What's hot (20)

Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
 
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache BeamMalo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
Malo Denielou - No shard left behind: Dynamic work rebalancing in Apache Beam
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Sourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache BeamSourabh Bajaj - Big data processing with Apache Beam
Sourabh Bajaj - Big data processing with Apache Beam
 
Dataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDataflow - A Unified Model for Batch and Streaming Data Processing
Dataflow - A Unified Model for Batch and Streaming Data Processing
 
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture  for Big Data AppGoogle Cloud Dataflow and lightweight Lambda Architecture  for Big Data App
Google Cloud Dataflow and lightweight Lambda Architecture for Big Data App
 
From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...From stream to recommendation using apache beam with cloud pubsub and cloud d...
From stream to recommendation using apache beam with cloud pubsub and cloud d...
 
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
Riddles of Streaming - Code Puzzlers for Fun & Profit (Nick Dearden, Confluen...
 
Data Pipeline at Tapad
Data Pipeline at TapadData Pipeline at Tapad
Data Pipeline at Tapad
 
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data ProcessingCloud Dataflow - A Unified Model for Batch and Streaming Data Processing
Cloud Dataflow - A Unified Model for Batch and Streaming Data Processing
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For YouFlink Forward SF 2017: James Malone - Make The Cloud Work For You
Flink Forward SF 2017: James Malone - Make The Cloud Work For You
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
Flink Forward SF 2017: Feng Wang & Zhijiang Wang - Runtime Improvements in Bl...
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch -  Dynami...
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami...
 
Universal metrics with Apache Beam
Universal metrics with Apache BeamUniversal metrics with Apache Beam
Universal metrics with Apache Beam
 

Viewers also liked

Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
Data Con LA
 
101129 tokyopref bochibochi
101129 tokyopref bochibochi101129 tokyopref bochibochi
101129 tokyopref bochibochi
redgang
 

Viewers also liked (20)

Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
Big Data Day LA 2016/ Use Case Driven track - From Clusters to Clouds, Hardwa...
 
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
Big Data Day LA 2016/ Data Science Track - Intuit's Payments Risk Platform, D...
 
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 
Big Data Day LA 2015 - Using data visualization to find patterns in multidime...
Big Data Day LA 2015 - Using data visualization to find patterns in multidime...Big Data Day LA 2015 - Using data visualization to find patterns in multidime...
Big Data Day LA 2015 - Using data visualization to find patterns in multidime...
 
101129 tokyopref bochibochi
101129 tokyopref bochibochi101129 tokyopref bochibochi
101129 tokyopref bochibochi
 
Dot pab forum september 2011
Dot pab forum september 2011Dot pab forum september 2011
Dot pab forum september 2011
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
 
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
Big Data Day LA 2015 - Transforming into a data driven enterprise using exist...
 
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
Big Data Day LA 2015 - Big Data Day LA 2015 - Applying GeoSpatial Analytics u...
 
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
An evening with Jay Kreps; author of Apache Kafka, Samza, Voldemort & Azkaban.
 
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
 
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
 
Big Data Day LA 2016/ Data Science Track - Data Storytelling for Impact - Dav...
Big Data Day LA 2016/ Data Science Track - Data Storytelling for Impact - Dav...Big Data Day LA 2016/ Data Science Track - Data Storytelling for Impact - Dav...
Big Data Day LA 2016/ Data Science Track - Data Storytelling for Impact - Dav...
 
Do you know how the ultra affluent use social media? Find out.
Do you know how the ultra affluent use social media? Find out.Do you know how the ultra affluent use social media? Find out.
Do you know how the ultra affluent use social media? Find out.
 
Spark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of DatabricksSpark after Dark by Chris Fregly of Databricks
Spark after Dark by Chris Fregly of Databricks
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
6 damaging myths about social media and the truths behind them
6 damaging myths about social media and the truths behind them6 damaging myths about social media and the truths behind them
6 damaging myths about social media and the truths behind them
 

Similar to Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing with Apache Beam and Google Cloud Dataflow - Eric Anderson, Product Manager - Google

William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
Flink Forward
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 

Similar to Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing with Apache Beam and Google Cloud Dataflow - Eric Anderson, Product Manager - Google (20)

William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Natural Laws of Software Performance
Natural Laws of Software PerformanceNatural Laws of Software Performance
Natural Laws of Software Performance
 
Transforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big DataTransforming Mobile Push Notifications with Big Data
Transforming Mobile Push Notifications with Big Data
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDBBuild Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
Build Low Latency, Windowless Event Processing Pipelines with Quine and ScyllaDB
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Stream Processing Overview
Stream Processing OverviewStream Processing Overview
Stream Processing Overview
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Deploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache BeamDeploying Models at Scale with Apache Beam
Deploying Models at Scale with Apache Beam
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Distributed real time stream processing- why and how
Distributed real time stream processing- why and howDistributed real time stream processing- why and how
Distributed real time stream processing- why and how
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 

More from Data Con LA

Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 

More from Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Recently uploaded (20)

Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
Connecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAKConnecting the Dots in Product Design at KAYAK
Connecting the Dots in Product Design at KAYAK
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
The UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, OcadoThe UX of Automation by AJ King, Senior UX Researcher, Ocado
The UX of Automation by AJ King, Senior UX Researcher, Ocado
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
PLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. StartupsPLAI - Acceleration Program for Generative A.I. Startups
PLAI - Acceleration Program for Generative A.I. Startups
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdfThe Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
The Value of Certifying Products for FDO _ Paul at FIDO Alliance.pdf
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024Enterprise Knowledge Graphs - Data Summit 2024
Enterprise Knowledge Graphs - Data Summit 2024
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 

Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing with Apache Beam and Google Cloud Dataflow - Eric Anderson, Product Manager - Google

  • 1. Motivation for and Introduction to Beam and Dataflow Eric Anderson, Google PM @ericmander With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau
  • 2. History of distributed data processing MapReduce BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel
  • 3. What problems are left to solve? Well, lots, but to name a few... - Streaming Ops - Straggling workers - Flexible/autoscaled resources - Unifying streaming/batch - Event time processing - Job portability
  • 4. History of distributed data processing MapReduce BigTable DremelColossus FlumeMegastoreSpanner PubSub Millwheel Apache Beam Google Cloud Dataflow
  • 5. What problems are left to solve? Well, lots, but to name a few... - Streaming Ops - Straggling workers - Flexible/autoscaled resources - Unifying streaming/batch - Event time processing - Job portability
  • 6. Cloud Logs Google App Engine Google Analytics Premium Cloud Pub/Sub BigQuery Storage (tables) Cloud Bigtable (NoSQL) Cloud Storage (files) Cloud Dataflow BigQuery Analytics (SQL) Capture Store Analyze Batch Cloud DataStore Process Stream Cloud Monitoring Cloud Bigtable Real time analytics and Alerts Cloud Dataflow Cloud Dataproc Integrated: Part of Google Cloud Platform Cloud Dataproc 6
  • 9. Motivation for Beam and Dataflow Dataflow contains Google’s latest approaches to streaming ops, straggler mitigation and autoscaling. Beam SDK provides an IO API that plumbs signals required for next generation data processing. Can be extended to new sources or runners (execution framework). Apache Beam
  • 10. How to update/upgrade a production streaming pipeline? Without... ● losing data in pipeline ● missing new data during downtime ● comprising de-dupe guarantees ● impacting write pattern Streaming Ops
  • 11. Streaming Ops Common approaches Parallel swap ● Deploy a 2nd pipeline ● swap traffic over ● Take down the 1st Parallel swap Stop & restart Preserve data in pipeline Avoid missing new data De-dupe guarantees Preserve write pattern Stop & Restart ● Stop pulling from sources ● Drain pipeline; then stop it ● Start new pipeline ● Resume sources
  • 12. Streaming Ops Ideal approach Update in-place ● Submit a new job with the same name and --update flag Parallel swap Stop & restart Update in-place Preserve data in pipeline Avoid missing new data De-dupe guarantees Preserve write pattern Status ● It works! ● ...for minor pipeline changes ● Included in Dataflow
  • 13. Straggling workers A system is only as fast as its slowest worker Workers lag because: ● Variable machines ● Unequal work ○ Poor distribution of elements ○ Variable work per element Fixes? ● Filter bad machines? ● Backup / speculative execution?
  • 14. Straggling workers Fixes for variable machines ● Filter bad machines ○ Test each machine at spin up and keep the good ones ● Backup / speculative execution ○ Assign long running work twice to a backup worker and keep whatever finishes first
  • 15. Dynamic Work Rebalancing Identify where work can be split Dynamically rebalance it onto another worker Make sure everything is processed exactly once
  • 16. Dynamic Work Rebalancing Status: ● Works great ● Requires signals from source (Beam IO API) ● Included in Dataflow
  • 17. All machines Scaling clusters & jobs Traditional model: Static set of machines ● Allocate a data cluster ● Deploy jobs to the cluster ● Compute coupled to storage New model: Cloud is infinite and ephemeral ● Separate compute from storage? ● Dynamic allocation to cluster? ● Dynamic allocation to job? What signals to scale on? What work do you give new workers? Cluster Job
  • 18. Cloud Scaling clusters & jobs Common approach ● Framework scales job ● Cloud provider scales cluster ● What signals to use? ● How to allocate work to new machines? Ideal approach ● Job, cluster 1:1 (Job = cluster) ● Machines spin up/down fast ● Data processing specific signals ● Can split work on the fly Cluster Job Framework scaling Infra scaling
  • 19. Scaling clusters & jobs Status ● Signals provided by Beam SDK: ○ How much work there is to do (backlog) ○ How parallelizable is it ● Works great in batch (see right) ● Streaming progressing well ● Only think in terms of jobs
  • 23. … maybe infinitely big... 9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00
  • 24. … with unknown delays. 9:008:00 14:0013:0012:0011:0010:00 8:00 8:008:00
  • 25. 1 + 1 = 2 Completeness Latency Cost $$$ Data Processing Tradeoffs
  • 26. Requirements: Billing Pipeline Completeness Low Latency Low Cost Important Not Important
  • 27. Requirements: Live Cost Estimate Pipeline Completeness Low Latency Low Cost Important Not Important
  • 28. Requirements: Abuse Detection Pipeline Completeness Low Latency Low Cost Important Not Important
  • 29. Requirements: Abuse Detection Backfill Pipeline Completeness Low Latency Low Cost Important Not Important
  • 30. The Evolution of the Beam Model2
  • 32. FlumeJava: Easy and Efficient MapReduce Pipelines ● Higher-level API with simple data processing abstractions. ○ Focus on what you want to do to your data, not what the underlying system supports. ● A graph of transformations is automatically transformed into an optimized series of MapReduces.
  • 34. MapReduce Batch Patterns: Repetitive Runs Tuesday Wednesday Thursday
  • 35. MapReduce Tuesday [11:00 - 12:00) [12:00 - 13:00) [13:00 - 14:00) [14:00 - 15:00) [15:00 - 16:00) [16:00 - 17:00) [18:00 - 19:00) [19:00 - 20:00) [21:00 - 22:00) [22:00 - 23:00) [23:00 - 0:00) Batch Patterns: Time Based Windows
  • 37. MillWheel: Streaming Computations ● Framework for building low-latency data-processing applications ● User provides a DAG of computations to be performed ● System manages state and persistent flow of elements
  • 38. Streaming Patterns: Element-wise transformations 13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
  • 39. Streaming Patterns: Aggregating Time Based Windows 13:00 14:008:00 9:00 10:00 11:00 12:00 Processing Time
  • 40. Streaming Patterns: Event-Time Based Windows Event Time Processing Time 11:0010:00 15:0014:0013:0012:00 11:0010:00 15:0014:0013:0012:00 Input Output
  • 41. Streaming Patterns: Session Windows Event Time Processing Time 11:0010:00 15:0014:0013:0012:00 11:0010:00 15:0014:0013:0012:00 Input Output
  • 42. Formalizing Event-Time Skew Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen" Often heuristic-based. Too Slow? Results are delayed. Too Fast? Some data is late.
  • 43. Streaming or Batch? 1 + 1 = 2 $$$ Completeness Latency Cost Why not both?
  • 44. What, Where, When, and How3
  • 45. What are you computing? Where in event time? When in processing time? How do refinements relate?
  • 46. What are you computing? What Where When How Element-Wise Aggregating Composite
  • 47. What: Computing Integer Sums // Collection of raw log lines PCollection<String> raw = IO.read(...); // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(ParDo.of(new ParseFn()); // Composite transformation containing an aggregation PCollection<KV<String, Integer>> scores = input.apply(Sum.integersPerKey()); What Where When How *All code snippets are pseudo-java -- details shortened or elided for clarity.
  • 48. What: Computing Integer Sums What Where When How
  • 49. What: Computing Integer Sums What Where When How
  • 50. Windowing divides data into event-time-based finite chunks. Often required when doing aggregations over unbounded data. Where in event time? What Where When How Fixed Sliding 1 2 3 54 Sessions 2 431 Key 2 Key 1 Key 3 Time 1 2 3 4
  • 51. Where: Fixed 2-minute Windows What Where When How PCollection<KV<String, Integer>> scores = input .apply(Window .into(FixedWindows.of(Duration.standardMinutes(2))) .apply(Sum.integersPerKey());
  • 52. Where: Fixed 2-minute Windows What Where When How
  • 53. When in processing time? What Where When How • Triggers control when results are emitted. • Triggers are often relative to the watermark.
  • 54. When: Triggering at the Watermark What Where When How PCollection<KV<String, Integer>> scores = input .apply(Window .into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark())) .apply(Sum.integersPerKey());
  • 55. When: Triggering at the Watermark What Where When How
  • 56. When: Early and Late Firings What Where When How PCollection<KV<String, Integer>> scores = input .apply(Window .into(FixedWindows.of(Duration.standardMinutes(2)) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1)))) .apply(Sum.integersPerKey());
  • 57. When: Early and Late Firings What Where When How
  • 58. How do refinements relate? What Where When How • How should multiple outputs per window accumulate? • Appropriate choice depends on consumer. Firing Elements Speculative 3 Watermark 5, 1 Late 2 Total Observ 11 Discarding 3 6 2 11 Accumulating 3 9 11 23 Acc. & Retracting* 3 9, -3 11, -9 11 *Accumulating & Retracting not yet implemented in Apache Beam.
  • 59. How: Add Newest, Remove Previous What Where When How PCollection<KV<String, Integer>> scores = input .apply(Window .into(Sessions.withGapDuration(Duration.standardMinutes(1))) .triggering(AtWatermark() .withEarlyFirings(AtPeriod(Duration.standardMinutes(1))) .withLateFirings(AtCount(1))) .accumulatingAndRetractingFiredPanes()) .apply(Sum.integersPerKey());
  • 60. How: Add Newest, Remove Previous What Where When How
  • 61. 1.Classic Batch 2. Batch with Fixed Windows 3. Streaming 5. Streaming With Retractions 4. Streaming with Speculative + Late Data Customizing What When Where How What Where When How
  • 63. a unified model for batch and stream processing supporting multiple runtimes a great place to run Beam Apache Beam Google Cloud Dataflow The Dataflow Model & Cloud DataflowBeam
  • 64. 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines -- starting with Java 3. Runners for Existing Distributed Processing Backends • Apache Flink (thanks to data Artisans) • Apache Spark (thanks to Cloudera) • Google Cloud Dataflow (fully managed service) • Local (in-process) runner for testing What is Part of Apache Beam?
  • 65. 1. End users: who want to write pipelines in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines Apache Beam Technical Vision Beam Model: Fn Runners Runner A Runner B Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution
  • 67. Apache Beam Roadmap 02/01/2016 Enter Apache Incubator Early 2016 Internal API redesign Slight Chaos Mid 2016 API Stabilization Late 2016 Multiple runners execute Beam pipelines 02/25/2016 1st commit to ASF repository
  • 68. Learn More! Apache Beam (incubating) http://beam.incubator.apache.org The World Beyond Batch 101 & 102 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 Join the Beam mailing lists! user-subscribe@beam.incubator.apache.org dev-subscribe@beam.incubator.apache.org Follow @ApacheBeam on Twitter
  • 69. Motivation for Beam and Dataflow Eric Anderson, Google PM @ericmander With slide contributions from Eugene Kirpichov, Frances Perry, and Tyler Akidau