Future-proof, portable batch
and streaming pipelines
using Apache Beam
Malo Deniélou
Senior Software Engineer at Google
malo@google.com
Big Data Apps Meetup, May 2017
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
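To make this concrete, below is a minimal sketch of a Beam pipeline in the Java SDK; the file paths are placeholders and the class and step names are illustrative, not from the talk. Note that nothing in the pipeline logic names a particular execution engine.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Runtime concerns (which runner, parallelism, credentials) live in the
    // options; the processing logic below never mentions them.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("/tmp/input/*.txt"))      // placeholder path
     .apply("CountPerLine", Count.<String>perElement())               // -> KV<line, occurrences>
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("/tmp/output/counts"));  // placeholder path

    p.run().waitUntilFinish();
  }
}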
Why use Apache Beam?
1. Unified: The full spectrum from batch to streaming
2. Extensible: Transform libraries, IO Connectors, and DSLs -- oh my!
3. Portable: Write once, run anywhere
4. Demo: Beam, Dataflow, Spark, Kafka, Flink, ...
5. Getting Started: Beaming into the Future
A free-to-play gaming analytics use case
● How many players made it to stage 12?
● What path did they take through the stage?
● Team points and other stats at this point in time?
● Of the players who took the same route where a certain
condition was true, how many made an in-app purchase?
● What are the characteristics of the player segment who
didn’t make the purchase vs. those who did?
● Why was this custom event so successful in driving in-app
purchases compared to others?
You need key indicators specific to your game to increase adoption, engagement, etc.
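As a hedged sketch of how the first question above could be answered with Beam's Java SDK, assuming a made-up "player,stage" CSV event format and placeholder storage paths:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class Stage12Players {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadEvents", TextIO.read().from("gs://my-bucket/game-events/*.csv"))
     .apply("Stage12Only", Filter.by((String line) -> line.endsWith(",12")))
     .apply("PlayerId", MapElements.into(TypeDescriptors.strings())
         .via((String line) -> line.split(",")[0]))
     .apply("Dedup", Distinct.<String>create())       // count each player once
     .apply("CountPlayers", Count.<String>globally())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((Long n) -> "players who reached stage 12: " + n))
     .apply("Write", TextIO.write().to("gs://my-bucket/reports/stage12"));

    p.run().waitUntilFinish();
  }
}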
The Solution
Collect real-time game events and player data
Combine data in meaningful ways
Apply real-time and historical batch analysis
= Impact engagement, retention, and spend.
How this would look on Google Cloud
Lots of alternatives …
No API!
… but I need to rewrite all my pipelines!
Here comes Beam: your portable data processing API
No technological or environmental lock-in!
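Concretely, the runner becomes a launch-time flag rather than a code change. A minimal sketch (the project and bucket flags are placeholders; each --runner value requires the corresponding Beam runner artifact on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class RunnerAgnosticLaunch {
  public static void main(String[] args) {
    // The same binary can be launched with, for example:
    //   --runner=DirectRunner
    //   --runner=SparkRunner
    //   --runner=FlinkRunner
    //   --runner=DataflowRunner --project=my-gcp-project --tempLocation=gs://my-bucket/tmp
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("hello", "beam"));  // the pipeline itself is runner-agnostic
    p.run().waitUntilFinish();
  }
}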
What is being computed?
Extensible PTransforms let you
build pipelines modularly.
What is being computed? Where in event time?
Leaderboard streaming example
Demo!
Beam, Dataflow, Spark, Kafka, Flink, ...
● Beam’s Java SDK runs on multiple
runtime environments, including:
○ Apache Apex
○ Apache Spark
○ Apache Flink
○ Google Cloud Dataflow
○ [in development] Apache Gearpump
● Cross-language infrastructure is in
progress.
○ Beam’s Python SDK currently runs
on Google Cloud Dataflow
● First stable version at the end of May!
Beam Vision: as of May 2017
[Diagram: Java and Python SDKs construct pipelines against the Beam Model; Fn Runners execute them on Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Cloud Dataflow.]
How do you build an abstraction layer?
[Diagram: an abstraction layer over Apache Spark, Cloud Dataflow, Apache Flink, and unknown future runners.]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Getting Started with Apache Beam
Beaming into the Future
The Beam Model
The Dataflow Model paper from VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
Streaming 101 and 102: The World Beyond Batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Beam
Apache Beam: http://beam.apache.org
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Detailed documentation
Thanks!
Additional slides
Unified
The full spectrum from batch to streaming
Less code, better code
Processing time vs. event time
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
  .apply(Sum.integersPerKey());
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()))
  .apply(Sum.integersPerKey());
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingFiredPanes())
  .apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
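A note on the snippets above: AtWatermark, AtPeriod, and AtCount are the Beam model's shorthand names. In the actual Java SDK they map, roughly, onto the trigger classes below, and the SDK additionally requires an explicit allowed lateness. A sketch, assuming a keyed integer input:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class TriggeredSums {
  static PCollection<KV<String, Integer>> sumScores(PCollection<KV<String, Integer>> input) {
    return input
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()             // ~ AtWatermark()
                .withEarlyFirings(AfterProcessingTime                // ~ AtPeriod(1 min)
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))  // ~ AtCount(1)
            .withAllowedLateness(Duration.standardDays(1))           // how long to accept late data
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
  }
}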
Customizing What Where When How
[Diagram: the same computation under four configurations: 1. Classic Batch, 2. Windowed Batch, 3. Streaming, 4. Streaming + Accumulation.]
Extensible
Transform libraries, IO connectors, and DSLs -- Oh my!
PTransforms
● PTransforms let you build pipelines modularly.
● All transformations are equal, so it’s easy to add new
libraries.
● The runtime can use this structure to communicate with the user.
More details on the Beam Blog: Where’s My PCollection.map()?
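For instance, a user-defined composite transform is simply a PTransform subclass whose expand() chains existing transforms. A sketch, assuming a hypothetical "user,points" line format:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

/** Hypothetical composite: parse "user,points" lines and sum points per user. */
class ExtractAndSumScores
    extends PTransform<PCollection<String>, PCollection<KV<String, Integer>>> {
  @Override
  public PCollection<KV<String, Integer>> expand(PCollection<String> lines) {
    return lines
        .apply("ParseEvent", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((String line) -> {
              String[] parts = line.split(",");
              return KV.of(parts[0], Integer.parseInt(parts[1]));
            }))
        .apply("SumPerUser", Sum.integersPerKey());
  }
}
// Usage: PCollection<KV<String, Integer>> scores = lines.apply(new ExtractAndSumScores());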
Source API
● The Source API allows users to teach the system about new bounded and unbounded data formats.
● Even the most careful hand-tuning will fail as data, code, and environments shift.
● Beam's Source API is designed to provide runtime hooks for efficient scaling.
More details on the Beam Blog: Dynamic Work Rebalancing for Beam
[Diagram: a worker's input range divided into processed & committed, processed and uncommitted, and unprocessed work.]
Language-specific SDKs and DSLs
[Diagram: language-specific SDKs (Language A, B, C) and DSLs (DSL X, DSL Z) built on the Beam Model.]
● Multiple language-specific
SDKs that implement the full
Beam model.
● Optional domain specific
languages that narrow or
transform the abstractions
to align with simple use
cases or specific user
communities.
Portable
Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions
at the core of Apache Beam
● Choice of API: Users write their
pipelines in a language that’s
familiar and integrated with their
other tooling
● Choice of Runtime: Users choose
the right runner for their current
needs -- on-prem / cloud, open
source / not, fully managed / not
● Scalability for Developers: Clean
APIs allow developers to contribute
modules independently
[Diagram: Language A, B, and C SDKs on top of the Beam Model, executing on Runner 1, Runner 2, and Runner 3.]
Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on premise or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on premise or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
Demo screenshots
because if I make them, I won’t need to use them