Future-proof, portable batch
and streaming pipelines
using Apache Beam
Malo Deniélou
Senior Software Engineer at Google
malo@google.com
Big Data Apps Meetup, May 2017
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming
algorithms using one unified API
● Cleanly separates data processing logic
from runtime requirements
● Supports execution on multiple distributed
processing runtime environments
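To make this concrete, below is a minimal sketch of a Beam pipeline in the Java SDK; the file paths are placeholders and the class and step names are illustrative, not from the talk. Note that nothing in the pipeline logic names a particular execution engine.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Runtime concerns (which runner, parallelism, credentials) live in the
    // options; the processing logic below never mentions them.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadLines", TextIO.read().from("/tmp/input/*.txt"))      // placeholder path
     .apply("CountPerLine", Count.<String>perElement())               // -> KV<line, occurrences>
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
     .apply("WriteCounts", TextIO.write().to("/tmp/output/counts"));  // placeholder path

    p.run().waitUntilFinish();
  }
}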
Why use Apache Beam?
1. Unified: The full spectrum from batch to streaming
2. Extensible: Transform libraries, IO Connectors, and DSLs -- oh my!
3. Portable: Write once, run anywhere
4. Demo: Beam, Dataflow, Spark, Kafka, Flink, ...
5. Getting Started: Beaming into the Future
A free-to-play gaming analytics use case
● How many players made it to stage 12?
● What path did they take through the stage?
● Team points and other stats at this point in time?
● Of the players who took the same route where a certain
condition was true, how many made an in-app purchase?
● What are the characteristics of the player segment who
didn’t make the purchase vs. those who did?
● Why was this custom event so successful in driving in-app
purchases compared to others?
You need key indicators specific to your game to increase adoption, engagement, etc.
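As a hedged sketch of how the first question above could be answered with Beam's Java SDK, assuming a made-up "player,stage" CSV event format and placeholder storage paths:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class Stage12Players {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadEvents", TextIO.read().from("gs://my-bucket/game-events/*.csv"))
     .apply("Stage12Only", Filter.by((String line) -> line.endsWith(",12")))
     .apply("PlayerId", MapElements.into(TypeDescriptors.strings())
         .via((String line) -> line.split(",")[0]))
     .apply("Dedup", Distinct.<String>create())       // count each player once
     .apply("CountPlayers", Count.<String>globally())
     .apply("Format", MapElements.into(TypeDescriptors.strings())
         .via((Long n) -> "players who reached stage 12: " + n))
     .apply("Write", TextIO.write().to("gs://my-bucket/reports/stage12"));

    p.run().waitUntilFinish();
  }
}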
The Solution
Collect real-time game events and player data
Combine data in meaningful ways
Apply real-time and historical batch analysis
= Impact engagement, retention, and spend.
How this would look on Google Cloud
Lots of alternatives …
No API!
… but I need to rewrite all my pipelines!
Here comes Beam: your portable data processing API
No technological or environmental lock-in!
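Concretely, the runner becomes a launch-time flag rather than a code change. A minimal sketch (the project and bucket flags are placeholders; each --runner value requires the corresponding Beam runner artifact on the classpath):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class RunnerAgnosticLaunch {
  public static void main(String[] args) {
    // The same binary can be launched with, for example:
    //   --runner=DirectRunner
    //   --runner=SparkRunner
    //   --runner=FlinkRunner
    //   --runner=DataflowRunner --project=my-gcp-project --tempLocation=gs://my-bucket/tmp
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("hello", "beam"));  // the pipeline itself is runner-agnostic
    p.run().waitUntilFinish();
  }
}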
What is being computed?
Extensible PTransforms let you
build pipelines modularly.
What is being computed? Where in event time?
Leaderboard streaming example
Demo!
Beam, Dataflow, Spark, Kafka, Flink, ...
● Beam’s Java SDK runs on multiple
runtime environments, including:
○ Apache Apex
○ Apache Spark
○ Apache Flink
○ Google Cloud Dataflow
○ [in development] Apache Gearpump
● Cross-language infrastructure is in
progress.
○ Beam’s Python SDK currently runs
on Google Cloud Dataflow
● First stable version at the end of May!
Beam Vision: as of May 2017
[Diagram: Java and Python SDKs construct pipelines against the Beam Model; Fn Runners execute them on Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and Cloud Dataflow.]
How do you build an abstraction layer?
[Diagram: an abstraction layer over Apache Spark, Cloud Dataflow, Apache Flink, and unknown future runners.]
Beam: the intersection of runner functionality?
Beam: the union of runner functionality?
Beam: the future!
Getting Started with Apache Beam
Beaming into the Future
The Beam Model
The Dataflow Model paper from VLDB 2015
http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
Streaming 101 and 102: The World Beyond Batch
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
Beam
Apache Beam: http://beam.apache.org
Quickstarts
● Java SDK
● Python SDK
Example walkthroughs
● Word Count
● Mobile Gaming
Detailed documentation
Thanks!
Additional slides
Unified
The full spectrum from batch to streaming
Less code, better code
Processing time vs. event time
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
PCollection<KV<String, Integer>> scores = input
.apply(Sum.integersPerKey());
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
  .apply(Sum.integersPerKey());
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()))
  .apply(Sum.integersPerKey());
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> scores = input
  .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
               .triggering(AtWatermark()
                   .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                   .withLateFirings(AtCount(1)))
               .accumulatingFiredPanes())
  .apply(Sum.integersPerKey());
The Beam Model: How do refinements relate?
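A note on the snippets above: AtWatermark, AtPeriod, and AtCount are the Beam model's shorthand names. In the actual Java SDK they map, roughly, onto the trigger classes below, and the SDK additionally requires an explicit allowed lateness. A sketch, assuming a keyed integer input:

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class TriggeredSums {
  static PCollection<KV<String, Integer>> sumScores(PCollection<KV<String, Integer>> input) {
    return input
        .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AfterWatermark.pastEndOfWindow()             // ~ AtWatermark()
                .withEarlyFirings(AfterProcessingTime                // ~ AtPeriod(1 min)
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))  // ~ AtCount(1)
            .withAllowedLateness(Duration.standardDays(1))           // how long to accept late data
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
  }
}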
Customizing What Where When How
[Diagram: the same computation under four configurations: 1. Classic Batch, 2. Windowed Batch, 3. Streaming, 4. Streaming + Accumulation.]
Extensible
Transform libraries, IO connectors, and DSLs -- Oh my!
PTransforms
● PTransforms let you build pipelines modularly.
● All transformations are equal, so it’s easy to add new
libraries.
● The runtime can use this structure to communicate with the user.
More details on the Beam Blog: Where’s My PCollection.map()?
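For instance, a user-defined composite transform is simply a PTransform subclass whose expand() chains existing transforms. A sketch, assuming a hypothetical "user,points" line format:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

/** Hypothetical composite: parse "user,points" lines and sum points per user. */
class ExtractAndSumScores
    extends PTransform<PCollection<String>, PCollection<KV<String, Integer>>> {
  @Override
  public PCollection<KV<String, Integer>> expand(PCollection<String> lines) {
    return lines
        .apply("ParseEvent", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((String line) -> {
              String[] parts = line.split(",");
              return KV.of(parts[0], Integer.parseInt(parts[1]));
            }))
        .apply("SumPerUser", Sum.integersPerKey());
  }
}
// Usage: PCollection<KV<String, Integer>> scores = lines.apply(new ExtractAndSumScores());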
Source API
● The Source API allows users to teach the system about new bounded and unbounded data formats.
● Even the most careful hand-tuning will fail as data, code, and environments shift.
● Beam's Source API is designed to provide runtime hooks for efficient scaling.
More details on the Beam Blog: Dynamic Work Rebalancing for Beam
[Diagram: a worker's input range divided into processed & committed, processed and uncommitted, and unprocessed work.]
Language-specific SDKs and DSLs
[Diagram: language-specific SDKs (Language A, B, C) and DSLs (DSL X, DSL Z) built on the Beam Model.]
● Multiple language-specific
SDKs that implement the full
Beam model.
● Optional domain specific
languages that narrow or
transform the abstractions
to align with simple use
cases or specific user
communities.
Portable
Write once, run anywhere
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions
at the core of Apache Beam
● Choice of API: Users write their
pipelines in a language that’s
familiar and integrated with their
other tooling
● Choice of Runtime: Users choose
the right runner for their current
needs -- on-prem / cloud, open
source / not, fully managed / not
● Scalability for Developers: Clean
APIs allow developers to contribute
modules independently
[Diagram: Language A, B, and C SDKs on top of the Beam Model, executing on Runner 1, Runner 2, and Runner 3.]
Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on premise or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on premise or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
Demo screenshots
because if I make them, I won’t need to use them