The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms.
In this talk, I will:
● Briefly cover the capabilities of the Beam model for data processing and integration with IOs, as well as the current state of the Beam ecosystem.
● Discuss the benefits Beam provides regarding portability and ease of use.
● Demo the same Beam pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Flink on Google Cloud, Apache Spark on AWS, Apache Apex on-premise).
● Give a glimpse at some of the challenges Beam aims to address in the future.
Realizing the promise of portability with Apache Beam
1. 1
Realizing the promise of portability
with Apache Beam
https://s.apache.org/beam-portability-slides-jonthebeach
Tyler Akidau
Senior Staff Software Engineer at Google
Apache Beam PMC
@takidau
With many slides by Frances Perry (@francesjperry)
J On the Beach 2017
2. 2
Apache Beam: Open Source data processing APIs
● Expresses data-parallel batch and streaming algorithms using one unified API
● Cleanly separates data processing logic from runtime requirements
● Supports execution on multiple distributed processing runtime environments
3. 3
The evolution of Apache Beam
[Diagram: Google's internal systems -- MapReduce, Colossus, BigTable, Dremel, Megastore, Spanner, Flume, PubSub, MillWheel -- evolving into Cloud Dataflow and, from there, Apache Beam]
4. 4
Table of Contents
01 Expressing data-parallel pipelines with the Beam model
02 The Beam vision for portability
03 Parallel and portable pipelines in practice
04 Getting Started with Apache Beam
7. 7
The Beam Model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
8. 8
The Beam Model: What is being computed?
PCollection<KV<String, Integer>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Sum.integersPerKey());
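To make the first question concrete, here is a minimal, Beam-free Java sketch of what Sum.integersPerKey() computes over parsed KV<String, Integer> records. The "key,value" line format and the parse helper are assumptions, standing in for the slide's ParseFn:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerKeySum {
  // Stand-in for ParseFn: parses lines of the assumed "key,value" form.
  static Map.Entry<String, Integer> parse(String line) {
    String[] parts = line.split(",");
    return Map.entry(parts[0], Integer.parseInt(parts[1]));
  }

  // Conceptual equivalent of Sum.integersPerKey(): one total per key.
  static Map<String, Integer> sumPerKey(Iterable<String> lines) {
    Map<String, Integer> totals = new HashMap<>();
    for (String line : lines) {
      Map.Entry<String, Integer> kv = parse(line);
      totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
    }
    return totals;
  }

  public static void main(String[] args) {
    System.out.println(sumPerKey(List.of("a,1", "b,2", "a,3")));
  }
}
```

In the classic batch view this answers only "what results are calculated"; the following slides layer the where/when/how dimensions on top.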
10. 10
The Beam Model: Where in event time?
PCollection<KV<String, Integer>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
    .apply(Sum.integersPerKey());
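The effect of Window.into(FixedWindows.of(...)) can be sketched without Beam: each element carries an event timestamp and is assigned to the 2-minute window containing it, so sums are kept per (key, window) rather than per key alone. The "key@windowStart" composite key is an illustrative assumption, not a Beam API:

```java
import java.util.Map;

public class FixedWindowSum {
  static final long WINDOW_MILLIS = 2 * 60 * 1000; // 2-minute fixed windows

  // Start of the window containing the given event timestamp.
  static long windowStart(long eventTimeMillis) {
    return eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
  }

  // Accumulate into per-(key, window) totals, keyed "key@windowStart".
  static void add(Map<String, Integer> totals, String key, int value, long eventTimeMillis) {
    totals.merge(key + "@" + windowStart(eventTimeMillis), value, Integer::sum);
  }
}
```

Because the assignment depends only on the element's event timestamp, late-arriving data still lands in the window where it logically belongs.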
12. 12
The Beam Model: When in processing time?
PCollection<KV<String, Integer>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                 .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
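The AtWatermark() trigger in the slide's shorthand can be reduced to a single predicate: a window's result is materialized once the watermark, the pipeline's estimate of event-time progress, passes the end of that window. A minimal sketch of that decision:

```java
public class WatermarkTrigger {
  // Fire the pane for a window once the watermark reaches or passes
  // the window's end in event time.
  static boolean shouldFire(long watermarkMillis, long windowEndMillis) {
    return watermarkMillis >= windowEndMillis;
  }
}
```

Everything else about triggering (early and late firings, shown on the next slide) refines when, relative to this watermark-based default, panes are emitted.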
14. 14
The Beam Model: How do refinements relate?
PCollection<KV<String, Integer>> input = IO.read(...)
    .apply(ParDo.of(new ParseFn()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                 .triggering(AtWatermark()
                     .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                     .withLateFirings(AtCount(1)))
                 .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
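The accumulatingFiredPanes() choice above is what answers "how do refinements relate": each trigger firing emits the sum over all elements seen so far in the window, so a later pane supersedes earlier ones rather than adding to them. A small, Beam-free sketch of that behavior (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class AccumulatingPanes {
  private int runningSum = 0;
  private final List<Integer> firedPanes = new ArrayList<>();

  // New element arrives in the window.
  void addElement(int value) {
    runningSum += value;
  }

  // A trigger firing materializes the *accumulated* result so far.
  int fire() {
    firedPanes.add(runningSum);
    return runningSum;
  }

  List<Integer> panes() {
    return firedPanes;
  }
}
```

With discarding panes, by contrast, each firing would emit only the delta since the previous firing, and downstream consumers would have to add panes together instead of taking the latest one.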
17. 17
02 The Beam vision for portability
“Write once, run anywhere”
18. 18
Beam Vision: mix and match SDKs and runtimes
● The Beam Model: the abstractions at the core of Apache Beam
● Choice of SDK: Users write their pipelines in a language that's familiar and integrated with their other tooling
● Choice of Runners: Users choose the right runtime for their current needs -- on-prem / cloud, open source / not, fully managed / not
● Scalability for Developers: Clean APIs allow developers to contribute modules independently
[Diagram: Language A, B, and C SDKs layered on the Beam Model, which executes on Runner 1, Runner 2, and Runner 3]
19. 19
Beam Vision: as of May 2017
First stable release: Beam 2.0.0
Beam's Java SDK runs on multiple runtime environments, including:
● Apache Apex
● Apache Flink
● Apache Spark
● Google Cloud Dataflow
● [in development] Apache Gearpump
Beam's Python SDK currently runs on Google Cloud Dataflow.
Cross-language infrastructure is in progress.
[Diagram: Python and Java pipeline construction atop the Beam Model, with Fn Runners executing on Apache Apex, Apache Flink, Apache Spark, Apache Gearpump, and Cloud Dataflow]
20. 20
Example Beam Runners
Apache Spark
● Open-source cluster-computing framework
● Large ecosystem of APIs and tools
● Runs on premise or in the cloud
Apache Flink
● Open-source distributed data processing engine
● High-throughput and low-latency stream processing
● Runs on premise or in the cloud
Google Cloud Dataflow
● Fully-managed service for batch and stream data processing
● Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform
21. 21
How do you build an abstraction layer?
[Diagram: Apache Spark, Cloud Dataflow, and Apache Flink as execution engines, with the connecting abstraction layers left as question marks]
44. 44
Learn more!
Apache Beam
beam.apache.org
Demo code
github.com/davorbonaci/beam-portability-demo
The World Beyond Batch: Streaming 101 and 102
www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
The Dataflow Model paper, VLDB 2015
vldb.org/pvldb/vol8/p1792-Akidau.pdf
Streaming Systems book
www.streamingsystems.net
@takidau on Twitter