The document summarizes a presentation on Apache Beam, a unified model for batch and streaming data processing. It highlights the distinction between event time and processing time, the goal of making pipelines easy to write, and portability across different execution environments. The presentation closes with a call to grow a community-driven Apache Beam ecosystem with contributions from many organizations.
The Beam Vision (for users)
Sum Per Key
input.apply(Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
Runs on: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), ...
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))  // split each line into words
 .apply(Filter.byPredicate(word -> !word.isEmpty()))                            // drop empty tokens
 .apply(Count.perElement())                                                     // count occurrences of each word
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))     // format as "word: count"
 .apply(TextIO.Write.to("gs://..."));
p.run();
What your (Java) Code Looks Like
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
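The following sections take these questions one at a time. As a preview, here is a minimal Java sketch, building on the running Sum-per-key example, of where each question is answered in pipeline code; the particular window, trigger, and accumulation settings are placeholders, not recommendations.

input
    .apply(Window.into(FixedWindows.of(Duration.standardHours(1)))  // Where in event time?
        .triggering(AfterWatermark.pastEndOfWindow())                // When in processing time?
        .withAllowedLateness(Duration.ZERO)                          // (how long to wait for late data)
        .accumulatingFiredPanes())                                   // How do refinements relate?
    .apply(Sum.integersPerKey())                                     // What are you computing?
    .apply(BigQueryIO.Write.to(...));
Java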
The Beam Model: What are you computing?
Aggregations, transformations, ...
Sum Per Key
input.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...));
Java
input | Sum.PerKey()
| Write(BigQuerySink(...))
Python
http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
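Sum is one of the library aggregations; general element-wise transformations are written with ParDo and a DoFn. Below is a minimal Java sketch, assuming an input PCollection of text lines and the annotation-style DoFn of recent Beam releases; the word-splitting logic is illustrative, not from the slides.

input
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Emit one output element per word in the input line.
            for (String word : c.element().split("[^a-zA-Z']+")) {
                if (!word.isEmpty()) {
                    c.output(word);
                }
            }
        }
    }));
Java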
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Event time windowing
[Figure: elements plotted by processing time versus event time, grouped into event-time windows.]
(Implementing processing-time windows: just throw away your data's timestamps and replace them with "now()".)
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
The Beam Model: Where in Event Time?
Sum Per Key
Window Into
input.apply(
        Window.into(
            FixedWindows.of(
                Duration.standardHours(1))))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
Java
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Watermarks & Triggers
The Beam Model: When in Processing Time?
Sum Per Key
Window Into
input
    .apply(Window.into(FixedWindows.of(...))
        .triggering(AfterWatermark.pastEndOfWindow()))
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))
Java
input | WindowInto(FixedWindows(3600),
                   trigger=AfterWatermark())
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
Trigger after end of window
Build a finely-tuned trigger for your use case
AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))
Bill at end of month
Near real-time estimates
Immediate corrections
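Wired into a pipeline for this billing scenario, the trigger might look like the sketch below; the calendar window, allowed lateness, and accumulation mode are assumptions, not from the slide. The on-time pane is the end-of-month bill, early firings give near real-time estimates, and late firings issue immediate corrections.

input
    .apply(Window.into(CalendarWindows.months(1))              // one window per billing month
        .triggering(AfterWatermark.pastEndOfWindow()            // on time: the bill
            .withEarlyFirings(AfterProcessingTime
                .pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1)))      // early: running estimates
            .withLateFirings(AfterPane.elementCountAtLeast(1))) // late: corrections
        .withAllowedLateness(Duration.standardDays(30))
        .accumulatingFiredPanes())
    .apply(Sum.integersPerKey());
Java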
Trigger Catalogue
Basic Triggers:
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)
Composite Triggers:
AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
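The catalogue uses shorthand names; they correspond roughly to the Java SDK trigger builders sketched below. The placeholders n, delta, a, and b are illustrative, and the exact class names are recalled from the Beam Java API rather than taken from the slides.

AfterWatermark.pastEndOfWindow();                                 // AfterEndOfWindow()
AfterPane.elementCountAtLeast(n);                                 // AfterCount(n)
AfterProcessingTime.pastFirstElementInPane().plusDelayOf(delta);  // AfterProcessingTimeDelay(Δ)
AfterFirst.of(a, b);                                              // AfterAny(A, B)
AfterAll.of(a, b);                                                // AfterAll(A, B)
Repeatedly.forever(a);                                            // Repeat(A)
AfterEach.inOrder(a, b);                                          // Sequence(A, B)
Java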
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
Accumulation Mode
The Beam Model: How do refinements relate?
input
.apply(Window.into(...).triggering(...).discardingFiredPanes())
.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...))
[Figure: discarding vs. accumulating panes for the same window; discarding emits only each pane's new values (final pane 5 in the example), while accumulating emits the updated running total (final pane 15).]
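For comparison, the accumulating version of the snippet above differs only in the accumulation-mode call; each firing then reports the updated running sum rather than just the elements that arrived since the previous pane.

input
    .apply(Window.into(...).triggering(...).accumulatingFiredPanes())
    .apply(Sum.integersPerKey())
    .apply(BigQueryIO.Write.to(...))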
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
1. End users: who want to write pipelines in a language that's familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.
The Beam Vision
[Architecture diagram]
SDKs: Beam Java, Beam Python, other languages
Beam Runner API: build and submit a pipeline
Beam Fn API: invoke user-definable functions
Execution: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
Project Setup (vision meets code)
GoogleCloudPlatform/DataflowJavaSDK, cloudera/spark-dataflow, dataArtisans/flink-dataflow → apache/incubator-beam
SDKs, Examples, Integration tests
Runners: Direct (on your laptop), Google Cloud Dataflow, Flink, Spark; in pull request: Apex, Gearpump
I/O Connectors: HDFS, Kafka, BigQuery, Google Cloud Storage, Pubsub, Bigtable, Datastore; in pull request: JMS, Cassandra; proposed: Sqoop, Parquet, JDBC, SocketStream, ...
Committers from Google, Data Artisans, Cloudera, Talend, PayPal
● ~40 commits/week
● Rigorous code review for every commit
Contributors [with GitHub badges] from:
Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>
● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!
… and it has been awesome
apache/incubator-beam
Java SDK: Transition from Dataflow
[Timeline: Dataflow Java 1.x → Apache Beam Java 0.x (we are here) → Apache Beam Java 2.x, running from Feb 2016 to late 2016, with releases marked as bug fixes, features, and breaking changes.]
Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user- and community-driven SDKs, runners, transformation libraries, and I/O connectors.
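A sketch of what portable means in practice: the same pipeline code is pointed at a different engine purely through pipeline options. The runner names in the comments are the current Beam spellings and are assumptions here, since they varied across early releases.

// Select the execution engine at launch time, e.g.:
//   --runner=DirectRunner    (local testing)
//   --runner=FlinkRunner
//   --runner=SparkRunner
//   --runner=DataflowRunner  (Google Cloud Dataflow)
PipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().create();
Pipeline p = Pipeline.create(options);
// ... apply the same transforms as before ...
p.run();
Java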
Why Apache Beam?
http://data-artisans.com/why-apache-beam/
"Wefirmly believe that the Beam model is the
correct programming model for streaming and
batch data processing."
- Kostas Tzoumas (Data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of
sophisticated runners that compete by making
users happy, not [via] API lock in."
- Tyler Akidau (Google)
Creating an Apache Beam Community
Collaborate - Beam is becoming a community-driven
effort with participation from many organizations and
contributors.
Grow - We want to grow the Beam ecosystem and
community with active, open involvement so Beam is
a part of the larger OSS ecosystem.
We love contributions. Join us!
Apache Beam
http://beam.incubator.apache.org/
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Programming Model Overviews
Streaming 101
Streaming 102
The Dataflow Beam Model
Join the community!
User discussions - user-subscribe@beam.incubator.apache.org
Development discussions - dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
Learn More!