Apache Beam (incubating)
Kenneth Knowles
klk@google.com
@KennKnowles
Apache Apex Meetup, 2016-06-27
https://goo.gl/LTLjKt
Agenda
1 Motivation
2 Beam Model
3 Beam Project / Technical Vision
1 Motivation
https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg
Unbounded, delayed, out of order
[Figure: event-time axis from 1:00 to 14:00, with elements timestamped 8:00]
Incoming!
Score per user?
Organizing the stream
Data Processing Tradeoffs
Completeness, Latency, Cost ($$$)
What is important for your application?
[Figure: three sliders, Completeness, Low Latency, and Low Cost ($$$), each ranging from Not Important to Important]
The same sliders are shown for three example applications:
Monthly Billing
Billing estimate
Abuse Detection
2 The Beam Model
Pipeline
PTransform
PCollection
The Beam Vision (for users)
Sum Per Key
input.apply(Sum.integersPerKey())
Java
input | Sum.PerKey()
Python
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating), ...
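What your (Java) Code Looks Like
// Build a pipeline that counts words and writes "word: count" lines.
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))
 // Split each line into words.
 .apply(FlatMapElements.via(line -> Arrays.asList(line.split("[^a-zA-Z']+"))))
 // Drop empty strings left over from splitting.
 .apply(Filter.byPredicate(word -> !word.isEmpty()))
 // Count occurrences of each word.
 .apply(Count.perElement())
 // Format each count as text.
 .apply(MapElements.via(count -> count.getKey() + ": " + count.getValue()))
 .apply(TextIO.Write.to("gs://..."));
p.run();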
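As a rough illustration of what each stage of the pipeline above does, here is a plain-Python sketch that mirrors the same steps on an in-memory list. It is not Beam code and runs on a single machine; the function name `word_count` and the sample lines are assumptions made up for this example.

```python
import re
from collections import Counter


def word_count(lines):
    """Mirror the pipeline stages on an in-memory list of lines (not Beam code)."""
    # FlatMapElements step: split every line into words on non-letter characters.
    words = [w for line in lines for w in re.split(r"[^a-zA-Z']+", line)]
    # Filter step: drop the empty strings that splitting can produce.
    words = [w for w in words if w]
    # Count.perElement step: count occurrences of each distinct word.
    counts = Counter(words)
    # MapElements step: format each (word, count) pair as "word: count".
    return sorted(f"{word}: {n}" for word, n in counts.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for the text files.
    for row in word_count(["to be or not to be", "that is the question"]):
        print(row)
```

The point of the model is that the Java pipeline expresses the same four steps, but a runner may distribute them over many machines.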
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
What are you computing? Aggregations, transformations, ...
The Beam Model: What are you computing?
Sum Per User
The Beam Model: What are you computing?
Sum Per Key
input.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...));
Java
input | Sum.PerKey()
| Write(BigQuerySink(...))
Python
http://beam.apache.org/blog/2016/05/27/where-is-my-pcollection-dot-map.html
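To make the "what" concrete, the following is a minimal plain-Python sketch of a per-key sum over an in-memory list of pairs. It only illustrates the aggregation itself; `sum_per_key` and the sample user names are hypothetical and not part of any Beam SDK.

```python
from collections import defaultdict


def sum_per_key(pairs):
    """Sum integer values per key over an in-memory list (not Beam code)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical (user, score) pairs.
    print(sum_per_key([("ann", 3), ("bob", 4), ("ann", 7)]))
```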
Where in event time? Event time windowing
The Beam Model: Where in Event Time?
Processing Time vs Event Time
Event Time = Processing Time ??
[Figure series: elements plotted on axes Processing Time and Event Time. Labels: Realtime; This is not possible; Processing Delay; Very delayed]
Processing Time windows
(probably are not what you want)
[Figure: processing-time windows on axes Processing Time and Event Time]
Event Time Windows
[Figure: event-time windows on axes Processing Time and Event Time]
(implementing processing time windows)
Just throw away your data's timestamps and replace them with "now()".
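The re-stamping idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Beam API: `window_start`, `restamp`, and the `clock` callable are hypothetical names, and timestamps are plain seconds since midnight.

```python
def window_start(timestamp, size):
    """Start of the fixed window of `size` seconds that contains `timestamp`."""
    return timestamp - (timestamp % size)


def restamp(elements, clock):
    """Throw away each element's own timestamp and replace it with clock()."""
    return [(value, clock()) for value, _event_time in elements]


if __name__ == "__main__":
    hour = 3600
    # An element that happened at 8:05 but is only processed at 12:00:30.
    elements = [("score", 8 * hour + 300)]
    restamped = restamp(elements, clock=lambda: 12 * hour + 30)
    print(window_start(elements[0][1], hour) // hour)   # event-time window hour
    print(window_start(restamped[0][1], hour) // hour)  # processing-time window hour
```

After re-stamping, the element lands in the window for the hour it was processed in, not the hour it happened in, which is why processing-time windows are probably not what you want.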
The Beam Model: Where in Event Time?
Sum Per Key
Window Into
input.apply(Window.into(FixedWindows.of(Duration.standardHours(1))))
     .apply(Sum.integersPerKey())
     .apply(BigQueryIO.Write.to(...))
Java
input | WindowInto(FixedWindows(3600))
      | Sum.PerKey()
      | Write(BigQuerySink(...))
Python
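To show what windowing into fixed one-hour windows does to a per-key sum, here is a small plain-Python sketch. It is not Beam code: `fixed_window`, `windowed_sum_per_key`, and the `(key, value, event_time)` tuple layout are assumptions for this example, with event times in seconds.

```python
from collections import defaultdict


def fixed_window(event_time, size=3600):
    """Return the [start, end) fixed window that contains event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def windowed_sum_per_key(elements, size=3600):
    """Sum values per (key, window); elements are (key, value, event_time)."""
    totals = defaultdict(int)
    for key, value, event_time in elements:
        totals[(key, fixed_window(event_time, size))] += value
    return dict(totals)


if __name__ == "__main__":
    hour = 3600
    # Hypothetical scores: two in the 8:00 window, one in the 9:00 window.
    elements = [("ann", 3, 8 * hour + 10), ("ann", 7, 9 * hour - 1), ("ann", 4, 9 * hour)]
    for (key, (start, end)), total in sorted(windowed_sum_per_key(elements).items()):
        print(key, start // hour, end // hour, total)
```

The grouping key becomes the pair of user key and window, so each hour of event time gets its own sum regardless of when the elements arrive.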
So that's what and where...
When in processing time are results produced? Watermarks & Triggers
Event time windows
[Figure: event-time windows on axes Processing Time and Event Time]
Fixed cutoff (we can do better)
[Figure: same axes; labels: Allowed delay, Concurrent windows]
Perfect watermark
[Figure: perfect watermark on axes Processing Time and Event Time]
Check out Slava's slides from the Strata London 2016 talk on watermarks: https://goo.gl/K4FnqQ
Heuristic Watermark
[Figure series: heuristic watermark on axes Processing Time and Event Time; labels: Current processing time, Late data]
Watermarks measure completeness
? Running Total
✔ Monthly billing
? Abuse Detection
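One way to picture a heuristic watermark and late data is the toy model below. It is a sketch under a loud assumption: the watermark is guessed as the largest event time seen so far minus a fixed `skew`, which is not how any particular runner computes it. The name `classify_arrivals` is hypothetical.

```python
def classify_arrivals(event_times, skew):
    """Label each arrival 'on time' or 'late' against a heuristic watermark.

    event_times are listed in arrival (processing-time) order. The watermark
    is an assumed heuristic: the largest event time seen so far minus skew.
    """
    watermark = float("-inf")
    labels = []
    for event_time in event_times:
        # An element is late if the watermark already passed its event time.
        labels.append("late" if event_time < watermark else "on time")
        watermark = max(watermark, event_time - skew)
    return labels


if __name__ == "__main__":
    arrivals = [100, 105, 103, 120, 90]
    print(classify_arrivals(arrivals, skew=10))
    print(classify_arrivals(arrivals, skew=0))
```

A larger skew makes the watermark more conservative: fewer elements are classified as late (more completeness), but results that wait for the watermark are produced later.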
The Beam Model: When in Processing Time?
Sum Per Key
Window Into
input
.apply(Window.into(FixedWindows.of(...))
.triggering(
AfterWatermark.pastEndOfWindow()))
.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...))
Java
input | WindowInto(FixedWindows(3600),
trigger=AfterWatermark())
| Sum.PerKey()
| Write(BigQuerySink(...))
Python
Trigger after end of window
AfterWatermark.pastEndOfWindow()
[Figure series: axes Processing Time and Event Time; labels: Current processing time, Late data]
High completeness
Potentially high latency
Low cost
Repeatedly.forever(AfterPane.elementCountAtLeast(2))
[Figure series: axes Processing Time and Event Time; label: Current processing time]
Low completeness
Low latency
Cost driven by input
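The behavior of a repeated element-count trigger can be imitated with a short plain-Python loop. This is a sketch, not the Beam trigger implementation: `fire_every_n` is a hypothetical helper, it handles a single key and window, and it assumes each firing discards what it emitted.

```python
def fire_every_n(values, n=2):
    """Emit the sum of the buffered elements each time n new ones arrived.

    Returns (panes, leftover): the emitted sums and the elements still
    buffered because fewer than n arrived since the last firing.
    """
    panes, buffer = [], []
    for value in values:
        buffer.append(value)
        if len(buffer) >= n:
            panes.append(sum(buffer))
            buffer = []  # discard what was just emitted
    return panes, buffer


if __name__ == "__main__":
    print(fire_every_n([3, 7, 1, 4, 5], n=2))
```

Output is produced as soon as enough input arrives, so latency is low, but the number of firings (and hence the cost) grows with the input.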
Build a finely tuned trigger for your use case
AfterWatermark.pastEndOfWindow()
    .withEarlyFirings(
        AfterProcessingTime
            .pastFirstElementInPane()
            .plusDuration(Duration.standardMinutes(1)))
    .withLateFirings(AfterPane.elementCountAtLeast(1))
Bill at end of month
Near real-time estimates
Immediate corrections
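The three behaviors above (estimates before the window closes, a result when the watermark passes, corrections afterwards) can be walked through with a toy simulation for a single window. This is only a sketch: `simulate_firings`, its parameters, and the choice to report a running sum at every firing are assumptions for illustration, not the Beam trigger machinery.

```python
def simulate_firings(arrivals, watermark_pass, early_delay=60):
    """Return (processing_time, kind, running_sum) firings for one window.

    arrivals: (processing_time, value) pairs sorted by processing time.
    watermark_pass: processing time at which the watermark passes the end
    of the window. Each firing reports the sum of everything seen so far.
    """
    firings = []
    total = 0
    timer = None          # pending early firing, set by the first element in a pane
    on_time_fired = False

    def advance_to(now):
        nonlocal timer, on_time_fired
        # Early firing: the delay after the first element in the pane elapsed.
        if timer is not None and timer <= now and timer < watermark_pass:
            firings.append((timer, "early", total))
            timer = None
        # On-time firing: the watermark passed the end of the window.
        if not on_time_fired and watermark_pass <= now:
            firings.append((watermark_pass, "on time", total))
            on_time_fired = True
            timer = None

    for t, value in arrivals:
        advance_to(t)
        total += value
        if on_time_fired:
            firings.append((t, "late", total))  # fire ASAP after each late element
        elif timer is None:
            timer = t + early_delay
    advance_to(float("inf"))
    return firings


if __name__ == "__main__":
    for firing in simulate_firings([(0, 3), (30, 7), (200, 1), (400, 4)], watermark_pass=300):
        print(firing)
```

In the sample run, the early firings play the role of near real-time estimates, the on-time firing is the bill, and the late firing is an immediate correction.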
.withEarlyFirings(after 1 minute)
.withLateFirings(ASAP after each element)
[Figure series: axes Processing Time and Event Time; labels: Current processing time, Late output]
Low completeness
Low latency
Low cost, driven by time
Trigger Catalogue
Basic Triggers
AfterEndOfWindow()
AfterCount(n)
AfterProcessingTimeDelay(Δ)
Composite Triggers
AfterEndOfWindow().withEarlyFirings(A).withLateFirings(B)
AfterAny(A, B)
AfterAll(A, B)
Repeat(A)
Sequence(A, B)
How do refinements relate? Accumulation Mode
The Beam Model: How do refinements relate?
input
.apply(Window.into(...).triggering(...).discardingFiredPanes())
.apply(Sum.integersPerKey())
.apply(BigQueryIO.Write.to(...))
discarding vs accumulating
Elements: 3, 7, then 1, 4
discarding: 10, then 5
accumulating: 10, then 15
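The difference between the two modes comes down to whether each firing reports only the new elements or everything seen so far. A minimal plain-Python sketch follows; `pane_outputs` is a hypothetical helper, and the pane contents [3, 7] and [1, 4] are an assumption chosen to reproduce the outputs on the slide (10 then 5 versus 10 then 15).

```python
def pane_outputs(panes, accumulating):
    """Return the sum emitted at each firing for one key and window.

    panes: list of lists, the new elements that arrived before each firing.
    accumulating=False emits only the new elements' sum (discarding mode);
    accumulating=True emits the running sum of everything seen so far.
    """
    outputs, running = [], 0
    for pane in panes:
        pane_sum = sum(pane)
        running += pane_sum
        outputs.append(running if accumulating else pane_sum)
    return outputs


if __name__ == "__main__":
    panes = [[3, 7], [1, 4]]
    print("discarding:", pane_outputs(panes, accumulating=False))
    print("accumulating:", pane_outputs(panes, accumulating=True))
```

With discarding panes a downstream consumer has to add the outputs up itself; with accumulating panes each output replaces the previous one.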
The Beam Model: Asking the Right Questions
What are you computing?
Where in event time?
When in processing time are results produced?
How do refinements relate?
3 Beam Project / Technical Vision
The Beam Vision
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to run Beam pipelines.
SDKs: Beam Java, Beam Python, Other Languages
Beam Runner API: Build and submit a pipeline
Runners: Apache Flink, Apache Spark, Cloud Dataflow, Apache Apex, Apache Gearpump (incubating)
Beam Fn API: Invoke user-definable functions
Execution
Project Setup (vision meets code)
GoogleCloudPlatform/DataflowJavaSDK cloudera/spark-dataflow dataArtisans/flink-dataflow
apache/incubator-beam
Runners
Direct (on your laptop)
Google Cloud Dataflow
Flink
Spark
In pull request: Apex, Gearpump
I/O Connectors
HDFS
Kafka
BigQuery
Google Cloud Storage, Pubsub, Bigtable, Datastore
In pull request: JMS, Cassandra
Proposed: Sqoop, Parquet, JDBC, SocketStream, ...
SDKs
Examples
Integration tests
sharing
Committers from Google, Data Artisans, Cloudera, Talend, PayPal
● ~40 commits/week
● Rigorous code review for every commit
Contributors [with GitHub badges] from:
Spotify, Intel, Twitter, Capital One, DataTorrent, …, <your name here>
● Improvements to existing I/O connectors
● Improvements to Spark runner
● Utility classes for users
● Documentation fixes
● Bug diagnoses
● New I/O connectors
● Gearpump runner PoC
● Apex runner PoC!
… and it has been awesome
apache/incubator-beam
Java SDK: Transition from Dataflow
[Figure: timeline from Feb 2016 to Late 2016. Tracks: Dataflow Java 1.x; Apache Beam Java 0.x; Apache Beam Java 2.x. Legend: Bug Fix, Feature, Breaking Change. Marker: We are here]
Understanding: Capability Matrix
http://beam.incubator.apache.org/capability-matrix/
Why Apache Beam?
Unified - One model handles batch and streaming use cases.
Portable - Pipelines can be executed on multiple execution environments, avoiding lock-in.
Extensible - Supports user and community-driven SDKs, Runners, transformation libraries, and IO connectors.
Why Apache Beam?
http://data-artisans.com/why-apache-beam/
"We firmly believe that the Beam model is the correct programming model for streaming and batch data processing."
- Kostas Tzoumas (Data Artisans)
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
"We hope it will lead to a healthy ecosystem of sophisticated runners that compete by making users happy, not [via] API lock in."
- Tyler Akidau (Google)
Creating an Apache Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.
We love contributions. Join us!
Learn More!
Apache Beam
http://beam.incubator.apache.org/
Why Apache Beam? (from Data Artisans)
Why Apache Beam? (from Google)
Programming Model Overviews
Streaming 101
Streaming 102
The Dataflow Beam Model
Join the community!
User discussions - user-subscribe@beam.incubator.apache.org
Development discussions - dev-subscribe@beam.incubator.apache.org
Follow @ApacheBeam on Twitter
END
