Apache Beam is Flink's sibling in the Apache family of stream processing frameworks. The Beam and Flink teams work closely together on advancing what is possible in stream processing, including Streaming SQL extensions and code interoperability across both platforms.
Beam was originally developed at Google as the amalgamation of its internal batch and streaming frameworks, built to power exabyte-scale data processing for Gmail, YouTube and Ads. It now powers Google Cloud Dataflow, a fully managed, serverless service, and is also available to run in other public clouds and on premises when deployed in portability mode on Apache Flink, Spark, Samza and other runners. Users regularly run distributed data processing jobs on Beam spanning tens of thousands of CPU cores and processing millions of events per second.
In this session, Sergei Sokolenko, Cloud Dataflow product manager, and Reuven Lax, a founding member of the Dataflow and Beam team, will share Google's learnings from building and operating a global stream processing infrastructure shared by thousands of customers, including:
safe deployment to dozens of geographic locations,
resource autoscaling to minimize processing costs,
separating compute and state storage for better scaling behavior,
dynamic rebalancing of work items away from overutilized worker nodes,
offering a throughput-optimized batch processing capability with the same API as streaming,
grouping and joining of hundreds of terabytes in a hybrid in-memory/on-disk file system,
integrating with the Google Cloud security ecosystem, and other lessons.
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Beam in The Google Cloud - Sergei Sokolenko & Reuven Lax, Google
1. Apache Beam in the Google Cloud
Lessons learned from building and operating a serverless streaming runtime
Reuven Lax, Google (@reuvenlax)
Sergei Sokolenko, Google (@datancoffee)
2. Lessons we learned
Watermarks
Adaptive Scaling: Flow Control
Adaptive Scaling: Autoscaling
Separating compute from state storage
3. History Lesson
[Timeline, 2002–2015: bespoke streaming systems, then Flume, then Millwheel, then Dataflow.]
4. Lesson Learned: Watermarks
A pipeline stage S with a watermark value of t means that all future data seen by S will have a timestamp later than t. In other words, all data older than t has already been processed.
Key use case: process a window once the watermark passes the end of the window, since we expect all data for that window to have arrived already.
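In the Beam model this is exactly an event-time window with a watermark trigger. A minimal sketch using the Beam Python SDK (the keys, values and timestamps here are invented for illustration):

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Invented (key, event-time-in-seconds) pairs; the timestamps drive the watermark.
events = [('user1', 10), ('user2', 35), ('user1', 70)]

with beam.Pipeline() as p:
    (p
     | beam.Create(events)
     # Attach event-time timestamps so windowing is by event time, not arrival time.
     | beam.Map(lambda kv: window.TimestampedValue((kv[0], 1), kv[1]))
     | beam.WindowInto(
           window.FixedWindows(60),                   # 1-minute event-time windows
           trigger=trigger.AfterWatermark(),          # fire once the watermark passes the window end
           accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | beam.CombinePerKey(sum)                        # per-key counts per window
     | beam.Map(print))
```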
5. What Triggers Output?
Traditional batch: the query triggers output. Streaming: when to trigger?
● Standing Query
● Unbounded Data
[Diagram: in batch, running a query over the data produces output; in streaming, a standing query over unbounded data has no obvious trigger point.]
6. Use Case: Anomaly Detection pipelines
An early Millwheel user was an anomaly detection pipeline that built a cubic-spline model for each key.
Once a spline was calculated, it could not be modified. No late data; trigger only when ready!
[Figure: per-key spline models for two example keys, "Bob" and "Sara".]
7. First attempt: leading-edge watermark
Watermark = latest timestamp − δ.
Set δ = 10 minutes to minimize data drops.
[Graph: observed watermark skew over time, peaking at 10 minutes.]
8. First attempt: leading-edge watermark
Too fast: too often, a lot of data was behind this watermark. We ended up with many gaps in the output, which impacted the quality of results.
Too slow: subtracting a fixed delta puts a lower bound on latency. We subtracted 10 minutes because the system is sometimes delayed by 10 minutes; however, most of the time the delay was under 1 minute!
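Boiled down to code, the leading-edge heuristic looks roughly like this (an illustrative sketch with invented names, not Dataflow's implementation); the fixed δ produces both failure modes at once:

```python
# Illustrative sketch of a leading-edge watermark: the newest event time seen,
# minus a fixed skew allowance delta (in seconds).
def leading_edge_watermark(max_event_time_seen: float, delta: float) -> float:
    # "Too fast": anything arriving with a timestamp below this value is late
    # and gets dropped. "Too slow": every result waits at least delta, even
    # though the typical delay is far smaller than the worst case.
    return max_event_time_seen - delta

# delta must cover the worst observed skew (10 minutes here), so output always
# lags by 10 minutes even when the actual delay is under 1 minute.
watermark = leading_edge_watermark(max_event_time_seen=7200.0, delta=600.0)
```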
9. Second attempt: dynamic leading-edge watermark
Still a leading-edge watermark, but with dynamic statistical models to compute how far back the lookback should be.
10. Second attempt: dynamic leading-edge watermark
Still many gaps in the output data:
● The input is too noisy.
● Many delays are unpredictable (e.g. a machine restarting).
● Models take time to adapt, and in that time you are dropping data.
11. Trailing-edge watermark
Tracking the minimum event time of the data still being processed, instead of the leading edge, generally solved the problem.
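An illustrative sketch of the trailing-edge idea (invented names, not Dataflow's implementation): bound the watermark by the oldest event time still unprocessed, and never let it regress:

```python
# Illustrative sketch of a trailing-edge watermark. The watermark is bounded
# by the oldest event time still buffered or in flight, and never moves
# backwards (watermarks must be monotonic; see the definition below).
def trailing_edge_watermark(pending_event_times: list, last_watermark: float) -> float:
    if not pending_event_times:
        return last_watermark
    return max(last_watermark, min(pending_event_times))
```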
12. Watermark: Definition
Given a node N of a computation graph G, let {I_n} be the sequence of input elements processed, with the order provided by an oracle. Let t: {I_n} → R be a real-valued function on {I_n}, called the timestamp function. A watermark function is a real-valued function W defined on prefixes of {I_n} satisfying the following:
● {W_n} = {W({I_1, …, I_n})} is eventually increasing.
● {W_n} is monotonic.
W is said to be a temporal watermark if it also satisfies W_n < t(I_m) for m ≥ n.
13. Load Variance
Streaming pipelines must keep up with their input.
Load varies throughout the day and throughout the week, and spikes can happen at any time.
14. Load Variance: Hand Tuning
Every pipeline is different, and hand tuning is hard.
Eventually tuning parameters go stale, and hand-tuned flags become cargo-cult science.
Must tune for the worst case:
● Tuning for the peak is wasteful.
● If the pipeline ever falls behind, it must be able to catch up faster than real time.
○ An exactly-once streaming system is a batch system whenever it falls behind.
15. Techniques: Batching
Always process data in batches.
Batch sizes are dynamic: small when caught up, large while catching up (a sketch follows below).
Lesson: be careful about putting arbitrary limits on batches.
● Don't limit by event-time ranges - event-time density changes.
● Don't limit by windows - window policies change.
Batching limits will be especially painful while catching up on a backlog.
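A minimal sketch of the dynamic batch-size idea referenced above (invented function and bounds, not Dataflow's actual policy): the batch grows with the backlog rather than being capped by event-time ranges or windows:

```python
# Illustrative: batch size scales with the backlog, clamped to sane bounds.
def next_batch_size(backlog_elements: int,
                    min_batch: int = 10,
                    max_batch: int = 10_000) -> int:
    # Small batches when caught up (low latency); large batches while
    # draining a backlog (high throughput, catching up faster than real time).
    return max(min_batch, min(max_batch, backlog_elements))
```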
16. Techniques: Flow Control
A good adaptive backpressure system is critical:
● It prevents workers from overloading and crashing.
● Adaptive backpressure adapts to changing load.
● It reduces the need to perfectly tune the cluster.
17. Techniques: Flow Control
Soft resource: CPU. Hard resource: memory.
Signals:
● Queue length
● Memory usage
Eventually, flow control will pause pulling from sources.
[Diagram: three workers (Worker 1, 2, 3), each running stages A, B and C; one inter-worker stream is flow controlled.]
18. Techniques: Flow Control
What happens if all streams are flow controlled? Deadlock!
● Every worker is holding onto memory for pending deliveries.
● Every worker is flow controlling its input streams.
[Diagram: the same three workers, now with every inter-worker stream flow controlled.]
19. Techniques: Flow Control
To avoid deadlock, workers must be able to release memory. This might involve canceling in-flight work to be retried later.
Dataflow streaming workers can spill pending deliveries to disk to release memory; the spilled deliveries are scanned back in later (a sketch follows below).
[Diagram: the same three flow-controlled workers, with pending deliveries spilling to disk.]
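A rough sketch of that spill-to-disk escape hatch (invented names and pickle-based storage; not Dataflow's internals): under memory pressure, the oldest pending deliveries move to disk so the worker can release memory instead of deadlocking:

```python
import collections
import pickle
import tempfile

# Illustrative only: a buffer of pending deliveries that spills its oldest
# entries to a temporary file under memory pressure, releasing memory so
# flow-controlled workers cannot deadlock each other.
class PendingDeliveries:
    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self.in_memory = collections.deque()
        self.spill = tempfile.TemporaryFile()
        self.spilled = 0

    def add(self, delivery) -> None:
        self.in_memory.append(delivery)
        if len(self.in_memory) > self.max_in_memory:
            # Spill the oldest pending deliveries to disk to free memory.
            while len(self.in_memory) > self.max_in_memory // 2:
                pickle.dump(self.in_memory.popleft(), self.spill)
                self.spilled += 1

    def scan_back_in(self) -> None:
        # Later, when there is capacity again, read spilled deliveries back.
        self.spill.seek(0)
        restored = [pickle.load(self.spill) for _ in range(self.spilled)]
        self.in_memory.extendleft(reversed(restored))  # preserve FIFO order
        self.spilled = 0
        self.spill = tempfile.TemporaryFile()
```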
20. Techniques: Auto Scaling
Adaptive autoscaling allows elastic scaling with load.
Work must be dynamically load-balanced to take advantage of autoscaling.
21. Techniques: Auto Scaling
Never assume fixed workers: work ownership can be moved at any time.
All keys are hash-sharded, and hash ranges are distributed among workers. RPCs are addressed to keys, not workers (see the sketch below).
Separating storage from compute adds a lot of complexity to the exactly-once and consistency protocols!
[Diagram: hash ranges [0, 3), [3, a) and [a, f) assigned to workers via a routing table: [0, 3) → worker23, [3, a) → worker32, [a, f) → worker32.]
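A toy sketch of key-addressed routing (the hash ranges and worker names are taken from the diagram; the hashing scheme is invented): senders look up the current owner of a key's hash, so ownership can move by rewriting the table without touching senders:

```python
import bisect
import hashlib

# Range table from the diagram: [0, 3) -> worker23, [3, a) -> worker32,
# [a, f) -> worker32. Ownership moves by rewriting this table.
RANGE_STARTS = [0x0, 0x3, 0xA]
OWNERS = ['worker23', 'worker32', 'worker32']

def worker_for(key: bytes) -> str:
    # Hash the key into [0x0, 0xF] and find the hash range that owns it.
    shard = hashlib.sha1(key).digest()[0] >> 4
    return OWNERS[bisect.bisect_right(RANGE_STARTS, shard) - 1]

# RPCs are addressed to keys, not workers:
target = worker_for(b'user-42')
```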
22. Load Variance: Lesson
Dynamic control is key: no amount of static configuration works. Eventually, the universe will outsmart your configuration.
24. Streaming processing options in GCP
[Architecture diagram: IoT events enter via Cloud Pub/Sub and flow into Dataflow Streaming; results feed BigQuery (through the BigQuery Streaming API) for data warehousing, Cloud AI Platform for machine learning, and Bigtable, which drives actions; Dataflow Batch, Cloud Composer, DBs and end-user apps complete the picture.]
25. Motivating Example: Spotify migrating the largest European Hadoop cluster to Dataflow
● Runs 80,000+ Dataflow jobs per month
● 90% batch, 10% streaming
Uses Dataflow for "everything":
● Music recommendations, ads targeting
● A/B testing, behavioral analysis, business metrics
Huge batch jobs:
● 26,000 CPUs, 166 TB of RAM
● Processing 325 billion rows (240 TB) from Bigtable
26. Traditional Distributed Data Processing Architecture
● Jobs are executed on clusters of VMs.
● Job state is stored on network-attached volumes.
● The control plane orchestrates the data plane.
[Diagram: user code running on four VMs, each with network-attached state storage, coordinated over the network by a control-plane VM.]
27. Traditional Architecture works well ...
… except for Joins and Group By's.
[Diagram: a pipeline of Filter, Join and Group stages reading from and writing to file systems (fs://) and databases.]
28. Shuffling key-value pairs
● Starting with <K, V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
[Figure: four workers, each holding an unsorted mix of <key, record> pairs, e.g. <key1, record>, <key5, record>, <key3, record>, …]
29. Shuffling key-value pairs
● Starting with <K, V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K, V> pairs
[Figure: the same four workers exchanging <key, record> pairs over the network.]
30. Shuffling key-value pairs
● Starting with <K, V> pairs placed on different workers
● Goal: co-locate all pairs with the same key
● Workers exchange <K, V> pairs (see the sketch below)
● Until everything is sorted
[Figure: after the exchange, each worker holds a contiguous, sorted key range: key1-key2, key3-key4, key5-key6, key7-key8.]
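The exchange step can be sketched in a few lines of plain Python (toy data, single process; a real shuffle streams over the network): each <K, V> pair is routed to the worker that owns hash(K), so all records for a key end up co-located:

```python
import collections

# Illustrative: route every <K, V> pair to the worker owning hash(K).
def shuffle(per_worker_records, num_workers):
    destinations = [collections.defaultdict(list) for _ in range(num_workers)]
    for records in per_worker_records:          # records held per source worker
        for key, value in records:
            destinations[hash(key) % num_workers][key].append(value)
    return destinations

# Two workers' records before the exchange; afterwards key5 is co-located.
shards = shuffle([[('key1', 'r1'), ('key5', 'r2')],
                  [('key5', 'r3'), ('key3', 'r4')]], num_workers=2)
```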
31. Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs
[Diagram: the same cluster of user-code VMs with network-attached state storage and a control plane.]
32. Distributed in-memory Shuffle in batch Cloud Dataflow
[Architecture diagram: pipeline user code runs on Compute workers, while shuffling operations are offloaded over a petabit network to the Dataflow Shuffle service: a shuffle proxy backed by a distributed in-memory file system and a distributed on-disk file system, with automatic zone placement across zones 'a', 'b' and 'c' of a region.]
33. No tuning required
Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles backed by SSD persistent disks (SSD-PD).
Better autoscaling keeps aggregate resource usage the same, but cuts processing time.
[Chart: "Faster Processing" - shuffle runtime in minutes.]
34. Shuffle 300 TB+
Dataflow Shuffle has been used to shuffle 300 TB+ datasets.
[Chart: "Supporting larger datasets" - shuffle dataset size in TB.]
35. Storing state
What about streaming pipelines?
Streaming shuffle: just like in batch, streams need to be grouped and joined, which requires a distributed streaming shuffle.
Window data elements: time-window aggregations need to be buffered until triggering conditions occur (see the sketch below).
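At the API level, Beam's state and timers expose this buffering directly: elements accumulate in per-key state until an event-time timer fires once the watermark passes the window end. An illustrative Beam Python DoFn (a sketch, not Dataflow's internal implementation):

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferUntilWatermark(beam.DoFn):
    """Buffers integer values per key until the watermark passes the window end."""
    BUFFER = BagStateSpec('buffer', VarIntCoder())
    FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

    def process(self, element,
                buffer=beam.DoFn.StateParam(BUFFER),
                flush=beam.DoFn.TimerParam(FLUSH),
                win=beam.DoFn.WindowParam):
        _key, value = element
        buffer.add(value)        # buffered in state storage, not worker memory
        flush.set(win.end)       # triggering condition: watermark passes window end

    @on_timer(FLUSH)
    def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())  # emit the aggregate, then clear the buffer
        buffer.clear()
```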
36. Goal: Grouping by Event Time into Time Windows
[Figure: input elements arriving between 9:00 and 14:00 in processing time are grouped into 9:00-14:00 event-time windows to produce output.]
37. Even more state to store on disks in streaming
Shuffle data elements:
● Key ranges are assigned to workers.
● Data elements for those keys are stored on Persistent Disks.
Time-window data:
● Also assigned to workers.
● When time windows close, the data is processed on the workers.
[Diagram: four workers with attached state storage, each owning a contiguous key range, e.g. key 0000 … key 1234, key 1235 … key ABC2, key ABC3 … key DEF5, key DEF6 … key GHI2.]
38. Dataflow Streaming Engine
Benefits:
● Better supportability
● Fewer worker resources
● More efficient autoscaling
[Diagram: user code stays on the workers, while streaming shuffle and window state storage move into the Streaming Engine service.]
39. Autoscaling: Even better with separate Compute and State Storage
[Diagram: with Streaming Engine, workers run only user code while streaming shuffle and window state storage live in the service, so compute scales independently; without Streaming Engine, each VM couples user code with its own state storage and key range (key 0000 … key 1234, key 1235 … key ABC2).]
41. AB Tasty is using Dataflow Streaming Engine
● Personalization and experimentation platform
● Wanted things to work out of the box
Significant data volumes:
● 25 million user sessions per day
● 2B events per day
Dataflow usage profile:
● Streaming Engine for worry-free autoscaling
● Batch processing with FlexRS for cost savings
42. Main Takeaways
Trailing-edge watermarks provided a solution for triggering aggregations.
The system must be elastic and adaptive.
Separating compute from state storage helps make both stream and batch processing scalable.