Advances in Stream Analytics:
Google Cloud Dataflow and Apache Beam
Kyiv, October 5th, 2019
Sergei Sokolenko
Google
Session overview
● Your choices for doing Streaming Processing in Google Cloud
● Separating State Storage from Compute
● Autoscaling
● Making Streaming Easy
Google Cloud Platform: our global infrastructure
Map: submarine cable investments (PLCN 2019, Faster 2016, Unity 2010, Dunant 2020, Monet 2017, Junior 2018, Tannat 2018, SJC 2013, Indigo 2019, HK-G 2019, JGA 2019, Curie 2019, Havfrue 2019); network edge points of presence and CDN nodes in cities worldwide; Dedicated Interconnect locations; current and future regions with their zone counts (Oregon, Los Angeles, Iowa, S. Carolina, N. Virginia, Montréal, São Paulo, Finland, Frankfurt, Zurich, Belgium, London, Netherlands, Mumbai, Singapore, Jakarta, Sydney, Tokyo, Osaka, Hong Kong, Taiwan, Seoul, Salt Lake City).
A comprehensive Big Data platform, not just infrastructure
● Data ingestion at any scale
● Reliable streaming data pipeline
● Advanced analytics
● Data warehousing and data lake
Products: Apache Beam, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, Data Transfer Service, Cloud Composer, Cloud IoT Core, Cloud Dataprep, Cloud AI Services, Google Data Studio, TensorFlow, Sheets, Storage Transfer Service, Data Catalog, Data Fusion
Google’s data processing timeline, 2002–2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam
Why FlumeJava
MapReduce: parallel MAP tasks transform (K,V) pairs into (K,V*), and REDUCE tasks combine them into (K,W).
MapReduce can quickly get out of hand
One Google pipeline had 116 stages!
DAGs offer a better abstraction from execution
Example DAG: two sources (fs://, Database) are each filtered, then joined, grouped, filtered again, and written to two sinks (fs://, Database).
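To make the DAG concrete, here is a minimal sketch of a pipeline with the same shape in the Apache Beam Python SDK; the file names, fields, and filter conditions are illustrative assumptions, not part of the slide.

import apache_beam as beam

# A sketch of the DAG above: two filtered inputs, a join (CoGroupByKey),
# and a final filter feeding two sinks. The runner is free to optimize
# and parallelize this graph as it sees fit.
with beam.Pipeline() as p:
    orders = (
        p
        | 'ReadOrders' >> beam.io.ReadFromText('orders.csv')          # source 1 (hypothetical file)
        | 'ParseOrders' >> beam.Map(lambda line: line.split(','))
        | 'FilterOrders' >> beam.Filter(lambda r: r[2] == 'COMPLETED')
        | 'KeyOrders' >> beam.Map(lambda r: (r[0], r)))                # key by customer id
    customers = (
        p
        | 'ReadCustomers' >> beam.io.ReadFromText('customers.csv')     # source 2 (hypothetical file)
        | 'ParseCustomers' >> beam.Map(lambda line: line.split(','))
        | 'FilterCustomers' >> beam.Filter(lambda r: r[1] != '')
        | 'KeyCustomers' >> beam.Map(lambda r: (r[0], r)))
    joined = (
        {'orders': orders, 'customers': customers}
        | 'JoinAndGroup' >> beam.CoGroupByKey()                        # join + group by key
        | 'FilterJoined' >> beam.Filter(lambda kv: len(kv[1]['orders']) > 0))
    joined | 'WriteFile' >> beam.io.WriteToText('joined_output')       # sink 1
    joined | 'WriteLog' >> beam.Map(print)                             # sink 2 (stand-in for a database write)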
By 2025, more than a quarter of data created in the global datasphere will be real time in nature. (*IDC)
Data Streams and Late Arriving Data
Diagram: elements plotted on a processing-time axis (8:00–14:00) carry earlier event times (8:00); data can arrive long after the event it describes.

Goal: Grouping by Event Time into Time Windows
Diagram: input arriving in processing-time order (9:00–14:00) is grouped into output windows by event time (9:00–14:00).
MillWheel - low-latency, accurate data-processing
Common steps in Stream Analytics: a reference architecture for Streaming Processing in GCP
● Ingest & distribute: Cloud Pub/Sub collects events from IoT devices, end-user apps, and DBs
● Aggregate, enrich, detect: Dataflow Streaming
● Backfill, reprocess: Dataflow Batch
● Action: Bigtable for serving, BigQuery (via the BigQuery Streaming API) for Data Warehousing, Cloud AI Platform for Machine Learning
● Orchestrate: Cloud Composer
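A minimal sketch of the Pub/Sub → Dataflow → BigQuery leg of this architecture in the Beam Python SDK; the topic name, table name, and schema below are placeholder assumptions, not values from the talk.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

with beam.Pipeline(options=options) as p:
    (
        p
        | 'ReadEvents' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/events')            # hypothetical topic
        | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:analytics.events',                         # hypothetical table
            schema='user:STRING,amount:FLOAT,event_timestamp:TIMESTAMP',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))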
What is Beam and Dataflow?
Apache Beam (the SDK):
● Open source programming model
● Unified batch and streaming
● Top Apache project by dev@ activity
● Runner and language portability
Cloud Dataflow (the service):
● Automatic optimizations scale to millions of QPS
● Serverless, fully managed data processing
● State storage in Shuffle and Streaming Engine
● Exactly-once streaming semantics
The Beam Vision: write “Sum Per Key” once, run it anywhere
Java:   input.apply(Sum.integersPerKey())
Python: input | Sum.PerKey()
Go:     stats.Sum(s, input)
SQL:    SELECT key, SUM(value) FROM input GROUP BY key
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
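For reference, a complete, runnable version of the same “Sum Per Key” pipeline in the Beam Python SDK might look like this (the input data is made up); the slide’s Sum.PerKey() shorthand corresponds to a combine-per-key step:

import apache_beam as beam

# Sum values per key; the same code runs on any of the runners above,
# selected via pipeline options (e.g. --runner=DataflowRunner).
with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([('a', 1), ('b', 2), ('a', 3), ('b', 4)])
        | 'SumPerKey' >> beam.CombinePerKey(sum)
        | 'Print' >> beam.Map(print))   # ('a', 4), ('b', 6)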
Lessons Learned While Building Cloud Dataflow
● Separating compute from state storage
● Automatic scaling
● Building Streaming systems can be hard, but it does not have to be
Separating compute from state storage to improve scalability
Traditional Distributed Data Processing Architecture
● Jobs executed on clusters of VMs running user code
● Job state stored on network-attached volumes
● Control plane orchestrates data plane
Diagram: a control-plane VM coordinates worker VMs over the network, each worker with its own attached state storage.
Traditional Architecture works well ...
… except for Joins and Group By’s
(The same example DAG: sources fs:// and Database → Filter → Join → Group → Filter → sinks fs:// and Database.)
Shuffling key-value pairs
● Unsorted data elements: each worker starts with a mix of keys, e.g. <key1, record>, <key5, record>, <key3, record>, <key8, record>, <key4, record>, ...
● Goal: sort data elements by key, so that all records for a given key end up together
● KV pairs need to be exchanged between nodes
● Until everything is sorted: the final layout assigns contiguous key ranges to workers (key1–key2, key3–key4, key5–key6, key7–key8)
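To make the mechanics concrete, here is a minimal Python sketch of the routing rule behind a shuffle (not Dataflow’s actual implementation): every node applies the same hash-based rule to decide which worker owns a key, so records with the same key land on the same node and can then be grouped or joined.

from collections import defaultdict

def shuffle(records, num_workers):
    """Route each (key, record) pair to the worker that owns the key's partition."""
    partitions = [defaultdict(list) for _ in range(num_workers)]
    for key, record in records:
        worker = hash(key) % num_workers          # same rule on every sending node
        partitions[worker][key].append(record)
    return partitions

records = [('key1', 'a'), ('key5', 'b'), ('key3', 'c'), ('key8', 'd'), ('key5', 'e')]
for i, part in enumerate(shuffle(records, num_workers=4)):
    print('worker', i, dict(part))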
Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs
(Same diagram: control-plane VM, worker VMs running user code, per-worker state storage reached over the network.)
Distributed in-memory Shuffle in batch Cloud Dataflow
● Compute (worker VMs) reaches the Dataflow Shuffle service over a petabit network
● A shuffle proxy fronts a distributed in-memory file system, backed by a distributed on-disk file system
● Shuffle data is spread across zones ‘a’, ‘b’, and ‘c’ of a region, with automatic zone placement
No tuning required
Faster Processing: Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles that use SSD-PD. (Chart: runtime of shuffle, in minutes.)
Supporting larger datasets: Dataflow Shuffle has been used to shuffle 200TB+ datasets. (Chart: dataset size of shuffle, in TB.)
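At the time of this talk the service-side batch shuffle was opt-in; as a sketch (assuming the Beam Python SDK, and noting that this later became the default for batch jobs in supported regions), a job selected it through its pipeline options:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: request the service-based Dataflow Shuffle for a batch job.
# Project, region, and bucket names are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    experiments=['shuffle_mode=service'],   # opt in to Dataflow Shuffle
)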
Storing state: what about streaming pipelines?
● Streaming shuffle: just like in batch, streams need to be grouped and joined, via a distributed streaming shuffle
● Window data elements: late arriving data requires buffering time-window data, accumulating elements until triggering conditions occur
Goal: Grouping by Event Time into Time Windows
Diagram (repeated): input arriving in processing-time order (9:00–14:00) is grouped into output windows by event time (9:00–14:00).
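As a sketch of how this looks in the Beam model (Python SDK; the window size, lateness bound, and data shape are illustrative assumptions): elements are assigned to event-time windows, buffered, and emitted when the trigger fires, with late data accepted up to the allowed lateness.

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

# Group (key, value) events into 1-minute event-time windows, emit when the
# watermark passes the end of the window, and re-emit accumulated results for
# elements that arrive up to 10 minutes late.
def window_and_sum(events):
    return (
        events
        | 'Window' >> beam.WindowInto(
            window.FixedWindows(60),                       # 1-minute windows in event time
            trigger=AfterWatermark(late=AfterCount(1)),    # fire at watermark, then per late element
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=600)                          # buffer late data for 10 minutes
        | 'SumPerKey' >> beam.CombinePerKey(sum))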
Even more state to store on disks in streaming
● Shuffle data elements: key ranges are assigned to workers (key 0000 … key 1234, key 1235 … key ABC2, key ABC3 … key DEF5, key DEF6 … key GHI2), and the data elements for these keys are stored on Persistent Disks
● Time window data: also assigned to workers; when time windows close, the data is processed on the workers
Dataflow Streaming Engine
The streaming shuffle and window state storage move out of the workers into the Streaming Engine backend; workers run only user code.
Benefits:
● Better supportability
● Fewer worker resources
● Smoother autoscaling
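As a sketch (assuming the Beam Python SDK; the exact option name can vary by SDK version), a streaming job opts in to Streaming Engine through its pipeline options when launching on Dataflow:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: with Streaming Engine enabled, streaming shuffle and window state
# live in the Dataflow backend instead of on worker persistent disks.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder project
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    streaming=True,
    enable_streaming_engine=True,
)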
Autoscaling: even better with separate Compute and State Storage
● Dataflow with Streaming Engine: workers run only user code, while window state storage and streaming shuffle live in the Streaming Engine, so workers can be added or removed without moving state
● Dataflow without Streaming Engine: each worker VM also owns a key range of state storage (key 0000 … key 1234, key 1235 … key ABC2, ...), so scaling means redistributing that state
(Chart: autoscaling behavior of Dataflow with Streaming Engine vs. Dataflow without Streaming Engine.)
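As a sketch (again assuming the Beam Python SDK; these option names are my assumption of what Dataflow accepts, not values from the talk), autoscaling is selected and bounded through pipeline options:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: select throughput-based autoscaling and cap the worker count.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                        # placeholder project
    region='us-central1',
    streaming=True,
    enable_streaming_engine=True,
    autoscaling_algorithm='THROUGHPUT_BASED',    # let Dataflow scale workers with load
    max_num_workers=20,                          # upper bound for scale-out
)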
Streaming can be hard, but it does not have to be.
We’ve set out to make Streaming as accessible as Batch.
Easy Stream Analytics in SQL
Example: join two inputs (Input1, Input2), group, and produce the output —
SELECT input1.*, input2.*
FROM input1 LEFT OUTER JOIN input2
ON input1.Id = input2.Id
Use Dataflow SQL from the BigQuery UI:
● Join Pub/Sub Streams with Files or Tables
● Write into BigQuery for dashboarding
● Store Pub/Sub schema in Data Catalog
● Use SQL skills for streaming data processing
Demo: Streaming Analytics with SQL
Pipeline: Transactions → Pub/Sub topic → Dataflow streaming SQL pipeline (joined with a BigQuery table) → BigQuery tables

SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.payload.amount) AS amount
FROM `pubsub.dataflow-sql.transactions` AS tr
INNER JOIN
  `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr
  ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
Main takeaways
● Google Cloud offers both infrastructure-as-a-service and fully managed services
● Separating compute from state storage helps make stream and batch processing scalable
● SQL brings the complexity of Streaming Processing way down
Thank you!
