Advances in Stream Analytics:
Google Cloud Dataflow and Apache Beam
Kyiv, October 5th, 2019
Sergei Sokolenko
Google
Session overview
● Your choices for doing Streaming Processing in Google Cloud
● Separating State Storage from Compute
● Autoscaling
● Making Streaming Easy
Google Cloud Platform: our global infrastructure
Map: submarine cable investments (PLCN 2019, Faster 2016, Unity 2010, Dunant 2020, Monet 2017, Junior 2018, Tannat 2018, SJC 2013, Indigo 2019, HK-G 2019, JGA 2019, Curie 2019, Havfrue 2019); network edge points of presence and CDN nodes in cities worldwide; Dedicated Interconnect locations; current and future regions with their zone counts (Oregon, Los Angeles, Iowa, S. Carolina, N. Virginia, Montréal, São Paulo, Finland, Frankfurt, Zurich, Belgium, London, Netherlands, Mumbai, Singapore, Jakarta, Sydney, Tokyo, Osaka, Hong Kong, Taiwan, Seoul, Salt Lake City).
A comprehensive Big Data platform, not just infrastructure
● Data ingestion at any scale
● Reliable streaming data pipeline
● Advanced analytics
● Data warehousing and data lake
Products: Apache Beam, Cloud Pub/Sub, Cloud Dataflow, Cloud Dataproc, BigQuery, Cloud Storage, Data Transfer Service, Cloud Composer, Cloud IoT Core, Cloud Dataprep, Cloud AI Services, Google Data Studio, TensorFlow, Sheets, Storage Transfer Service, Data Catalog, Data Fusion
Google’s data processing timeline, 2002–2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow, Apache Beam
Why FlumeJava
MapReduce: parallel MAP tasks transform (K,V) pairs into (K,V*), and REDUCE tasks combine them into (K,W).
MapReduce can quickly get out of hand
One Google pipeline had 116 stages!
DAGs offer a better abstraction from execution
Example DAG: two sources (fs://, Database) are each filtered, then joined, grouped, filtered again, and written to two sinks (fs://, Database).
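To make the DAG concrete, here is a minimal sketch of a pipeline with the same shape in the Apache Beam Python SDK; the file names, fields, and filter conditions are illustrative assumptions, not part of the slide.

import apache_beam as beam

# A sketch of the DAG above: two filtered inputs, a join (CoGroupByKey),
# and a final filter feeding two sinks. The runner is free to optimize
# and parallelize this graph as it sees fit.
with beam.Pipeline() as p:
    orders = (
        p
        | 'ReadOrders' >> beam.io.ReadFromText('orders.csv')          # source 1 (hypothetical file)
        | 'ParseOrders' >> beam.Map(lambda line: line.split(','))
        | 'FilterOrders' >> beam.Filter(lambda r: r[2] == 'COMPLETED')
        | 'KeyOrders' >> beam.Map(lambda r: (r[0], r)))                # key by customer id
    customers = (
        p
        | 'ReadCustomers' >> beam.io.ReadFromText('customers.csv')     # source 2 (hypothetical file)
        | 'ParseCustomers' >> beam.Map(lambda line: line.split(','))
        | 'FilterCustomers' >> beam.Filter(lambda r: r[1] != '')
        | 'KeyCustomers' >> beam.Map(lambda r: (r[0], r)))
    joined = (
        {'orders': orders, 'customers': customers}
        | 'JoinAndGroup' >> beam.CoGroupByKey()                        # join + group by key
        | 'FilterJoined' >> beam.Filter(lambda kv: len(kv[1]['orders']) > 0))
    joined | 'WriteFile' >> beam.io.WriteToText('joined_output')       # sink 1
    joined | 'WriteLog' >> beam.Map(print)                             # sink 2 (stand-in for a database write)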
By 2025, more than a quarter of data created in the global datasphere will be real time in nature. (*IDC)
Data Streams and Late Arriving Data
Diagram: elements plotted on a processing-time axis (8:00–14:00) carry earlier event times (8:00); data can arrive long after the event it describes.

Goal: Grouping by Event Time into Time Windows
Diagram: input arriving in processing-time order (9:00–14:00) is grouped into output windows by event time (9:00–14:00).
MillWheel - low-latency, accurate data-processing
Common steps in Stream Analytics: a reference architecture for Streaming Processing in GCP
● Ingest & distribute: Cloud Pub/Sub collects events from IoT devices, end-user apps, and DBs
● Aggregate, enrich, detect: Dataflow Streaming
● Backfill, reprocess: Dataflow Batch
● Action: Bigtable for serving, BigQuery (via the BigQuery Streaming API) for Data Warehousing, Cloud AI Platform for Machine Learning
● Orchestrate: Cloud Composer
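A minimal sketch of the Pub/Sub → Dataflow → BigQuery leg of this architecture in the Beam Python SDK; the topic name, table name, and schema below are placeholder assumptions, not values from the talk.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # Pub/Sub is an unbounded source

with beam.Pipeline(options=options) as p:
    (
        p
        | 'ReadEvents' >> beam.io.ReadFromPubSub(
            topic='projects/my-project/topics/events')            # hypothetical topic
        | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:analytics.events',                         # hypothetical table
            schema='user:STRING,amount:FLOAT,event_timestamp:TIMESTAMP',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))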
What is Beam and Dataflow?
Apache Beam (the SDK):
● Open source programming model
● Unified batch and streaming
● Top Apache project by dev@ activity
● Runner and language portability
Cloud Dataflow (the service):
● Automatic optimizations scale to millions of QPS
● Serverless, fully managed data processing
● State storage in Shuffle and Streaming Engine
● Exactly-once streaming semantics
The Beam Vision: write “Sum Per Key” once, run it anywhere
Java:   input.apply(Sum.integersPerKey())
Python: input | Sum.PerKey()
Go:     stats.Sum(s, input)
SQL:    SELECT key, SUM(value) FROM input GROUP BY key
Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
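For reference, a complete, runnable version of the same “Sum Per Key” pipeline in the Beam Python SDK might look like this (the input data is made up); the slide’s Sum.PerKey() shorthand corresponds to a combine-per-key step:

import apache_beam as beam

# Sum values per key; the same code runs on any of the runners above,
# selected via pipeline options (e.g. --runner=DataflowRunner).
with beam.Pipeline() as p:
    (
        p
        | 'Create' >> beam.Create([('a', 1), ('b', 2), ('a', 3), ('b', 4)])
        | 'SumPerKey' >> beam.CombinePerKey(sum)
        | 'Print' >> beam.Map(print))   # ('a', 4), ('b', 6)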
Lessons Learned While Building Cloud Dataflow
● Separating compute from state storage
● Automatic scaling
● Building Streaming systems can be hard, but it does not have to be
Separating compute from state storage to improve scalability
Traditional Distributed Data Processing Architecture
● Jobs executed on clusters of VMs running user code
● Job state stored on network-attached volumes
● Control plane orchestrates data plane
Diagram: a control-plane VM coordinates worker VMs over the network, each worker with its own attached state storage.
Traditional Architecture works well ...
… except for Joins and Group By’s
(The same example DAG: sources fs:// and Database → Filter → Join → Group → Filter → sinks fs:// and Database.)
Shuffling key-value pairs
● Unsorted data elements: each worker starts with a mix of keys, e.g. <key1, record>, <key5, record>, <key3, record>, <key8, record>, <key4, record>, ...
● Goal: sort data elements by key, so that all records for a given key end up together
● KV pairs need to be exchanged between nodes
● Until everything is sorted: the final layout assigns contiguous key ranges to workers (key1–key2, key3–key4, key5–key6, key7–key8)
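To make the mechanics concrete, here is a minimal Python sketch of the routing rule behind a shuffle (not Dataflow’s actual implementation): every node applies the same hash-based rule to decide which worker owns a key, so records with the same key land on the same node and can then be grouped or joined.

from collections import defaultdict

def shuffle(records, num_workers):
    """Route each (key, record) pair to the worker that owns the key's partition."""
    partitions = [defaultdict(list) for _ in range(num_workers)]
    for key, record in records:
        worker = hash(key) % num_workers          # same rule on every sending node
        partitions[worker][key].append(record)
    return partitions

records = [('key1', 'a'), ('key5', 'b'), ('key3', 'c'), ('key8', 'd'), ('key5', 'e')]
for i, part in enumerate(shuffle(records, num_workers=4)):
    print('worker', i, dict(part))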
Traditional Architecture Requires Manual Tuning
● When data volumes exceed dozens of TBs
(Same diagram: control-plane VM, worker VMs running user code, per-worker state storage reached over the network.)
Distributed in-memory Shuffle in batch Cloud Dataflow
● Compute (worker VMs) reaches the Dataflow Shuffle service over a petabit network
● A shuffle proxy fronts a distributed in-memory file system, backed by a distributed on-disk file system
● Shuffle data is spread across zones ‘a’, ‘b’, and ‘c’ of a region, with automatic zone placement
No tuning required
Faster Processing: Dataflow Shuffle is usually faster than worker-based shuffle, including shuffles that use SSD-PD. (Chart: runtime of shuffle, in minutes.)
Supporting larger datasets: Dataflow Shuffle has been used to shuffle 200TB+ datasets. (Chart: dataset size of shuffle, in TB.)
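At the time of this talk the service-side batch shuffle was opt-in; as a sketch (assuming the Beam Python SDK, and noting that this later became the default for batch jobs in supported regions), a job selected it through its pipeline options:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: request the service-based Dataflow Shuffle for a batch job.
# Project, region, and bucket names are placeholders.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    experiments=['shuffle_mode=service'],   # opt in to Dataflow Shuffle
)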
Storing state: what about streaming pipelines?
● Streaming shuffle: just like in batch, streams need to be grouped and joined, via a distributed streaming shuffle
● Window data elements: late arriving data requires buffering time-window data, accumulating elements until triggering conditions occur
Goal: Grouping by Event Time into Time Windows
Diagram (repeated): input arriving in processing-time order (9:00–14:00) is grouped into output windows by event time (9:00–14:00).
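As a sketch of how this looks in the Beam model (Python SDK; the window size, lateness bound, and data shape are illustrative assumptions): elements are assigned to event-time windows, buffered, and emitted when the trigger fires, with late data accepted up to the allowed lateness.

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterCount, AccumulationMode

# Group (key, value) events into 1-minute event-time windows, emit when the
# watermark passes the end of the window, and re-emit accumulated results for
# elements that arrive up to 10 minutes late.
def window_and_sum(events):
    return (
        events
        | 'Window' >> beam.WindowInto(
            window.FixedWindows(60),                       # 1-minute windows in event time
            trigger=AfterWatermark(late=AfterCount(1)),    # fire at watermark, then per late element
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=600)                          # buffer late data for 10 minutes
        | 'SumPerKey' >> beam.CombinePerKey(sum))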
Even more state to store on disks in streaming
● Shuffle data elements: key ranges are assigned to workers (key 0000 … key 1234, key 1235 … key ABC2, key ABC3 … key DEF5, key DEF6 … key GHI2), and the data elements for these keys are stored on Persistent Disks
● Time window data: also assigned to workers; when time windows close, the data is processed on the workers
Dataflow Streaming Engine
The streaming shuffle and window state storage move out of the workers into the Streaming Engine backend; workers run only user code.
Benefits:
● Better supportability
● Fewer worker resources
● Smoother autoscaling
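As a sketch (assuming the Beam Python SDK; the exact option name can vary by SDK version), a streaming job opts in to Streaming Engine through its pipeline options when launching on Dataflow:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: with Streaming Engine enabled, streaming shuffle and window state
# live in the Dataflow backend instead of on worker persistent disks.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder project
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
    streaming=True,
    enable_streaming_engine=True,
)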
Autoscaling: even better with separate Compute and State Storage
● Dataflow with Streaming Engine: workers run only user code, while window state storage and streaming shuffle live in the Streaming Engine, so workers can be added or removed without moving state
● Dataflow without Streaming Engine: each worker VM also owns a key range of state storage (key 0000 … key 1234, key 1235 … key ABC2, ...), so scaling means redistributing that state
(Chart: autoscaling behavior of Dataflow with Streaming Engine vs. Dataflow without Streaming Engine.)
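As a sketch (again assuming the Beam Python SDK; these option names are my assumption of what Dataflow accepts, not values from the talk), autoscaling is selected and bounded through pipeline options:

from apache_beam.options.pipeline_options import PipelineOptions

# A sketch: select throughput-based autoscaling and cap the worker count.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                        # placeholder project
    region='us-central1',
    streaming=True,
    enable_streaming_engine=True,
    autoscaling_algorithm='THROUGHPUT_BASED',    # let Dataflow scale workers with load
    max_num_workers=20,                          # upper bound for scale-out
)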
Streaming can be hard, but it does not have to be.
We’ve set out to make Streaming as accessible as Batch.
Easy Stream Analytics in SQL
Example: join two inputs (Input1, Input2), group, and produce the output —
SELECT input1.*, input2.*
FROM input1 LEFT OUTER JOIN input2
ON input1.Id = input2.Id
Use Dataflow SQL from the BigQuery UI:
● Join Pub/Sub Streams with Files or Tables
● Write into BigQuery for dashboarding
● Store Pub/Sub schema in Data Catalog
● Use SQL skills for streaming data processing
Demo: Streaming Analytics with SQL
Pipeline: Transactions → Pub/Sub topic → Dataflow streaming SQL pipeline (joined with a BigQuery table) → BigQuery tables

SELECT
  sr.sales_region,
  TUMBLE_START("INTERVAL 5 SECOND") AS period_start,
  SUM(tr.payload.amount) AS amount
FROM `pubsub.dataflow-sql.transactions` AS tr
INNER JOIN
  `bigquery.dataflow-sql.opsdb.us_state_salesregions` AS sr
  ON tr.payload.state = sr.state_code
GROUP BY
  sr.sales_region,
  TUMBLE(tr.event_timestamp, "INTERVAL 5 SECOND")
Main takeaways
● Google Cloud offers both infrastructure-as-a-service and fully managed services
● Separating compute from state storage helps make stream and batch processing scalable
● SQL brings the complexity of Streaming Processing way down
Thank you!
