Big data processing with PubSub, Dataflow, and BigQuery

BIG DATA PROCESSING WITH PUB/SUB, DATAFLOW AND BIGQUERY
Thuyen Ho – Data Engineer @ KNOREX
© 2018 KNOREX

ABOUT KNOREX
Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading trading desks, agencies and brands.
Offices and direct business presence across the US, UK, Australia, China, India and Southeast Asia (SEA).
8 offices, 110+ staff

PROBLEM STATEMENT
Ingest a large volume of streaming user data, transform it based on ever-changing parameters, and store it in a database in real time. This data will be used for two purposes:
1. Targeting users in real time for advertising campaigns
2. Aggregating data to estimate campaign reach
[Diagram: Third-party partner → KNOREX DMP]
Ingest stream events
• QPS: ~1,500 to 2,000 events per second
• Event size: 50KB – 100KB
• Data volume: ~1TB a day
Historical data
• Reprocess: ~30TB each day
• Aggregate: ~60TB each day

AGENDA
• Quick Introduction To Pub/Sub, Dataflow and BigQuery
• KNOREX Approach
• Q&A

Quick Introduction To Pub/Sub, Dataflow and BigQuery

SERVERLESS STREAM PROCESSING PIPELINE WITH GCP
[Diagram: Pub/Sub (messaging queue) → data events → Dataflow (stream processing) → processed data → BigQuery (analytics engine)]

CLOUD PUB/SUB
Cloud Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.

CLOUD PUB/SUB – PULL SUBSCRIPTION

CLOUD PUB/SUB – PUSH SUBSCRIPTION

LAMBDA ARCHITECTURE
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. (source: wikipedia.org)
To balance:
• Latency
• Throughput
• Fault-tolerance

DATA PROCESSING - TRANSFORMS
[Diagram: input data from storage passes through data-processing steps (filter, transform, group, aggregate) to produce output data.]

CLOUD DATAFLOW
Cloud Dataflow is a fully managed, autoscaling execution environment for Apache Beam pipelines.
Beam: implement batch and streaming data processing jobs that run on any execution engine.
Dataflow: a great execution environment for those jobs.
Beam supports the following language-specific SDKs: Java, Python and Go.

BEAM ABSTRACTIONS
[Diagram: the data-processing flow annotated with Beam terms.]
• PCollection: the input and output data, bounded or unbounded
• PTransform: each processing step (filter, transform, group, aggregate)
• Pipeline: the whole graph, from input data to output data
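To make the mapping concrete, here is a minimal plain-Python sketch of the four steps in the diagram (filter, transform, group, aggregate). It deliberately does not use the Beam SDK. The event fields (`user`, `country`, `bytes`) and the function name `run_pipeline` are invented for illustration. In Beam terms, the list would be a bounded PCollection and each commented step one PTransform.

```python
from collections import defaultdict


def run_pipeline(events):
    """Filter -> transform -> group -> aggregate over an in-memory list.

    `events` stands in for a bounded PCollection; each step below plays
    the role of one PTransform. All field names are hypothetical.
    """
    # Filter: drop events that carry no user id.
    kept = [e for e in events if e.get("user")]
    # Transform: reshape each event into a (key, value) pair.
    pairs = [(e["country"], e["bytes"]) for e in kept]
    # Group: collect the values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Aggregate: reduce each group to a single number.
    return {key: sum(values) for key, values in grouped.items()}


events = [
    {"user": "u1", "country": "SG", "bytes": 10},
    {"user": None, "country": "SG", "bytes": 99},
    {"user": "u2", "country": "VN", "bytes": 5},
    {"user": "u3", "country": "SG", "bytes": 7},
]
totals = run_pipeline(events)
```

Here the second event is filtered out, so only three events contribute to the per-country totals.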

BEAM - FIXED TIME WINDOWS
[Figure: unbounded events plotted along a processing-time axis (00:00:00 to 00:01:30) and grouped into three consecutive, non-overlapping 30s windows: window 0, window 1 and window 2.]
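A minimal sketch of how fixed windowing assigns events, written in plain Python rather than with the Beam SDK. Timestamps are seconds from the start of the axis; the function names and the sample events are assumptions made for illustration.

```python
def fixed_window(timestamp_s, size_s=30):
    """Return the (start, end) of the fixed window containing the timestamp.

    Windows are half-open: [start, end).
    """
    start = (timestamp_s // size_s) * size_s
    return (start, start + size_s)


def group_by_fixed_window(events, size_s=30):
    """Group (timestamp_s, value) pairs by their fixed window."""
    windows = {}
    for timestamp_s, value in events:
        windows.setdefault(fixed_window(timestamp_s, size_s), []).append(value)
    return windows


# Hypothetical (timestamp in seconds, value) events.
events = [(3, 1), (12, 7), (29, 2), (30, 8), (45, 3), (61, 5)]
windows = group_by_fixed_window(events)
```

Each event lands in exactly one window, and an event at exactly 30s belongs to the second window because the windows are half-open.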

BEAM – SLIDING TIME WINDOWS
[Figure: unbounded events plotted along a processing-time axis (00:00:00 to 00:01:30) and grouped into overlapping 30s windows: window 0, window 1 and window 2.]
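A minimal plain-Python sketch of sliding-window assignment. The slide gives the window length (30s) but not how often a new window starts, so the 15s period below is an assumption, as are the function name and the choice to drop windows that would start before time 0.

```python
def sliding_windows(timestamp_s, size_s=30, period_s=15):
    """Return every half-open (start, end) window that contains the timestamp.

    A new window of length size_s starts every period_s seconds.
    Windows starting before time 0 are dropped to keep the sketch simple.
    """
    windows = []
    # Latest window start that is not after the timestamp.
    start = (timestamp_s // period_s) * period_s
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp_s - size_s:
        if start >= 0:
            windows.append((start, start + size_s))
        start -= period_s
    return sorted(windows)


assigned = sliding_windows(20)
```

Unlike fixed windows, one event can belong to several windows: with these assumed parameters, an event at 20s falls in both the 0s to 30s and the 15s to 45s windows.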

BEAM – SESSION WINDOWS
[Figure: events plotted along a processing-time axis (00:00:00 to 00:01:30) and grouped into three session windows (window 0, window 1, window 2) separated by a gap duration.]
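A minimal plain-Python sketch of session windowing for a single key. The 10s gap duration, the function name and the sample timestamps are assumptions for illustration; in Beam, sessions are built per key, which this sketch omits.

```python
def session_windows(timestamps, gap_s=10):
    """Group timestamps into sessions.

    An event joins the current session when it arrives less than gap_s
    seconds after the previous event; otherwise it starts a new session.
    """
    sessions = []
    for timestamp_s in sorted(timestamps):
        if sessions and timestamp_s - sessions[-1][-1] < gap_s:
            sessions[-1].append(timestamp_s)
        else:
            sessions.append([timestamp_s])
    return sessions


# Hypothetical event times in seconds for one user.
sessions = session_windows([1, 4, 9, 25, 28, 50], gap_s=10)
```

Session windows have no fixed length: each one grows for as long as events keep arriving within the gap duration.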

CLOUD BIGQUERY
A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics.
Some of the features:
• Serverless
• Real-time Analytics
• Standard SQL
• Storage and Compute Separation
• Flexible Data Ingestion
• Petabyte Scale

BIGQUERY STORAGE IS COLUMNAR
[Figure: Column1, Column2 and Column3 stored as separate columns.]
Each column is stored separately. No indexes or keys are required.

INGESTION-TIME PARTITIONED TABLE
SELECT Column1, Column2
FROM `database.table_name`
WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"

_PARTITIONTIME         _PARTITIONDATE
2018-12-01 00:00:00    2018-12-01
2018-12-01 00:00:00    2018-12-01
2018-12-02 00:00:00    2018-12-02
2018-12-02 00:00:00    2018-12-02
2018-12-02 00:00:00    2018-12-02
2018-12-03 00:00:00    2018-12-03
2018-12-03 00:00:00    2018-12-03
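The effect of the date filter can be sketched in plain Python: only partitions whose date falls in the half-open range are read. This only illustrates which partitions match the filter and is not how BigQuery is implemented; the function name `partitions_to_scan` is invented for the example.

```python
from datetime import date


def partitions_to_scan(partition_dates, start, end):
    """Return the distinct partition dates d with start <= d < end."""
    return sorted({d for d in partition_dates if start <= d < end})


# One entry per row, mirroring the seven rows listed on the slide.
rows = [
    date(2018, 12, 1), date(2018, 12, 1),
    date(2018, 12, 2), date(2018, 12, 2), date(2018, 12, 2),
    date(2018, 12, 3), date(2018, 12, 3),
]
scanned = partitions_to_scan(rows, date(2018, 12, 1), date(2018, 12, 3))
```

With the filter from the slide, the 2018-12-01 and 2018-12-02 partitions match and the 2018-12-03 partition is skipped.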


PARTITIONED TABLE
Partitioned based on data in a specified TIMESTAMP or DATE column (here, Column3).
Columns: Column1, Column2, Column3
Column3 values: 2018-12-01, 2018-12-01, 2018-12-02, 2018-12-02, 2018-12-02, 2018-12-03, 2018-12-03
WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"

ARCHITECTURE – STREAMING PIPELINE
[Diagram lanes: Third-party partner, Event ingest, Processing and analytics, CMS & RTB engine]
Components:
• API gateway: Cloud Load Balancing
• API: Compute Engine (autoscaling)
• Cookie: Cloud Pub/Sub (cookie topic)
• Device: Cloud Pub/Sub (device topic)
• Stream proc: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (sharding + clustering)
• Segmented users: Cloud Pub/Sub (segment topic)
• Python script: Compute Engine (autoscaling)
• Audience: Cloud Bigtable (3 regions)
• CMS

ARCHITECTURE – EVENT INGEST
The API code runs on autoscaling Compute Engine (GCE) instances.
It receives 1500 events per second from our partner.
The API endpoint puts events into two separate Pub/Sub topics: cookie and device.
[Diagram: 1500 events a sec → Cloud Load Balancing → API (Compute Engine, autoscaling) → Cookie (Cloud Pub/Sub, cookie topic) and Device (Cloud Pub/Sub, device topic)]
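The routing step can be sketched as follows. This is not the KNOREX implementation: the `id_type` field used to tell cookie events from device events is a hypothetical name, and `publish` is an injected callable standing in for a real Pub/Sub publisher client so that the sketch runs without any cloud dependency.

```python
def topic_for(event):
    """Pick the topic for an event based on a hypothetical 'id_type' field."""
    id_type = event.get("id_type")
    if id_type in ("cookie", "device"):
        return id_type
    raise ValueError(f"unroutable event, id_type={id_type!r}")


def route(events, publish):
    """Send each event to its topic via publish(topic, event)."""
    for event in events:
        publish(topic_for(event), event)


# In-memory stand-in for the two topics, for demonstration only.
published = {"cookie": [], "device": []}
sample = [
    {"id_type": "cookie", "id": "c-1"},
    {"id_type": "device", "id": "d-1"},
    {"id_type": "cookie", "id": "c-2"},
]
route(sample, lambda topic, event: published[topic].append(event))
```

Passing the publisher in as a parameter keeps the routing rule testable on its own; in a real service the callable would wrap the Pub/Sub client.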

ARCHITECTURE – PROCESSING AND ANALYTICS
Cloud Dataflow transforms and enriches raw events in real time, then inserts the processed data into BigQuery and also sends it to the RTB engine through Pub/Sub.
Each region has its own subscription that pulls data from the segment topic and inserts it into Bigtable.
BigQuery is the warehouse for analytics. Tables are partitioned by ingestion time, and data is kept for 60 days.
Diagram components:
• Cookie: Cloud Pub/Sub (cookie topic)
• Device: Cloud Pub/Sub (device topic)
• Stream proc: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (partition + clustering)
• Segmented users: Cloud Pub/Sub (segment topic)
• Asia region: Compute Engine → Cloud Bigtable
• JP region: Compute Engine → Cloud Bigtable
• US region: Compute Engine → Cloud Bigtable
• CMS
• KNX RTB Engine

ARCHITECTURE – BATCH PIPELINE
Dataflow also reads the past 30 days of data from BigQuery and reprocesses it in a batch job.
[Diagram: BigQuery (analytics engine) → batch pipeline → Cloud Dataflow (batch processing) → batch loads → BigQuery (analytics engine); Pub/Sub is also shown.]

Building Resilient Streaming Systems Lab
