BIG DATA PROCESSING WITH PUB/SUB,
DATAFLOW AND BIGQUERY
Thuyen Ho – Data Engineer @ KNOREX
© 2018 KNOREX
ABOUT KNOREX
Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading
trading desks, agencies and brands.
Offices and direct business presence across the US, UK, Australia, China, India and Southeast Asia (SEA).
• 8 offices
• 110+ staff
PROBLEM STATEMENT
Ingest a large volume of streaming user data, transform it based on
ever-changing parameters, and store it in a database in real time.
This data is used for two purposes:
1. Targeting users in real time for advertising campaigns
2. Aggregating data to estimate campaign reach
Data flow: third-party partner → KNOREX DMP
Ingest stream events
• QPS: ~1500 - 2000 events
• Event size: 50KB – 100KB
• Data Volume: ~1TB a day
Historical data
• Reprocess: ~30TB each day
• Aggregate: ~60TB each day
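As a rough sizing check, the stream figures above can be turned into a sustained byte rate. The sketch below assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and a constant rate over the whole day; both are assumptions made for illustration, not statements from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(events_per_second: float, event_size_kb: float) -> float:
    """Daily volume in TB for a constant event rate and event size (decimal units)."""
    bytes_per_second = events_per_second * event_size_kb * 1_000
    return bytes_per_second * SECONDS_PER_DAY / 1e12


# Lower and upper bounds taken from the QPS and event-size ranges on the slide.
low_tb = daily_volume_tb(1500, 50)
high_tb = daily_volume_tb(2000, 100)
```

Under these assumptions the stated rates work out to roughly 6.5 to 17 TB a day, which is more than the ~1TB a day listed. The daily figure presumably reflects a lower average rate, smaller typical events, or compression; the slides do not say which.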
AGENDA
• Quick Introduction To Pub/Sub, Dataflow and BigQuery
• KNOREX Approach
• Q&A
Quick Introduction To Pub/Sub, Dataflow and BigQuery
SERVERLESS STREAM PROCESSING PIPELINE WITH GCP
Pipeline: Pub/Sub (messaging queue) → data events → Dataflow (stream processing) → processed data → BigQuery (analytics engine)
CLOUD PUB/SUB
Cloud Pub/Sub is an asynchronous messaging service designed to be highly
reliable and scalable.
CLOUD PUB/SUB – PULL SUBSCRIPTION
CLOUD PUB/SUB – PUSH SUBSCRIPTION
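The difference between the two delivery modes can be sketched with a toy in-memory model. This is not the Pub/Sub client API: `ToySubscription` and all of its methods are hypothetical names used only to illustrate the flow. With pull, the subscriber asks for messages and acknowledges them, and unacknowledged messages are redelivered. With push, the service calls the subscriber's endpoint and treats a successful response as the acknowledgement.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class ToySubscription:
    """In-memory stand-in for one subscription (hypothetical, not the Pub/Sub API)."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[int, str]] = deque()  # waiting for delivery
        self._outstanding: Dict[int, str] = {}           # delivered, not yet acked
        self._next_id = 0

    def publish(self, data: str) -> None:
        self._pending.append((self._next_id, data))
        self._next_id += 1

    # Pull: the subscriber requests messages, then acks the ones it handled.
    def pull(self, max_messages: int) -> List[Tuple[int, str]]:
        batch = []
        while self._pending and len(batch) < max_messages:
            ack_id, data = self._pending.popleft()
            self._outstanding[ack_id] = data
            batch.append((ack_id, data))
        return batch

    def ack(self, ack_id: int) -> None:
        self._outstanding.pop(ack_id, None)

    def expire_unacked(self) -> None:
        """Requeue delivered-but-unacked messages, as an ack deadline would."""
        for ack_id, data in sorted(self._outstanding.items()):
            self._pending.append((ack_id, data))
        self._outstanding.clear()

    # Push: the service calls the endpoint; returning True counts as the ack.
    def push_all(self, endpoint: Callable[[str], bool]) -> None:
        for _ in range(len(self._pending)):
            ack_id, data = self._pending.popleft()
            if not endpoint(data):
                self._pending.append((ack_id, data))  # keep for a later retry
```

In this model a pull subscriber controls its own pace, while a push subscriber only has to expose an endpoint; that trade-off is the usual reason to pick one mode over the other.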
LAMBDA ARCHITECTURE
Lambda architecture is a data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch and stream-processing methods (source:
wikipedia.org).
It aims to balance:
• Latency
• Throughput
• Fault-tolerance
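A minimal sketch of the idea, assuming the result being served is a simple count per key: a batch layer recomputes a complete view from history, a speed layer covers only the events that arrived since the last batch run, and queries merge the two. The function names are illustrative and do not come from the slides.

```python
from collections import Counter
from typing import Dict, Iterable


def batch_view(history: Iterable[str]) -> Counter:
    """Batch layer: recompute counts from all historical events (complete, slow)."""
    return Counter(history)


def speed_view(recent: Iterable[str]) -> Counter:
    """Speed layer: count only events not yet covered by the last batch run."""
    return Counter(recent)


def serve(batch: Counter, speed: Counter) -> Dict[str, int]:
    """Serving layer: answer a query by merging both views."""
    return dict(batch + speed)
```

For example, `serve(batch_view(["a", "a", "b"]), speed_view(["b", "c"]))` combines the two counts for `"b"` while keeping `"a"` from the batch view and `"c"` from the speed view.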
DATA PROCESSING - TRANSFORMS
Diagram: Input data → data processing (Filter, Transform, Group, Aggregate) → Output data → Storage
CLOUD DATAFLOW
Cloud Dataflow is a fully managed, autoscaling execution environment for
Beam pipelines.
Beam supports the following language-specific SDKs: Java, Python and Go.
• Beam: implement batch and streaming data processing jobs that run on any
execution engine.
• Dataflow: a great execution environment.
BEAM ABSTRACTIONS
• PCollection: the input and output data, bounded or unbounded
• PTransform: each data processing step (Filter, Transform, Group, Aggregate)
• Pipeline: the whole graph, from input data through the transforms to storage
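The relationship between these abstractions can be sketched in plain Python, without the Beam SDK: each function below stands in for a transform that takes a collection and returns a new one, and the pipeline is the chain of transforms applied to a source. The event fields (`user`, `clicks`) and all function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def keep_valid(events):      # Filter: drop events without a user
    return [e for e in events if e.get("user")]


def to_key_value(events):    # Transform: reshape each event into (key, value)
    return [(e["user"], e["clicks"]) for e in events]


def group_by_key(pairs):     # Group: collect the values of each key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def sum_per_key(grouped):    # Aggregate: reduce each group to one value
    return {key: sum(values) for key, values in grouped.items()}


def run_pipeline(source, transforms):
    """Feed each transform the collection produced by the previous one."""
    return reduce(lambda collection, transform: transform(collection), transforms, source)


events = [
    {"user": "u1", "clicks": 1},
    {"user": "", "clicks": 9},
    {"user": "u2", "clicks": 5},
    {"user": "u1", "clicks": 2},
]
totals = run_pipeline(events, [keep_valid, to_key_value, group_by_key, sum_per_key])
```

This toy version works only on a bounded list held in memory. The point of the real abstractions is that the same chain of transforms can also be applied to an unbounded stream and distributed across workers.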
BEAM - FIXED TIME WINDOWS
[Figure: unbounded events on a processing-time axis from 00:00:00 to 00:01:30, split into three consecutive 30s windows (window 0, window 1, window 2).]
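Fixed windows cut the time axis into consecutive, non-overlapping intervals, so each event belongs to exactly one window. A minimal sketch, assuming integer timestamps in seconds and windows aligned to zero (the function names are illustrative):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Window = Tuple[int, int]


def fixed_window(timestamp: int, size: int = 30) -> Window:
    """Return the [start, end) window, in seconds, that contains the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sum_per_window(events: Iterable[Tuple[int, int]], size: int = 30) -> Dict[Window, int]:
    """Sum the values of (timestamp_seconds, value) events within each window."""
    totals: Dict[Window, int] = defaultdict(int)
    for timestamp, value in events:
        totals[fixed_window(timestamp, size)] += value
    return dict(totals)
```

In the Beam Python SDK the corresponding window function is `FixedWindows`, applied with `WindowInto`; the code above only illustrates the assignment rule.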
BEAM – SLIDING TIME WINDOWS
[Figure: unbounded events on a processing-time axis from 00:00:00 to 00:01:30, covered by 30s sliding windows (window 0, window 1, window 2).]
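Sliding windows have a size and a period: a new window starts every period, so when the period is shorter than the size the windows overlap and one event falls into several of them. The sketch below assumes a 15-second period, which is not stated on the slide, and integer timestamps in seconds.

```python
from typing import List, Tuple

Window = Tuple[int, int]


def sliding_windows(timestamp: int, size: int = 30, period: int = 15) -> List[Window]:
    """Return every [start, end) window, in seconds, that contains the timestamp."""
    windows = []
    # Latest window start at or before the timestamp, aligned to the period.
    start = timestamp - timestamp % period
    # Walk back one period at a time while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)
```

With these defaults every event lands in two windows; setting the period equal to the size reduces this to fixed windows. In the Beam Python SDK the corresponding window function is `SlidingWindows`.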
BEAM – SESSION WINDOWS
[Figure: events on a processing-time axis from 00:00:00 to 00:01:30, grouped into session windows (window 0, window 1, window 2) separated by a gap duration.]
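Session windows have no fixed length: events of the same key that arrive close together are merged into one window, and a quiet period of at least the gap duration closes it. A minimal sketch for the events of a single key, assuming integer timestamps in seconds and a hypothetical 10-second gap:

```python
from typing import Iterable, List, Tuple


def session_windows(timestamps: Iterable[int], gap: int = 10) -> List[Tuple[int, int]]:
    """Group one key's event timestamps into [start, end) session windows.

    Each event initially covers [t, t + gap); overlapping intervals are merged.
    """
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            sessions[-1][1] = ts + gap        # still inside the open session: extend it
        else:
            sessions.append([ts, ts + gap])   # gap elapsed: start a new session
    return [(start, end) for start, end in sessions]
```

In a keyed stream this grouping is applied per key, so each user gets independent sessions. In the Beam Python SDK the corresponding window function is `Sessions`.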
CLOUD BIGQUERY
A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for
analytics.
Some of its features:
• Serverless
• Real-time Analytics
• Standard SQL
• Storage and Compute Separation
• Flexible Data Ingestion
• Petabyte Scale
BIGQUERY STORAGE IS COLUMNAR
Column1 Column2 Column3
Each column is stored separately. No indexes or keys are required.
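The practical effect of a columnar layout is that a query reads only the columns it names. The toy model below is an illustration of that idea, not of BigQuery's actual storage format; the sample values and function names are made up.

```python
from typing import Dict, List

rows = [
    {"Column1": 1, "Column2": "a", "Column3": "2018-12-01"},
    {"Column1": 2, "Column2": "b", "Column3": "2018-12-02"},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Store the table as one list per column instead of one record per row."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def select(columns: Dict[str, list], names: List[str]) -> List[dict]:
    """Read only the requested columns; the others are never touched."""
    selected = [columns[name] for name in names]
    return [dict(zip(names, values)) for values in zip(*selected)]


result = select(to_columns(rows), ["Column1", "Column2"])
```

Selecting `Column1` and `Column2` here never reads the list that holds `Column3`, which is why naming only the needed columns keeps the scanned data small.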
INGESTION-TIME PARTITIONED TABLE
Column1  Column2  Column3  _PARTITIONTIME       _PARTITIONDATE
...      ...      ...      2018-12-01 00:00:00  2018-12-01
...      ...      ...      2018-12-01 00:00:00  2018-12-01
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-03 00:00:00  2018-12-03
...      ...      ...      2018-12-03 00:00:00  2018-12-03
SELECT Column1, Column2
FROM `database.table_name`
WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"
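Filtering on the partition column lets the engine skip whole partitions instead of scanning every row. The sketch below models that pruning with a dictionary keyed by partition date; the row labels are placeholders and the function name is illustrative.

```python
from datetime import date
from typing import Dict, List, Tuple

# One entry per daily partition, mirroring the seven rows in the table above.
partitions: Dict[date, List[str]] = {
    date(2018, 12, 1): ["row1", "row2"],
    date(2018, 12, 2): ["row3", "row4", "row5"],
    date(2018, 12, 3): ["row6", "row7"],
}


def scan(partitions: Dict[date, List[str]], start: date, end: date) -> Tuple[List[date], List[str]]:
    """Return the partitions read, and their rows, for start <= partition date < end."""
    read = [day for day in sorted(partitions) if start <= day < end]
    rows = [row for day in read for row in partitions[day]]
    return read, rows


read, rows = scan(partitions, date(2018, 12, 1), date(2018, 12, 3))
```

With the half-open range used in the query, only the 2018-12-01 and 2018-12-02 partitions are read and the 2018-12-03 partition is skipped entirely.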
PARTITIONED TABLE
Partitioned based on data in a specified TIMESTAMP or DATE column (here, Column3).
Column1  Column2  Column3
...      ...      2018-12-01
...      ...      2018-12-01
...      ...      2018-12-02
...      ...      2018-12-02
...      ...      2018-12-02
...      ...      2018-12-03
...      ...      2018-12-03
SELECT Column1, Column2
FROM `database.table_name`
WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"
KNOREX APPROACH
ARCHITECTURE – STREAMING PIPELINE
[Diagram zones: Third-party partner, Event ingest, Processing and analytics, CMS & RTB engine. Components shown:]
• API gateway: Cloud Load Balancing
• API: Compute Engine (autoscaling)
• Cookie: Cloud Pub/Sub (cookie topic)
• Device: Cloud Pub/Sub (device topic)
• Stream processing: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (sharding + clustering)
• Segmented users: Cloud Pub/Sub (segment topic)
• Python script: Compute Engine (autoscaling)
• Audience: Cloud Bigtable (3 regions)
• CMS
ARCHITECTURE – EVENT INGEST
The API code runs on Compute Engine (GCE) with autoscaling instances.
It receives 1500 events a second from our partner.
The API endpoint puts events into two separate topics: cookie and device.
Flow: partner (1500 events a sec) → Cloud Load Balancing → API on Compute Engine (autoscaling) → Cloud Pub/Sub cookie topic and device topic
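The split into two topics can be sketched as a small routing step. The slides do not say how the endpoint decides where an event goes, so the rule below (route by which identifier the event carries) and the field names `cookie_id` and `device_id` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List

COOKIE_TOPIC = "cookie"
DEVICE_TOPIC = "device"


def topic_for(event: Dict[str, str]) -> str:
    """Choose a topic from the kind of identifier the event carries (assumed rule)."""
    if event.get("device_id"):
        return DEVICE_TOPIC
    if event.get("cookie_id"):
        return COOKIE_TOPIC
    raise ValueError("event has neither a device_id nor a cookie_id")


def route(events: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group incoming events by destination topic; publishing itself is out of scope."""
    topics: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for event in events:
        topics[topic_for(event)].append(event)
    return dict(topics)
```

Keeping the two kinds of event on separate topics lets downstream consumers subscribe to, scale and monitor each stream independently.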
ARCHITECTURE – PROCESSING AND ANALYTICS
Cloud Dataflow transforms and enriches raw events in real time,
inserts the processed data into BigQuery, and sends it to the
RTB engine through Pub/Sub.
Each region has its own subscription that pulls data from the
segment topic and inserts it into Bigtable.
BigQuery is the warehouse for analytics. Tables are partitioned by
ingestion time, and data is kept for 60 days.
Flow:
• Cookie topic and device topic (Cloud Pub/Sub) → Cloud Dataflow (stream processing, autoscaling)
• Cloud Dataflow → BigQuery (data warehouse, partition + clustering)
• Cloud Dataflow → segment topic (Cloud Pub/Sub, segmented users)
• Segment topic → Compute Engine → Cloud Bigtable, in each of the Asia, JP and US regions
• Also shown: CMS and KNX RTB Engine
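Giving each region its own subscription works because every subscription on a topic receives its own copy of each published message. The toy model below shows that fan-out; `ToyTopic`, `drain` and the region keys are hypothetical names, and plain sets stand in for the regional Bigtable instances.

```python
from typing import Dict, List, Set


class ToyTopic:
    """In-memory stand-in for a topic with several subscriptions (not the Pub/Sub API)."""

    def __init__(self, subscription_names: List[str]) -> None:
        self.queues: Dict[str, List[str]] = {name: [] for name in subscription_names}

    def publish(self, message: str) -> None:
        for queue in self.queues.values():
            queue.append(message)  # every subscription gets its own copy


def drain(topic: ToyTopic, stores: Dict[str, Set[str]]) -> None:
    """Each region pulls from its own subscription and writes to its own store."""
    for region, queue in topic.queues.items():
        while queue:
            stores[region].add(queue.pop(0))


segment_topic = ToyTopic(["asia", "jp", "us"])
regional_stores: Dict[str, Set[str]] = {"asia": set(), "jp": set(), "us": set()}
segment_topic.publish("user-1")
segment_topic.publish("user-2")
drain(segment_topic, regional_stores)
```

After draining, all three regional stores hold the same segmented users, and a slow consumer in one region does not hold back the others.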
ARCHITECTURE – BATCH PIPELINE
Dataflow also reads the past 30 days of data from BigQuery and
reprocesses it in a batch job.
Flow: BigQuery (analytics engine) → batch pipeline → Cloud Dataflow (batch processing) → batch loads → BigQuery (analytics engine); Pub/Sub is also shown.
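Because the tables are partitioned by day, a 30-day reprocessing run amounts to selecting the last 30 daily partitions. A minimal sketch of that date arithmetic, assuming the window ends with yesterday's partition and excludes the current, still-filling day (the slides do not specify the exact boundaries):

```python
from datetime import date, timedelta
from typing import List


def partitions_to_reprocess(today: date, days: int = 30) -> List[date]:
    """Daily partitions covering the `days` days before `today`, oldest first."""
    return [today - timedelta(days=offset) for offset in range(days, 0, -1)]


window = partitions_to_reprocess(date(2018, 12, 31))
```

For a run on 2018-12-31 this selects the partitions from 2018-12-01 through 2018-12-30.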
DATAFLOW – PIPELINE VISUALIZATION
Q&A
Building Resilient Streaming Systems Lab
THANK YOU
KNOREX.COM

More Related Content

What's hot

Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
confluent
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Big Data Spain
 
How to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4jHow to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4j
GraphRM
 
IoT at Google Scale
IoT at Google ScaleIoT at Google Scale
IoT at Google Scale
James Chittenden
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Big Data Spain
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
DataWorks Summit
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
kbajda
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices
confluent
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
InfluxData
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free Life
SingleStore
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
Mathieu Dumoulin
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Big Data Spain
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
MapR Technologies
 
Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage
Jedha Bootcamp
 
Google cloud big data summit master gcp big data summit la - 10-20-2015
Google cloud big data summit   master gcp big data summit la - 10-20-2015Google cloud big data summit   master gcp big data summit la - 10-20-2015
Google cloud big data summit master gcp big data summit la - 10-20-2015
Raj Babu
 
The State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and BeyondThe State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and Beyond
SingleStore
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
DataWorks Summit
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Shirshanka Das
 
Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
SingleStore
 

What's hot (20)

Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 
How to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4jHow to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4j
 
IoT at Google Scale
IoT at Google ScaleIoT at Google Scale
IoT at Google Scale
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free Life
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage
 
Google cloud big data summit master gcp big data summit la - 10-20-2015
Google cloud big data summit   master gcp big data summit la - 10-20-2015Google cloud big data summit   master gcp big data summit la - 10-20-2015
Google cloud big data summit master gcp big data summit la - 10-20-2015
 
The State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and BeyondThe State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and Beyond
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
 

Similar to Big data processing with PubSub, Dataflow, and BigQuery

Laboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nubeLaboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nube
Software Guru
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data WarehouseData Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Rittman Analytics
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming Era
Attunity
 
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
Gleb Otochkin
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Amazon Web Services
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
HostedbyConfluent
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...
HostedbyConfluent
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
Matillion
 
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and StorageBuilding Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Databricks
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
SingleStore
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage
Charly Mostert
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
Databricks
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Cloudera, Inc.
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
Kent Graziano
 
Veritas + MongoDB
Veritas + MongoDBVeritas + MongoDB
Veritas + MongoDB
MongoDB
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Denodo
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
Kimmo Kantojärvi
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
Abdelkrim Hadjidj
 

Similar to Big data processing with PubSub, Dataflow, and BigQuery (20)

Laboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nubeLaboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nube
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data WarehouseData Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming Era
 
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and StorageBuilding Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
 
Veritas + MongoDB
Veritas + MongoDBVeritas + MongoDB
Veritas + MongoDB
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 

Recently uploaded

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
vcaxypu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
enxupq
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
MaleehaSheikh2
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
社内勉強会資料_LLM Agents                              .
Big data processing with PubSub, Dataflow, and BigQuery

  • 1. BIG DATA PROCESSING WITH PUB/SUB, DATAFLOW AND BIGQUERY Thuyen Ho – Data Engineer @ KNOREX © 2018 KNOREX
  • 2. ABOUT KNOREX: Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading trading desks, agencies and brands. Offices and direct business presence across the US, UK, Australia, China, India and Southeast Asia (SEA). 8 offices, 110+ staff.
  • 3. PROBLEM STATEMENT: Ingest a large volume of streaming user data, transform it based on ever-changing parameters, and store it in a database in real time. This data will be used for two purposes: (1) targeting users in real time for advertising campaigns; (2) aggregating data to estimate campaign reach. Diagram labels: Third-party partner; KNOREX DMP. Ingest stream events: QPS ~1,500–2,000 events; event size 50 KB–100 KB; data volume ~1 TB a day. Historical data: reprocess ~30 TB each day; aggregate ~60 TB each day.
  • 4. AGENDA: Quick Introduction To Pub/Sub, Dataflow and BigQuery; KNOREX Approach; Q&A
  • 5. Quick Introduction To Pub/Sub, Dataflow and BigQuery
  • 6. SERVERLESS STREAM PROCESSING PIPELINE WITH GCP: Diagram labels: Pub/Sub (messaging queue); Dataflow (stream processing); BigQuery (analytics engine); Data events; Processed data.
  • 7. CLOUD PUB/SUB: Cloud Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.
  • 8. CLOUD PUB/SUB – PULL SUBSCRIPTION
  • 9. CLOUD PUB/SUB – PUSH SUBSCRIPTION
  • 10. LAMBDA ARCHITECTURE: Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods (source: wikipedia.org). To balance: latency, throughput, fault-tolerance.
  • 11. DATA PROCESSING – TRANSFORMS: Diagram labels: Input Data; Data Processing (Transform, Filter, Group, Aggregate); Storage; Output Data.
  • 12. CLOUD DATAFLOW: Cloud Dataflow is a fully managed, autoscaling execution environment for Beam pipelines. Beam supports the following language-specific SDKs: Java, Python and Go. Beam: implement batch and streaming data processing jobs that run on any execution engine. Dataflow: a great execution environment.
  • 13. BEAM ABSTRACTIONS: The data-processing diagram from slide 11 annotated with Beam terms. Diagram labels: Input Data and Output Data (bounded / unbounded PCollection); Transform, Filter, Group, Aggregate (each a PTransform); Data Processing (Pipeline); Storage.
  • 14. BEAM – FIXED TIME WINDOWS: Figure: unbounded events plotted against processing time (00:00:00 to 00:01:30) and grouped into 30s windows 0, 1 and 2.
  • 15. BEAM – SLIDING TIME WINDOWS: Figure: unbounded events plotted against processing time (00:00:00 to 00:01:30) and grouped into 30s windows 0, 1 and 2.
  • 16. BEAM – SESSION WINDOWS: Figure: events plotted against processing time (00:00:00 to 00:01:30) and grouped into windows 0, 1 and 2, which are separated by a gap duration.
  • 17. CLOUD BIGQUERY: A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics. Some of the features: Serverless; Real-time Analytics; Standard SQL; Storage and Compute Separation; Flexible Data Ingestion; Petabyte Scale.
  • 18. BIGQUERY STORAGE IS COLUMNAR: Column1, Column2, Column3. Each column is stored separately. No indexes or keys are required.
  • 19. INGESTION-TIME PARTITIONED TABLE: Table with Column1, Column2, Column3 and the pseudo-columns _PARTITIONTIME and _PARTITIONDATE. _PARTITIONTIME values: 2018-12-01 00:00:00 (2 rows), 2018-12-02 00:00:00 (3 rows), 2018-12-03 00:00:00 (2 rows). _PARTITIONDATE values: 2018-12-01 (2 rows), 2018-12-02 (3 rows), 2018-12-03 (2 rows). Query: SELECT Column1, Column2 FROM `database.table_name` WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"
  • 20. INGESTION-TIME PARTITIONED TABLE: Same table and query as slide 19.
  • 21. PARTITIONED TABLE: Partitioned based on data in a specified TIMESTAMP or DATE column. Table with Column1, Column2 and a date-valued Column3: 2018-12-01 (2 rows), 2018-12-02 (3 rows), 2018-12-03 (2 rows). Query: SELECT Column1, Column2 FROM `database.table_name` WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"
  • 23. ARCHITECTURE – STREAMING PIPELINE: Diagram sections: Third-party partner; Event ingest; Processing and analytics; CMS & RTB engine. Diagram labels: API gateway (Cloud Load Balancing); API (Compute Engine, autoscaling); Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic); Stream proc (Cloud Dataflow, autoscaling); Data warehouse (BigQuery, sharding + clustering); Segmented users (Cloud Pub/Sub, device topic); Python script (Compute Engine, autoscaling); Audience (Cloud Bigtable, 3 regions); CMS.
  • 24. ARCHITECTURE – EVENT INGEST: GCE runs the code on auto-scaling instances. It receives 1,500 events a second from our partner. The API endpoint puts events into two separate topics: cookie and device. Diagram labels: Cloud Load Balancing; API (Compute Engine, autoscaling); Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic); 1500 events a sec.
  • 25. ARCHITECTURE – PROCESSING AND ANALYTICS: Cloud Dataflow transforms and enriches raw events in real time, inserts the processed data into BigQuery, and also sends it to the RTB engine through Pub/Sub. Each region has a subscription that pulls data from the segment topic and inserts it into Bigtable. BigQuery is the warehouse for analytics. Tables are partitioned by ingestion time and keep data for 60 days. Diagram labels: Cookie (Cloud Pub/Sub, cookie topic); Device (Cloud Pub/Sub, device topic); Stream proc (Cloud Dataflow, autoscaling); Data warehouse (BigQuery, partition + clustering); Segmented users (Cloud Pub/Sub, segment topic); Asia region (Compute Engine, Cloud Bigtable); JP region (Compute Engine, Cloud Bigtable); US region (Compute Engine, Cloud Bigtable); CMS; KNX RTB Engine.
  • 26. ARCHITECTURE – BATCH PIPELINE: Dataflow also reads the past 30 days of data from BigQuery and reprocesses it in a batch job. Diagram labels: Pub/Sub; Batch loads; BigQuery (analytics engine); Batch pipeline; Cloud Dataflow (batch processing); BigQuery (analytics engine).
  • 27. DATAFLOW – PIPELINE VISUALIZATION
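
The three windowing strategies on slides 14 to 16 can be illustrated with a small sketch of how a timestamp is assigned to windows. This is plain Python, not the Beam API: the function names are hypothetical, timestamps are integer seconds, windows are half-open intervals, and the 10s sliding period in the usage note is an assumption, since the slides only state the 30s window size.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end) in seconds


def fixed_window(ts: int, size: int) -> Window:
    """Return the single fixed window of length `size` that contains `ts`."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Return every window of length `size`, one starting every `period`
    seconds, that contains `ts`. Windows overlap when period < size."""
    windows = []
    start = ts - ts % period  # latest window start that is <= ts
    while start > ts - size:  # earlier starts still cover ts
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    """Group events into sessions: each event opens [ts, ts + gap), and
    overlapping intervals are merged into one session."""
    sessions: List[Window] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the current session, so extend it.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            # Gap exceeded (or first event): start a new session.
            sessions.append((ts, ts + gap))
    return sessions
```

For example, an event at second 45 falls into exactly one fixed 30s window, `fixed_window(45, 30)`, but into three sliding windows with `sliding_windows(45, 30, 10)`, while `session_windows([0, 10, 25, 70, 80], 20)` splits the events into two sessions because of the silent stretch between seconds 25 and 70.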