BIG DATA PROCESSING WITH PUB/SUB,
DATAFLOW AND BIGQUERY
Thuyen Ho – Data Engineer @ KNOREX
© 2018 KNOREX
ABOUT KNOREX
Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading
trading desks, agencies and brands.
Offices and direct business presence across the US, UK, Australia, China, India and Southeast Asia (SEA).
• 8 offices
• 110+ staff
PROBLEM STATEMENT
Ingest a large volume of streaming user data,
transform it based on ever-changing parameters, and
store it in a database in real time. This data will be
used for two purposes:
1. Targeting users in real time for advertising campaigns
2. Aggregating data to estimate campaign reach
Third-party partner → KNOREX DMP
Ingest stream events
• QPS: ~1,500 to 2,000 events per second
• Event size: 50KB – 100KB
• Data Volume: ~1TB a day
Historical data
• Reprocess: ~30TB each day
• Aggregate: ~60TB each day
AGENDA
• Quick Introduction To Pub/Sub, Dataflow and BigQuery
• KNOREX Approach
• Q&A
Quick Introduction To Pub/Sub, Dataflow and BigQuery
SERVERLESS STREAM PROCESSING PIPELINE WITH GCP
Data events → Pub/Sub (messaging queue) → Dataflow (stream processing) → Processed data → BigQuery (analytics engine)
CLOUD PUB/SUB
Cloud Pub/Sub is an asynchronous messaging service designed to be highly
reliable and scalable.
CLOUD PUB/SUB – PULL SUBSCRIPTION
CLOUD PUB/SUB – PUSH SUBSCRIPTION
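The two subscription modes differ in who initiates delivery: with pull, the subscriber asks for messages; with push, the service sends each message to an endpoint the subscriber registered. The sketch below is a toy in-memory model of that difference, not the Cloud Pub/Sub client API. The class ToySubscription and its methods are hypothetical names, and acknowledgement deadlines and redelivery are left out.

```python
from collections import deque


class ToySubscription:
    """In-memory stand-in for one subscription (hypothetical, not the real API)."""

    def __init__(self, push_endpoint=None):
        self.backlog = deque()
        # push_endpoint is a callable that returns True when delivery succeeds.
        self.push_endpoint = push_endpoint

    def publish(self, message):
        if self.push_endpoint is None:
            # Pull mode: the message waits until the subscriber asks for it.
            self.backlog.append(message)
        elif not self.push_endpoint(message):
            # Push mode: delivery is attempted at once; failures stay queued.
            self.backlog.append(message)

    def pull(self, max_messages):
        """Subscriber-initiated delivery of up to max_messages messages."""
        count = min(max_messages, len(self.backlog))
        return [self.backlog.popleft() for _ in range(count)]


# Pull: the subscriber decides when to fetch.
pull_sub = ToySubscription()
pull_sub.publish("e1")
pull_sub.publish("e2")
batch = pull_sub.pull(10)

# Push: the service calls the registered endpoint for every message.
received = []


def endpoint(message):
    received.append(message)
    return True


push_sub = ToySubscription(push_endpoint=endpoint)
push_sub.publish("e3")
```

Pull tends to suit workers that want to control their own rate, while push suits HTTP endpoints that should react as soon as a message arrives.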
LAMBDA ARCHITECTURE
Lambda architecture is a data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch and stream-processing methods (source: wikipedia.org).
It aims to balance:
• Latency
• Throughput
• Fault-tolerance
DATA PROCESSING - TRANSFORMS
[Diagram: Input Data → Data Processing (Filter, Transform, Group, Aggregate) → Output Data, with Storage]
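The four processing steps named on this slide can be illustrated on a small in-memory list. This is a plain-Python sketch, not pipeline code from the talk; the event fields (user, type, bytes) are made up for the example.

```python
from collections import defaultdict

# Hypothetical input data.
events = [
    {"user": "u1", "type": "view", "bytes": 120},
    {"user": "u2", "type": "click", "bytes": 80},
    {"user": "u1", "type": "click", "bytes": 200},
    {"user": "u3", "type": "view", "bytes": 50},
    {"user": "u2", "type": "click", "bytes": 40},
]

# Filter: keep only click events.
clicks = [e for e in events if e["type"] == "click"]

# Transform: project each event to a (key, value) pair.
pairs = [(e["user"], e["bytes"]) for e in clicks]

# Group: collect the values that share a key.
grouped = defaultdict(list)
for user, size in pairs:
    grouped[user].append(size)

# Aggregate: reduce each group to a single value.
totals = {user: sum(sizes) for user, sizes in grouped.items()}
print(totals)
```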
CLOUD DATAFLOW
Cloud Dataflow is a fully managed, autoscaling execution environment for
Beam pipelines.
Beam supports the following language-specific SDKs: Java, Python and Go.
Beam: implement batch and streaming data
processing jobs that run on any
execution engine.
Dataflow: a great execution environment.
BEAM ABSTRACTIONS
[Diagram: the data-processing picture mapped to Beam abstractions. Input Data and Output Data (bounded or unbounded) are PCollections; each processing step (Filter, Transform, Group, Aggregate) is a PTransform; the whole graph is a Pipeline.]
BEAM - FIXED TIME WINDOWS
[Figure: unbounded events plotted against processing time and grouped into consecutive, non-overlapping 30s windows: window 0 (00:00:00 to 00:00:30), window 1 (00:00:30 to 00:01:00), window 2 (00:01:00 to 00:01:30).]
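Fixed windowing assigns every event to exactly one window based on its timestamp. The sketch below shows that assignment in plain Python rather than with the Beam SDK; the function name fixed_window and the sample events are illustrative assumptions, with timestamps given in seconds.

```python
from collections import defaultdict
from typing import Tuple


def fixed_window(timestamp_s: float, size_s: int = 30) -> Tuple[int, int]:
    """Return the [start, end) fixed window that contains the timestamp."""
    start = int(timestamp_s // size_s) * size_s
    return (start, start + size_s)


# Hypothetical (timestamp in seconds, value) events.
events = [(3, 1), (12, 7), (29, 2), (30, 8), (45, 3), (61, 5), (89, 9)]

# Sum the values that fall into each 30s window.
sums = defaultdict(int)
for ts, value in events:
    sums[fixed_window(ts)] += value

for window in sorted(sums):
    print(window, sums[window])
```

An event at exactly 30 seconds belongs to the second window, because each window includes its start and excludes its end.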
BEAM – SLIDING TIME WINDOWS
[Figure: unbounded events plotted against processing time (00:00:00 to 00:01:30) and grouped into overlapping 30s windows: window 0, window 1, window 2.]
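With sliding windows, a new window starts every period while each window still spans the full window size, so one event can belong to several windows. The following plain-Python sketch shows the assignment; the 15s period is an assumption made for the example (the slide only states the 30s size), and sliding_windows is a hypothetical helper, not a Beam function.

```python
from typing import List, Tuple


def sliding_windows(timestamp_s: float, size_s: int = 30,
                    period_s: int = 15) -> List[Tuple[int, int]]:
    """Return every [start, end) window that contains the timestamp."""
    # Latest window start at or before the timestamp.
    start = int(timestamp_s // period_s) * period_s
    windows = []
    # Walk back one period at a time while the window still covers the timestamp.
    while start > timestamp_s - size_s:
        windows.append((start, start + size_s))
        start -= period_s
    return sorted(windows)


print(sliding_windows(20))
print(sliding_windows(30))
```

With these settings each event lands in two windows, since the size is twice the period.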
BEAM – SESSION WINDOWS
[Figure: events plotted against processing time (00:00:00 to 00:01:30). Events separated by less than the gap duration fall into the same session window: window 0, window 1, window 2.]
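Session windows have no fixed size: a session keeps growing while events keep arriving within the gap duration, and a longer pause starts a new session. Below is a minimal plain-Python sketch of that rule for the events of a single key; the 10s gap and the timestamps are assumptions chosen for the example.

```python
from typing import List


def session_windows(timestamps: List[float], gap_s: float = 10) -> List[List[float]]:
    """Group timestamps into sessions separated by pauses of at least gap_s."""
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap_s:
            # Close enough to the previous event: extend the current session.
            sessions[-1].append(ts)
        else:
            # Pause of gap_s or more: start a new session.
            sessions.append([ts])
    return sessions


sessions = session_windows([1, 4, 9, 30, 33, 70], gap_s=10)
print(sessions)
```

In Beam, sessions are tracked per key, so this grouping would run separately for each user or device.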
CLOUD BIGQUERY
A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for
analytics.
Some of the features:
• Serverless
• Real-time Analytics
• Standard SQL
• Storage and Compute Separation
• Flexible Data Ingestion
• Petabyte Scale
BIGQUERY STORAGE IS COLUMNAR
Column1 Column2 Column3
Each column is stored separately. No
indexes or keys are required.
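The effect of columnar storage can be shown by pivoting a few rows into one list per column: a query that needs one column then reads only that list. This is a conceptual plain-Python sketch with made-up values, not a description of BigQuery's actual storage format.

```python
# Row-oriented layout: one record per row (hypothetical values).
rows = [
    {"Column1": "a", "Column2": 10, "Column3": "2018-12-01"},
    {"Column1": "b", "Column2": 20, "Column3": "2018-12-02"},
    {"Column1": "c", "Column2": 30, "Column3": "2018-12-03"},
]

# Column-oriented layout: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over Column2 touches a single list and skips the other columns.
total = sum(columns["Column2"])
print(total)
```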
INGESTION-TIME PARTITIONED TABLE
Column1  Column2  Column3  _PARTITIONTIME       _PARTITIONDATE
...      ...      ...      2018-12-01 00:00:00  2018-12-01
...      ...      ...      2018-12-01 00:00:00  2018-12-01
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-02 00:00:00  2018-12-02
...      ...      ...      2018-12-03 00:00:00  2018-12-03
...      ...      ...      2018-12-03 00:00:00  2018-12-03

SELECT Column1, Column2
FROM `database.table_name`
WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"
PARTITIONED TABLE
Partitioned based on data in a specified TIMESTAMP or DATE column.

Column1  Column2  Column3
...      ...      2018-12-01
...      ...      2018-12-01
...      ...      2018-12-02
...      ...      2018-12-02
...      ...      2018-12-02
...      ...      2018-12-03
...      ...      2018-12-03

SELECT Column1, Column2
FROM `database.table_name`
WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"
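A date filter on the partitioning column lets the engine skip whole partitions instead of scanning the full table. The sketch below models that pruning with a dictionary keyed by date; it is a toy illustration in plain Python, the row values are invented, and the function name query is hypothetical.

```python
from datetime import date

# One entry per daily partition (hypothetical rows), mirroring the table above.
partitions = {
    date(2018, 12, 1): [("a", 1), ("b", 2)],
    date(2018, 12, 2): [("c", 3), ("d", 4), ("e", 5)],
    date(2018, 12, 3): [("f", 6), ("g", 7)],
}


def query(table, start, end):
    """Read only the partitions whose date falls in [start, end)."""
    scanned = [day for day in sorted(table) if start <= day < end]
    rows = [row for day in scanned for row in table[day]]
    return scanned, rows


# Same range as the WHERE clause: from 2018-12-01 up to, not including, 2018-12-03.
scanned, rows = query(partitions, date(2018, 12, 1), date(2018, 12, 3))
print(len(scanned), "partitions scanned,", len(rows), "rows read")
```

The 2018-12-03 partition is never touched, which is why queries that filter on the partition column read less data.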
KNOREX APPROACH
ARCHITECTURE – STREAMING PIPELINE
[Architecture diagram. Section labels: Third-party partner; Event ingest; Processing and analytics; CMS & RTB engine. Components:]
• API gateway: Cloud Load Balancing
• API: Compute Engine (autoscaling)
• Cookie: Cloud Pub/Sub, cookie topic
• Device: Cloud Pub/Sub, device topic
• Stream processing: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (sharding + clustering)
• Segmented users: Cloud Pub/Sub, segment topic
• Python script: Compute Engine (autoscaling)
• Audience: Cloud Bigtable (3 regions)
• CMS
ARCHITECTURE – EVENT INGEST
The API code runs on Compute Engine (GCE) autoscaling
instances.
It receives about 1,500 events per second from
our partner.
The API endpoint puts events into two
separate Pub/Sub topics: cookie and device.
[Diagram: 1,500 events a sec → Cloud Load Balancing → API (Compute Engine, autoscaling) → Cloud Pub/Sub cookie topic and device topic]
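The routing step at the API endpoint can be sketched as a function that picks a topic for each incoming event. The slides do not say how an event is recognised as cookie-based or device-based, so the field names cookie_id and device_id below are assumptions, and the in-memory dictionary stands in for the two Pub/Sub topics.

```python
from collections import defaultdict


def choose_topic(event: dict) -> str:
    """Pick the topic for an event (assumed fields: cookie_id or device_id)."""
    if "cookie_id" in event:
        return "cookie"
    if "device_id" in event:
        return "device"
    raise ValueError("event has neither cookie_id nor device_id")


# Stand-in for the two topics; a real endpoint would publish to Pub/Sub here.
topics = defaultdict(list)

incoming = [
    {"cookie_id": "c-1", "url": "/home"},
    {"device_id": "d-9", "app": "news"},
    {"cookie_id": "c-2", "url": "/cart"},
]

for event in incoming:
    topics[choose_topic(event)].append(event)

print({name: len(messages) for name, messages in topics.items()})
```

Keeping the two identifier types in separate topics lets downstream consumers subscribe to only the stream they need.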
ARCHITECTURE – PROCESSING AND ANALYTICS
Cloud Dataflow transforms and
enriches raw events in real time,
inserts the processed data
into BigQuery, and also sends it
to the RTB engine through Pub/Sub.
Each region has a subscription that
pulls data from the segment topic and
inserts it into Bigtable.
BigQuery is the warehouse for
analytics. Tables are partitioned by
ingestion time, and data is kept for 60
days.
[Diagram components:]
• Cookie: Cloud Pub/Sub, cookie topic
• Device: Cloud Pub/Sub, device topic
• Stream processing: Cloud Dataflow (autoscaling)
• Data warehouse: BigQuery (partition + clustering)
• Segmented users: Cloud Pub/Sub, segment topic
• Asia region: Compute Engine → Cloud Bigtable
• JP region: Compute Engine → Cloud Bigtable
• US region: Compute Engine → Cloud Bigtable
• CMS
• KNX RTB Engine
ARCHITECTURE – BATCH PIPELINE
Dataflow also reads the past 30
days of data from BigQuery and
reprocesses it in a
batch job.
[Diagram: BigQuery (analytics engine) → batch pipeline → Cloud Dataflow (batch processing) → batch loads → BigQuery (analytics engine); Pub/Sub also appears in the diagram]
DATAFLOW – PIPELINE VISUALIZATION
Q&A
Building Resilient Streaming Systems Lab
THANK YOU
KNOREX.COM

More Related Content

What's hot

Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL confluent
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...Big Data Spain
 
How to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4jHow to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4jGraphRM
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Big Data Spain
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceDataWorks Summit
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices confluent
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringInfluxData
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeSingleStore
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Big Data Spain
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage Jedha Bootcamp
 
Google cloud big data summit master gcp big data summit la - 10-20-2015
Google cloud big data summit   master gcp big data summit la - 10-20-2015Google cloud big data summit   master gcp big data summit la - 10-20-2015
Google cloud big data summit master gcp big data summit la - 10-20-2015Raj Babu
 
The State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and BeyondThe State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and BeyondSingleStore
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneDataWorks Summit
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeSingleStore
 

What's hot (20)

Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
Serving the Real-Time Data Needs of an Airport with Kafka Streams and KSQL
 
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 Migration and Coexistence between Relational and NoSQL Databases by Manuel H... Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
Migration and Coexistence between Relational and NoSQL Databases by Manuel H...
 
How to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4jHow to leverage Kafka data streams with Neo4j
How to leverage Kafka data streams with Neo4j
 
IoT at Google Scale
IoT at Google ScaleIoT at Google Scale
IoT at Google Scale
 
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana...
 
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open SourceHigh Performance and Scalable Geospatial Analytics on Cloud with Open Source
High Performance and Scalable Geospatial Analytics on Cloud with Open Source
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices Kafka as an Eventing System to Replatform a Monolith into Microservices
Kafka as an Eventing System to Replatform a Monolith into Microservices
 
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint MonitoringGain Deep Visibility into APIs and Integrations with Anypoint Monitoring
Gain Deep Visibility into APIs and Integrations with Anypoint Monitoring
 
Building the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free LifeBuilding the Foundation for a Latency-Free Life
Building the Foundation for a Latency-Free Life
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
Stream Processing as Game Changer for Big Data and Internet of Things by Kai ...
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage Les objets connectés : de nombreux cas d'usage
Les objets connectés : de nombreux cas d'usage
 
Google cloud big data summit master gcp big data summit la - 10-20-2015
Google cloud big data summit   master gcp big data summit la - 10-20-2015Google cloud big data summit   master gcp big data summit la - 10-20-2015
Google cloud big data summit master gcp big data summit la - 10-20-2015
 
The State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and BeyondThe State of the Data Warehouse in 2017 and Beyond
The State of the Data Warehouse in 2017 and Beyond
 
Google Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better OneGoogle Cloud Dataflow Two Worlds Become a Much Better One
Google Cloud Dataflow Two Worlds Become a Much Better One
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Five ways database modernization simplifies your data life
Five ways database modernization simplifies your data lifeFive ways database modernization simplifies your data life
Five ways database modernization simplifies your data life
 

Similar to Big data processing with PubSub, Dataflow, and BigQuery

Laboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nubeLaboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nubeSoftware Guru
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation Brett VanderPlaats
 
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data WarehouseData Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data WarehouseRittman Analytics
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraAttunity
 
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018Gleb Otochkin
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Amazon Web Services
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...HostedbyConfluent
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMatillion
 
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and StorageBuilding Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and StorageDatabricks
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsSingleStore
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage Charly Mostert
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationDatabricks
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Cloudera, Inc.
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeKent Graziano
 
Veritas + MongoDB
Veritas + MongoDBVeritas + MongoDB
Veritas + MongoDBMongoDB
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Denodo
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationAbdelkrim Hadjidj
 

Similar to Big data processing with PubSub, Dataflow, and BigQuery (20)

Laboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nubeLaboratorio práctico: Data warehouse en la nube
Laboratorio práctico: Data warehouse en la nube
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data WarehouseData Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
Data Warehouse Like a Tech Startup with Oracle Autonomous Data Warehouse
 
Digital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming EraDigital Business Transformation in the Streaming Era
Digital Business Transformation in the Streaming Era
 
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
One bridge to connect them all. Oracle GoldenGate for Big Data.UKOUG Tech 2018
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...Comparing three data ingestion approaches where Apache Kafka integrates with ...
Comparing three data ingestion approaches where Apache Kafka integrates with ...
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and StorageBuilding Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
Building Resilient and Scalable Data Pipelines by Decoupling Compute and Storage
 
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive AnalyticsThe Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
The Real-Time CDO and the Cloud-Forward Path to Predictive Analytics
 
How Financial Services can Save On File Storage
How Financial Services can Save On File Storage How Financial Services can Save On File Storage
How Financial Services can Save On File Storage
 
The Hidden Value of Hadoop Migration
The Hidden Value of Hadoop MigrationThe Hidden Value of Hadoop Migration
The Hidden Value of Hadoop Migration
 
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...
 
Delivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with SnowflakeDelivering Data Democratization in the Cloud with Snowflake
Delivering Data Democratization in the Cloud with Snowflake
 
Veritas + MongoDB
Veritas + MongoDBVeritas + MongoDB
Veritas + MongoDB
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 

Recently uploaded

Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Jon Hansen
 
Supply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictSupply chain analytics to combat the effects of Ukraine-Russia-conflict
Supply chain analytics to combat the effects of Ukraine-Russia-conflictJack Cole
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Calllward7
 
Generative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfGenerative AI for Trailblazers_ Unlock the Future of AI.pdf
Generative AI for Trailblazers_ Unlock the Future of AI.pdfEmmanuel Dauda
 
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理
一比一原版(Monash毕业证书)莫纳什大学毕业证成绩单如何办理pyhepag
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonPayment Village
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group MeetingAlison Pitt
 
一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理一比一原版麦考瑞大学毕业证成绩单如何办理
一比一原版麦考瑞大学毕业证成绩单如何办理cyebo
 
Data analytics courses in Nepal Presentation
Data analytics courses in Nepal PresentationData analytics courses in Nepal Presentation
Data analytics courses in Nepal Presentationanshikakulshreshtha11
 
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsWebinar One View, Multiple Systems No-Code Integration of Salesforce and ERPs
Webinar One View, Multiple Systems No-Code Integration of Salesforce and ERPsCEPTES Software Inc
 
basics of data science with application areas.pdf
basics of data science with application areas.pdfbasics of data science with application areas.pdf
basics of data science with application areas.pdfvyankatesh1
 
Pre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxPre-ProductionImproveddsfjgndflghtgg.pptx
Pre-ProductionImproveddsfjgndflghtgg.pptxStephen266013
 
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理
一比一原版加利福尼亚大学尔湾分校毕业证成绩单如何办理pyhepag
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理pyhepag
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理cyebo
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理pyhepag
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxDilipVasan
 
Easy and simple project file on mp online
Easy and simple project file on mp onlineEasy and simple project file on mp online
Easy and simple project file on mp onlinebalibahu1313
 

Recently uploaded (20)

Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec

Big data processing with PubSub, Dataflow, and BigQuery

  • 1. BIG DATA PROCESSING WITH PUB/SUB, DATAFLOW AND BIGQUERY Thuyen Ho – Data Engineer @ KNOREX © 2018 KNOREX
  • 2. ABOUT KNOREX: Established in 2010, Knorex provides Precision Performance Marketing products and solutions to leading trading desks, agencies and brands. Offices and direct business presence across US, UK, Australia, China, India and Southeast Asia (SEA). 8 offices, 110+ staff.
  • 3. PROBLEM STATEMENT: Ingest a large volume of streaming user data, transform it based on ever-changing parameters, and store it in a database in real time. This data will be used for two purposes: (1) targeting users in real time for advertising campaigns; (2) aggregating data to estimate campaign reach. Data flows from a third-party partner to the KNOREX DMP. Ingested stream events: QPS ~1500–2000 events; event size 50KB–100KB; data volume ~1TB a day. Historical data: reprocess ~30TB each day; aggregate ~60TB each day.
  • 4. AGENDA • Quick Introduction To Pub/Sub, Dataflow and BigQuery • KNOREX Approach • Q&A
  • 5. Quick Introduction To Pub/Sub, Dataflow and BigQuery
  • 6. SERVERLESS STREAM PROCESSING PIPELINE WITH GCP [Diagram: data events → Pub/Sub (messaging queue) → Dataflow (stream processing) → processed data → BigQuery (analytics engine).]
  • 7. CLOUD PUB/SUB: Cloud Pub/Sub is an asynchronous messaging service designed to be highly reliable and scalable.
  • 8. CLOUD PUB/SUB – PULL SUBSCRIPTION
  • 9. CLOUD PUB/SUB – PUSH SUBSCRIPTION
  • 10. LAMBDA ARCHITECTURE: Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. (source: wikipedia.org) It aims to balance: • Latency • Throughput • Fault-tolerance
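As a rough illustration of the Lambda idea, the sketch below merges a precomputed batch view with a small real-time (speed) view at query time. It is plain Python, not GCP code; the campaign names, the counts and the `merge_views` helper are hypothetical and do not come from the slides.

```python
from collections import Counter

def merge_views(batch_view, speed_view):
    """Combine counts from the batch layer with counts from the speed layer."""
    merged = Counter(batch_view)   # complete but stale results from the batch job
    merged.update(speed_view)      # add recent results not yet covered by a batch run
    return dict(merged)

# Hypothetical reach counts per campaign.
batch_view = {"campaign_a": 1000, "campaign_b": 250}
speed_view = {"campaign_a": 12, "campaign_c": 3}

serving_view = merge_views(batch_view, speed_view)
print(serving_view)
```

The batch layer favors throughput and fault tolerance, while the speed layer keeps latency low; the serving step only has to add the two together.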
  • 11. DATA PROCESSING - TRANSFORMS [Diagram: input data → data processing (filter, group, aggregate, transform), backed by storage → output data.]
  • 12. CLOUD DATAFLOW: Cloud Dataflow is a fully managed, autoscaling execution environment for Beam pipelines. Beam supports the following language-specific SDKs: Java, Python and Go. With Beam you implement batch and streaming data processing jobs that run on any execution engine; Dataflow is a great execution environment for them.
  • 13. BEAM ABSTRACTIONS [Diagram: the data processing diagram from slide 11 annotated with Beam terms. Bounded or unbounded input and output data are PCollections; the filter, group, aggregate and transform steps are PTransforms; the whole graph is a Pipeline.]
  • 14. BEAM - FIXED TIME WINDOWS [Diagram: unbounded events plotted against processing time from 00:00:00 to 00:01:30, divided into consecutive 30s windows 0, 1 and 2.]
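A minimal pure-Python sketch of fixed-window assignment, assuming integer timestamps in seconds and the 30s window size from the slide. It imitates the idea only and does not use the Beam SDK; the function names and the sample events are made up for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 30  # seconds, as on the slide

def fixed_window(ts, size=WINDOW_SIZE):
    """Return the [start, end) window that contains timestamp ts."""
    start = ts - ts % size
    return (start, start + size)

def sum_per_window(events, size=WINDOW_SIZE):
    """Sum event values per fixed window; events are (timestamp, value) pairs."""
    totals = defaultdict(int)
    for ts, value in events:
        totals[fixed_window(ts, size)] += value
    return dict(totals)

# Hypothetical (timestamp, value) events.
events = [(5, 1), (12, 7), (29, 2), (30, 1), (58, 8), (61, 3)]
totals = sum_per_window(events)
print(totals)
```

Every event lands in exactly one window, which is what distinguishes fixed windows from sliding ones.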
  • 15. BEAM – SLIDING TIME WINDOWS [Diagram: unbounded events plotted against processing time from 00:00:00 to 00:01:30, covered by 30s windows 0, 1 and 2.]
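Sliding windows differ from fixed windows in that they overlap, so one event can belong to several windows. The sketch below is plain Python rather than Beam code. The 30s size follows the slide, while the 15s period (how often a new window starts) is an assumption chosen for illustration.

```python
def sliding_windows(ts, size=30, period=15):
    """Return every [start, end) window of length `size`, starting at a
    multiple of `period`, that contains timestamp ts."""
    last_start = ts - ts % period              # latest window start that is <= ts
    starts = range(last_start, ts - size, -period)
    return sorted((start, start + size) for start in starts)

# An event at t=45s falls into two overlapping windows.
print(sliding_windows(45))
```

With these values each event belongs to size / period = 2 windows. Windows near the origin may start before 0, which is harmless for the illustration.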
  • 16. BEAM – SESSION WINDOWS [Diagram: keyed events plotted against processing time from 00:00:00 to 00:01:30, grouped into windows 0, 1 and 2 that are separated by a gap duration.]
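Session windows can be sketched as follows: each event opens a window that lasts for the gap duration, and windows that overlap are merged. This is a pure-Python approximation for a single key, not the Beam API; the 30s gap and the timestamps are illustrative assumptions.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions. A new session starts when an event
    arrives `gap` or more seconds after the previous event in the session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Event falls inside the open session: extend its end.
            sessions[-1] = (sessions[-1][0], ts + gap)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append((ts, ts + gap))
    return sessions

# Hypothetical event times, in seconds, for one user.
print(session_windows([0, 10, 25, 70, 80], gap=30))
```

In a keyed stream the same logic would run separately for each key, so two users' activity never merges into one session.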
  • 17. CLOUD BIGQUERY: A fast, highly scalable, cost-effective, and fully managed enterprise data warehouse for analytics. Some of the features: • Serverless • Real-time Analytics • Standard SQL • Storage and Compute Separation • Flexible Data Ingestion • Petabyte Scale
  • 18. BIGQUERY STORAGE IS COLUMNAR: Each column (Column1, Column2, Column3) is stored separately. No indexes or keys are required.
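To make the columnar idea concrete, the toy sketch below stores each column as its own list, so a query over two columns never reads the third. It is a plain-Python illustration with made-up rows and says nothing about BigQuery's real storage format.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def select(columns, names):
    """Read only the requested columns and zip them back into tuples."""
    return list(zip(*(columns[name] for name in names)))

# Hypothetical rows using the column names from the slide.
rows = [
    {"Column1": 1, "Column2": "a", "Column3": "x"},
    {"Column1": 2, "Column2": "b", "Column3": "y"},
]
columns = to_columns(rows)
print(select(columns, ["Column1", "Column2"]))
```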
  • 19. INGESTION-TIME PARTITIONED TABLE: A table with columns Column1, Column2 and Column3 plus the pseudo-columns _PARTITIONTIME and _PARTITIONDATE. The example rows fall into three daily partitions: 2018-12-01 (2 rows), 2018-12-02 (3 rows) and 2018-12-03 (2 rows), with _PARTITIONTIME values 2018-12-01 00:00:00, 2018-12-02 00:00:00 and 2018-12-03 00:00:00. Query: SELECT Column1, Column2 FROM `database.table_name` WHERE _PARTITIONDATE >= "2018-12-01" AND _PARTITIONDATE < "2018-12-03"
  • 21. PARTITIONED TABLE: Partitioned based on data in a specified TIMESTAMP or DATE column. Here Column3 holds the dates 2018-12-01 (2 rows), 2018-12-02 (3 rows) and 2018-12-03 (2 rows). Query: SELECT Column1, Column2 FROM `database.table_name` WHERE Column3 >= "2018-12-01" AND Column3 < "2018-12-03"
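The benefit of a date filter on a partitioned table is that only the matching partitions are scanned. The sketch below imitates that with a dict keyed by date. It is plain Python, not BigQuery; the row values are invented, and only the per-day row counts (2, 3, 2) follow the slide.

```python
from datetime import date

# One list of rows per daily partition (values are hypothetical).
partitions = {
    date(2018, 12, 1): [{"Column1": "a", "Column2": 1}, {"Column1": "b", "Column2": 2}],
    date(2018, 12, 2): [{"Column1": "c", "Column2": 3}, {"Column1": "d", "Column2": 4},
                        {"Column1": "e", "Column2": 5}],
    date(2018, 12, 3): [{"Column1": "f", "Column2": 6}, {"Column1": "g", "Column2": 7}],
}

def query(partitions, start, end):
    """Select Column1 and Column2 from partitions with start <= day < end."""
    scanned = [day for day in sorted(partitions) if start <= day < end]
    rows = [(r["Column1"], r["Column2"]) for day in scanned for r in partitions[day]]
    return scanned, rows

scanned, rows = query(partitions, date(2018, 12, 1), date(2018, 12, 3))
print(f"scanned {len(scanned)} of {len(partitions)} partitions, {len(rows)} rows")
```

The 2018-12-03 partition is never touched, which mirrors why the slide's query filters on the partitioning column.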
  • 23. ARCHITECTURE – STREAMING PIPELINE [Diagram: Event ingest: third-party partner → API gateway (Cloud Load Balancing) → API (Compute Engine, autoscaling) → Cloud Pub/Sub cookie topic and device topic. Processing and analytics: stream processing (Cloud Dataflow, autoscaling) → data warehouse (BigQuery, sharding + clustering) and a Cloud Pub/Sub topic for segmented users → Python script (Compute Engine, autoscaling) → audience store (Cloud Bigtable, 3 regions). CMS & RTB engine: CMS.]
  • 24. ARCHITECTURE – EVENT INGEST: Compute Engine (GCE) runs the API code on autoscaling instances. It receives 1500 events per second from our partner. The API endpoint puts events into two separate topics: cookie and device. [Diagram: 1500 events a sec → Cloud Load Balancing → API (Compute Engine, autoscaling) → Cloud Pub/Sub cookie topic and device topic.]
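The routing step of the ingest API might look roughly like the sketch below. In-memory lists stand in for the two Pub/Sub topics, and the `id_type` field used to pick a topic is a hypothetical event attribute; the slides do not show the real event schema or the actual publishing code.

```python
TOPICS = ("cookie", "device")

def route(event):
    """Pick the topic for an event based on its (assumed) id_type field."""
    id_type = event["id_type"]
    if id_type not in TOPICS:
        raise ValueError(f"unknown id_type: {id_type!r}")
    return id_type

def ingest(events):
    """Append each event to the in-memory stand-in for its topic."""
    topics = {name: [] for name in TOPICS}
    for event in events:
        topics[route(event)].append(event)
    return topics

# Hypothetical partner events.
events = [
    {"id_type": "cookie", "id": "c-1"},
    {"id_type": "device", "id": "d-1"},
    {"id_type": "cookie", "id": "c-2"},
]
topics = ingest(events)
print({name: len(msgs) for name, msgs in topics.items()})
```

Keeping cookie and device events in separate topics lets downstream consumers subscribe to, and scale for, each stream independently.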
  • 25. ARCHITECTURE – PROCESSING AND ANALYTICS: Cloud Dataflow transforms and enriches raw events in real time, inserts the processed data into BigQuery, and also sends it to the RTB engine through Pub/Sub. Each region has a subscription that pulls data from the segment topic and inserts it into Bigtable. BigQuery is the warehouse for analytics. Tables are partitioned by ingestion time, and data is kept for 60 days. [Diagram: Cloud Pub/Sub cookie topic and device topic → stream processing (Cloud Dataflow, autoscaling) → data warehouse (BigQuery, partition + clustering) and Cloud Pub/Sub segment topic (segmented users) → Compute Engine and Cloud Bigtable in each of the Asia, JP and US regions → CMS and KNX RTB Engine.]
  • 26. ARCHITECTURE – BATCH PIPELINE: Dataflow also reads the past 30 days of data from BigQuery and reprocesses it in a batch job. [Diagram labels: Pub/Sub; BigQuery (analytics engine); Cloud Dataflow (batch processing); batch pipeline; batch loads; BigQuery (analytics engine).]
  • 27. DATAFLOW – PIPELINE VISUALIZATION