Build Your Own Event
Analytics Pipeline Using BigQuery,
Dataflow, and K8s
Aviv Laufer
Principal Reliability Engineer, DoiT International
@avivl
Google’s Premier
MSP Partner helping
startups around the
globe with cloud
engineering &
cost optimization
Autoscaling Hadoop
and Spark on top of
Google Dataproc
Opinionated Event Analytics
Pipeline built on top
of Dataflow
Park non-production
instances and save ±60% on
Google Compute Engine
Collaborate with peers and
other teams on configuration
changes in Google Cloud
The most advanced
cost-optimization platform
for Google Cloud
Where everyone starts...
Off-the-shelf event analytics
Not as flexible as we’d like
Linear cost of $/event
We don’t own the data
Flexible
Unlimited aggregations and
joins on our own data w/ BI
tool of our choice
Lower cost at scale
Cost per event should
decrease as we stream more
events to the system
Global
Short latencies for most of the
users regardless of their location
Event analytics pipeline v2.0
Architecture
Streaming
Batch
Immutable data
BigQuery
Log data
Cloud storage
Data processing
Cloud Dataflow
Async messaging
Cloud Pub/Sub
Gaming logs
Batch load
Real-Time events
Multiple platforms
Report & share
Business analysis
Kubernetes cluster
Kubernetes Engine
Events APIs
Mutable events
Cloud Bigtable
1 Event ingestion
Events API
Streaming
Async messaging
Cloud Pub/Sub
us-central1
Kubernetes cluster
Kubernetes Engine
us-cluster
us-central1-a
us-central1-f
us-central1-c
eu-west1-b
eu-west1-d
eu-west1-c
Kubernetes cluster
Kubernetes Engine
eu-cluster
HTTPS
Load balancer
us-central1
eu-west1
Additional region/s
Additional region/s
Kubernetes Federation Plane
Events API
Cloud Endpoints
Real-Time events
Multiple platforms
eu-west1
Event ingestion
Latency distribution
(95th percentile)
North America: 89ms
West Europe: 54ms
Without GKE cluster in asia-east1: 250ms
With GKE cluster in asia-east1: 61ms (75% improvement!)
Global
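The improvement quoted for asia-east1 follows from the two latency figures above. A minimal sketch of that arithmetic is shown here; the class and method names are hypothetical, and only the 250 ms and 61 ms values come from the slide.

```java
// Hypothetical helper for illustration; only the 250 ms and 61 ms figures
// are taken from the slide.
final class LatencyImprovement {
    /** Relative latency reduction, expressed as a percentage of the baseline. */
    static double percentImprovement(double beforeMs, double afterMs) {
        if (beforeMs <= 0) {
            throw new IllegalArgumentException("baseline latency must be positive");
        }
        return (beforeMs - afterMs) / beforeMs * 100.0;
    }

    public static void main(String[] args) {
        // asia-east1 clients: 250 ms without a regional GKE cluster, 61 ms with one.
        double pct = percentImprovement(250, 61);
        System.out.printf("p95 latency reduced by %.1f%%%n", pct);
    }
}
```

The result is (250 - 61) / 250, or about 75.6%, which matches the roughly 75% stated on the slide.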
Managed real-time
messaging
Google Cloud Endpoints
helps to protect and
monitor our APIs.
Authentication
Rate control
Monitoring
Events API
Cloud Endpoints
Android
Web
Endpoint
Clients
Name
Kubernetes Engine
iOS
Google Cloud Endpoints
Managed real-time
messaging
Cloud Pub/Sub delivers
each event to every
subscription at least once.
Publisher
Topic
Message
Cloud Pub/Sub Subscription
Subscriber
Pull or
push
Google Cloud Pub/Sub
Message
Ack
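Because delivery is at least once, a subscriber may see the same message more than once and should tolerate that. One common way to do so is to deduplicate on the message ID, sketched below in plain Java. This is an assumption about how a subscriber could be written, not the deck's implementation: `IdempotentHandler` is a hypothetical class, it does not use the Pub/Sub client library, and the counter stands in for the real side effect.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical subscriber-side wrapper; not part of the Pub/Sub client library.
final class IdempotentHandler {
    private final Set<String> seenMessageIds = new HashSet<>();
    private int processedCount = 0;

    /**
     * Handles one delivery. Returns true if the event was processed and
     * false if this message ID was seen before (a redelivery). In both
     * cases the caller would ack so that redelivery stops.
     */
    boolean handle(String messageId, String payload) {
        if (!seenMessageIds.add(messageId)) {
            return false; // duplicate delivery: skip the side effect
        }
        processedCount++; // stand-in for the real work done with the payload
        return true;
    }

    int processedCount() {
        return processedCount;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println(handler.handle("m-1", "{\"type\":\"login\"}")); // first delivery
        System.out.println(handler.handle("m-1", "{\"type\":\"login\"}")); // redelivery
        System.out.println(handler.processedCount());
    }
}
```

The in-memory set grows without bound, so a real subscriber would more likely rely on a bounded or persistent store, or on an idempotent write at the sink.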
2 Event processing
Filtering, aggregation and
grouping of events
Event processing w/ Apache Beam 2.x
Modern Cloud-based ETL
Open-sourced as Apache Beam
Autoscaling
Unified batch & streaming
Java & Python-based SDK
Integrated with GCP
Runs on Spark, Flink & GCP
Event processing
Group 1
Transform 1
Write
Read
Filter 1
Pub/Sub
BigQuery
Pipeline p = Pipeline.create();
p.apply(PubsubIO.read().from("…"))
 .apply(ParDo.of(new Filter1()))
 .apply(new Transform1())
 .apply(new Group1())
 .apply(BigQueryIO.write().to("…"));
p.run();
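The read, filter, transform, group, write order of the pipeline can be illustrated without a Beam dependency. The sketch below uses plain Java streams over an in-memory list as a stand-in. The slide does not say what `Filter1`, `Transform1`, or `Group1` actually do, so the behavior here (drop empty events, normalize the type name, count per type) is purely an assumption, and `PipelineShape` is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// In-memory analogue of the pipeline stages; the stage semantics are assumed.
final class PipelineShape {
    /** Filter, then transform, then group: the same stage order as the pipeline. */
    static Map<String, Long> countByType(List<String> rawEventTypes) {
        return rawEventTypes.stream()
            // "Filter 1" (assumed): drop null or blank events
            .filter(e -> e != null && !e.trim().isEmpty())
            // "Transform 1" (assumed): normalize the event type
            .map(e -> e.trim().toLowerCase(Locale.ROOT))
            // "Group 1" (assumed): count events per type, sorted by type
            .collect(Collectors.groupingBy(e -> e, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Login", " login ", "", "purchase");
        System.out.println(countByType(input));
    }
}
```

In Beam the same shape is expressed as transforms applied to a `PCollection`, with the runner handling distribution and streaming.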
Event processing
Event time based windows
Event time: 10:00 11:00 12:00 13:00 14:00 15:00
Processing time: 10:00 11:00 12:00 13:00 14:00 15:00
Input
Output
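Event-time windowing assigns each event to a window by the timestamp at which it occurred, not by when it reached the pipeline. A minimal sketch of fixed (tumbling) window assignment in plain Java follows. It is not Beam code: `EventTimeWindows` and its methods are hypothetical names, the one-hour window size is an assumption taken from the hourly axis on the slide, and watermarks and late-data handling are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of fixed event-time windows; not the Beam API.
final class EventTimeWindows {
    /** Start of the fixed window (aligned to the epoch) containing the event. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    /** Counts events per window, keyed by window start, whatever the arrival order. */
    static Map<Instant, Integer> countPerWindow(List<Instant> eventTimes, Duration size) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Instant t : eventTimes) {
            counts.merge(windowStart(t, size), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The third event arrives last but belongs to the first window.
        List<Instant> arrivals = Arrays.asList(
            Instant.parse("2018-01-01T10:05:00Z"),
            Instant.parse("2018-01-01T11:59:00Z"),
            Instant.parse("2018-01-01T10:30:00Z"));
        System.out.println(countPerWindow(arrivals, Duration.ofHours(1)));
    }
}
```

The list order plays the role of processing time: the 10:30 event is processed after the 11:59 event, yet it is still counted in the 10:00 window.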
Event processing
Cloud Dataflow 2.6
Dynamic
destinations
Automatic
schema
detection
Shuffle service
Column-based
partitioning
Data ingest
Async messaging
Cloud Pub/Sub
Immutable data
BigQuery
Mutable events
Cloud Bigtable
Dataflow/Beam
Cloud Dataflow
Relocate
Dataflow/Beam
Cloud Dataflow
Mutable data
The life of an event
Some data
may change.
Some events
are immutable.
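The split between mutable and immutable events amounts to a routing decision: events that may still change go to Cloud Bigtable, and events that will not change go to BigQuery. A minimal sketch of that decision follows. How the real pipeline decides mutability is not stated on the slide, so the set of mutable event types, the example type names, and the `EventRouter` class are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical router; the rule for deciding mutability is an assumption.
final class EventRouter {
    enum Sink { BIGQUERY, BIGTABLE }

    private final Set<String> mutableEventTypes;

    EventRouter(Set<String> mutableEventTypes) {
        this.mutableEventTypes = Collections.unmodifiableSet(new HashSet<>(mutableEventTypes));
    }

    /** Mutable events go to Bigtable; everything else is treated as immutable. */
    Sink route(String eventType) {
        return mutableEventTypes.contains(eventType) ? Sink.BIGTABLE : Sink.BIGQUERY;
    }

    public static void main(String[] args) {
        // "player_profile" and "level_completed" are made-up event types.
        EventRouter router = new EventRouter(new HashSet<>(Arrays.asList("player_profile")));
        System.out.println(router.route("player_profile"));
        System.out.println(router.route("level_completed"));
    }
}
```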
3 Event analytics
Analyzing billions of events at scale
SQL:2011 compliant
Petabit network
Highly available cluster compute (Dremel)
Streaming ingest
Free bulk loading
Replicated, distributed storage (99.9999999999% durability)
REST API
Client libraries in 10 languages
Web UI, CLI
Distributed memory shuffle tier
Event analytics with Google BigQuery
BigQuery
Benefits
● Improve onboarding experience
● Fast release cycle
● Identify our most valuable users
● Improve KPIs
4 Cost analytics
Designed for low total cost
of ownership
Cost analytics w/ reOptimize.io
Cost analytics w/ reOptimize.io
1.3B
3.3B
6.0B
2 wk
Planning / MVP Coding
1 wk
Testing
1 wk
Launching
1 wk
Project duration
Open sourcing Banias
Opinionated serverless event analytics pipeline
github.com/doitintl/banias
Deployable in just 1 hour
Elastic schemas
References
Suggested reading
Building a Mobile Gaming Analytics Platform - a Reference Architecture
How to handle mutating JSON schemas in a streaming pipeline
Google Cloud Analytics with reoptimize.io
github.com/doitintl/banias
blog.doit-intl.com
Q&A
Thank you

Build your own event analytics pipeline using BigQuery, Dataflow, and k8s. JellyButton case study.
